Roles Database Data Feed Programs

Last modified 6/04/2010

Connecting to the Roles production or test server

When maintaining or debugging data feed programs, connect to either roles.mit.edu or roles-test.mit.edu as user rolesdb.

The crontab file

On the production Roles server (roles.mit.edu, aka cloverleaf) and the test Roles server (roles-test.mit.edu, aka parsley), there are several sets of data feed jobs that are automatically run each day. The overall schedule can be found in the crontab file in the ~rolesdb/cronjobs directory (crontab.cloverleaf on production, crontab.parsley on test).
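
Entries in these crontab files use the standard five-field cron syntax (minute, hour, day of month, month, day of week, then the command). A hypothetical entry, with made-up times and paths, might look like this:

    # min hour day month weekday  command
    30 5 * * *  /home/rolesdb/cronjobs/morning_jobs_early
    0 3 * * 6   /home/rolesdb/cronjobs/weekend_jobs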

If you ever need to change the crontab file, do the following:

  1. Connect to the server machine using telnet or SSH as user rolesdb.
  2. Go to the cronjobs directory
    cd cronjobs
  3. Check out the crontab file from RCS using the alias "checkout"
    checkout crontab.cloverleaf
    (or)
    checkout crontab.parsley
  4. Use emacs or vi to make the desired changes
  5. Check in the new crontab file into RCS using the alias "checkin"
    checkin crontab.cloverleaf
    (or)
    checkin crontab.parsley
  6. *** Don't forget this step! *** Install the new file by running the crontab command.
    crontab crontab.cloverleaf
    (or)
    crontab crontab.parsley

You can display the currently installed crontab with the command crontab -l.
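
Putting the steps together, a complete session on the production server might look like the following. This assumes the checkout and checkin aliases wrap the usual RCS co -l and ci -u commands; check the alias definitions if in doubt:

    cd ~/cronjobs
    checkout crontab.cloverleaf     # lock and check out the working file from RCS
    vi crontab.cloverleaf           # make the desired changes
    checkin crontab.cloverleaf      # check the new version into RCS
    crontab crontab.cloverleaf      # install the new schedule -- do not skip this
    crontab -l                      # verify what cron will actually run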

Shell scripts run by the crontab file

Each crontab file runs several shell scripts that, in turn, run individual programs. These high-level shell scripts include:

morning_jobs_early

- Extracts most data from the Warehouse, including most types of Qualifiers, and does some processing for the Master Department Hierarchy
(Runs in the early morning)

morning_jobs_late

- Runs some steps that depend on PERSON data from the Warehouse, including (a) loading the PERSON table from krb_person@warehouse, (b) loading EHS-related Room Set data from the Warehouse into RSET qualifiers, and (c) processing externally derived authorizations (table EXTERNAL_AUTH).
(Runs in the morning, not quite so early)

cron_run_exception_notify

- Generates Email about Authorizations that still exist for people with deactivated Kerberos usernames
(Runs after morning_jobs_late)

weekend_jobs

- Runs a procedure that performs an Oracle ANALYZE on all the tables in the Roles DB.
(Runs each Saturday)

hourly_jobs

- Currently only runs once a day, not hourly. Updates derived database tables ("shadow" tables) for the Master Department Hierarchy

cron_roles_cleanup_archive

- Cleans up some old files in the archive directory

cron_run_sapfeed

- Runs a job to create several daily files about SAP-related authorizations and uses the scp command to copy them to a directory on one of the SAP servers, where they are picked up, further processed, and loaded into the appropriate SAP objects. The files built and sent are incremental: they include authorization information only for those people whose "expanded" authorizations (after including all child qualifiers) have changed since the previous run.
(Runs once each morning, only on the production Roles server)
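
The overall shape of this job is roughly as follows. The script, file, and host names below are hypothetical; the real ones live under ~rolesdb/xsap_feed (see the Directories section):

    # Hypothetical sketch only -- actual names are under ~/xsap_feed
    perl ~/xsap_feed/bin/build_sap_auth_files.pl    # write incremental files to ~/xsap_feed/data
    scp ~/xsap_feed/data/*.dat sap-server:/path/to/dropbox/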

cron_pdorg_prog.sh

- Compares SAP-related approver authorizations in the Roles DB with the parallel information in the PD Org structures in SAP, and generates a file indicating the differences.
(Runs once each morning)

cron_pddiff_feed.sh

- Sends the file pdorg_roles.compare, generated by cron_pdorg_prog.sh, to the SAP dropbox
(Runs each morning after cron_pdorg_prog.sh)

cron_ehs_extract

- Runs ~rolesdb/bin/ehs/run_ehs_role_prog.pl to compare DLC-level EHS roles (e.g., DEPARTMENTAL EHS COORDINATOR) to their equivalent Authorizations in the Roles DB (which is the system of record), generates a differences file, and sends it to the SAP dropbox so the changes can be applied.
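
Since the script path is given above, the comparison can also be re-run by hand. Whether the script takes arguments is not documented here, so check the script itself (or the cron_ehs_extract wrapper) first:

    cd ~/bin/ehs
    perl run_ehs_role_prog.pl       # arguments, if any, unverified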

Directories

~rolesdb/archive
    Some compressed historical files from previous days' data feed runs

~rolesdb/bin
    Generic data feed perl scripts and other program files

~rolesdb/bin/ehs
    EHS-related data feed programs

~rolesdb/bin/extract
    Programs related to outgoing data for DACCA, LDS (SAP component being phased out), etc.

~rolesdb/bin/pdorg
    Programs related to outgoing data for updating PD Org entries in SAP related to APPROVER authorizations

~rolesdb/bin/repa_feed
    Temporary or test versions of programs

~rolesdb/bin/roles_feed
    Most data feed programs for data coming into the Roles DB

~rolesdb/data
    Data files used by data feed programs. Most data files are temporary, but some, such as roles_person_extra.dat, are permanent.

~rolesdb/doc
    Miscellaneous notes and documentation

~rolesdb/extract
    Empty

~rolesdb/lib
    A few generic perl modules and some config files

~rolesdb/log
    Most recent log files from data feed programs

~rolesdb/misc
    Miscellaneous notes and working files

~rolesdb/sap_feed
    Obsolete versions of Roles->SAP data feed programs

~rolesdb/sql
    SQL source files for creating tables, views, and stored procedures. Files for creating tables (new_schema*.sql) are preserved for documentation purposes and should NOT be rerun -- tables should never be dropped and recreated, since we do not want to lose the data. Files for creating stored procedures and views can be modified and rerun.

~rolesdb/sql/frequently_run_scripts
    Special SQL scripts that are run periodically, e.g., to analyze tables

~rolesdb/xsap_feed/bin
    Programs for the Roles->SAP data feed

~rolesdb/xsap_feed/config
    Config files for the Roles->SAP data feed

~rolesdb/xsap_feed/data
    Nightly data files for the Roles->SAP data feed

Extract, Prepare, and Load steps

Most data feed programs are perl modules for maintaining one type of data in Roles DB tables, such as people in the PERSON table or one type of Qualifier (e.g., Funds/Funds Centers) in the QUALIFIER table. Each perl module has a separate subroutine for each of the Extract, Prepare, and Load steps.

The steps do the following:

Extract

Extract a full set of data for a particular type of object from the external source, generally the Data Warehouse. The data are written to a flat file in the ~/data directory, generally a file with a name ending in ".warehouse". The "warehouse" suffix is used even if the source of the data is something other than the Warehouse.

Prepare

(a) Select a full set of parallel data for a particular type of object from Roles DB tables into a flat file in the ~/data directory with a name ending in ".roles".
(b) Compare the data from the Extract and Prepare steps and produce a flat file "*.actions" listing actions to be applied to the Roles DB tables to synchronize them with the source data.

Load

Apply the actions from the Prepare step's "*.actions" file to actually update the data in the Roles DB tables.
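
Each step leaves its intermediate file behind in the ~/data directory, which is useful when debugging. For a hypothetical feed type named "funds", you would expect to find files like these:

    ~/data/funds.warehouse    # full extract from the source (Extract step)
    ~/data/funds.roles        # parallel data selected from the Roles DB (Prepare step)
    ~/data/funds.actions      # actions to apply to the Roles DB (Prepare/Load steps)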

Running Extract, Prepare, and Load steps by hand

Normally, the 3 steps of most data feed processes are run automatically by shell scripts (~/cronjobs/morning_jobs_early, ~/cronjobs/morning_jobs_late, ~/cronjobs/evening_jobs, etc.)

However, when debugging or correcting a problem, it is possible to run any or all of the steps manually and check the results after each step. It is often useful to follow this sequence:

  1. Run the Extract step to see if errors occur, and if not, look at the *.warehouse file in the ~/data directory to examine the intermediate results.
  2. Run the Prepare step to produce a "*.actions" file, and look at the resulting file. You can also use an editor to modify lines in the actions file, e.g., to remove one or more problematic lines that are preventing the Load step from running properly.
  3. Run the Load step to apply the modified set of actions.

The most common problem with nightly feeds is that there may be too many changes since the previous night, exceeding the max-actions setting for the particular object type. We will describe techniques for dealing with this below.

Example 1:
To run the steps of the EHS PIs, Room Sets, and Rooms feed program by hand on the production Roles database, connect to roles.mit.edu as user rolesdb and run the individual steps, as sketched below.
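
The exact program names for this feed are not listed in this document, so the session below is a hypothetical sketch of the general pattern; the real programs live under ~rolesdb/bin/ehs:

    cd ~/bin/ehs
    # Hypothetical script name and step arguments -- check the directory for the real ones
    perl ehs_feed.pl extract        # writes the ~/data/*.warehouse file
    perl ehs_feed.pl prepare        # writes the ~/data/*.roles and *.actions files
    less ~/data/*.actions           # review (and, if necessary, edit) the pending actions
    perl ehs_feed.pl load           # apply the actions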

Example 2:
To run the steps of the person feed program by hand on the production Roles database, connect to roles.mit.edu as user rolesdb and run the individual steps, as sketched below.
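
As with Example 1, the session below is a hypothetical sketch; the incoming feed programs, including the person feed, live under ~rolesdb/bin/roles_feed:

    cd ~/bin/roles_feed
    # Hypothetical script name and step arguments -- check the directory for the real ones
    perl person_feed.pl extract     # pulls from krb_person@warehouse into ~/data (file name assumed)
    perl person_feed.pl prepare     # builds the parallel .roles file and the .actions file
    perl person_feed.pl load        # applies the actions to the PERSON table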

Notification Email

There are three different ways that notification Email gets generated by the Roles DB nightly feed programs.

  1. Messages about Authorizations for inactive Kerberos usernames

If an Authorization exists where the person (Kerberos_name column) represents a username that is no longer active (no longer included in the KRB_PERSON table from the Warehouse), then notification Email will be sent.

The people who should be notified are specified by setting up authorizations in the Roles Database where the function is "NOTIFICATION - INACTIVE USERS" and the qualifier represents the function_category (application area) for which the person should receive notification.

The program ~/bin/roles_feed/exception_mail.pl finds all categories (column function_category in the authorization table) for which authorizations exist for deactivated Kerberos_names. The program then sends Email to the appropriate recipients (based on NOTIFICATION - INACTIVE USERS authorizations), one piece of Email per recipient for each function_category where there are one or more authorizations for inactive Kerberos_names. The Email lists the inactive usernames and the number of authorizations in the category that should be deleted or reassigned.
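
If these notifications need to be regenerated outside the nightly schedule, the program can be run by hand. This assumes it needs no arguments; check the script or the cron_run_exception_notify wrapper to be sure:

    perl ~/bin/roles_feed/exception_mail.pl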

2. Errors detected by various data feed programs

Some errors detected by various data feed programs, usually in the LOAD step, result in Email being sent to a list of Email addresses stored in the file ~/lib/roles_notify. Currently, the addresses on the list are warehouse@mit.edu and repa@mit.edu.

3. Full log files sent from various data feed programs

The full log from the LOAD step of most data feed programs, and the full log from data feed programs that do not have a separate LOAD step, are sent to one or more Email recipients. Within the cronjobs directory, the scripts morning_jobs_early, morning_jobs_late, evening_jobs, and weekend_jobs include the steps that send out this Email. Currently, repa@mit.edu is the only recipient.

These log files do not need to be examined every day. It is useful to examine them periodically for warning messages, and to have them available if a problem is detected. However, the Email sent for detected errors is usually sufficient for catching problems.

Data feed errors that can occur

The most common errors are:

  o Too many changes since the previous night, exceeding the max-actions
    setting for the particular object type (see the discussion above).

    Solution: Examine the data changes (see the section "Running Extract,
    Prepare, and Load steps by hand"). Determine whether the changes are
    legitimate or due to source data or system problems. If the changes
    are legitimate, increase the appropriate value in the ROLES_PARAMETERS
    table (see the section "Adjusting max-actions per day in the
    ROLES_PARAMETERS table" below in this document), and either rerun the
    3 data feed steps by hand or wait for tomorrow morning's cronjob.

  o The EXTRACT step failed due to a network or server problem, and there
    was no data for the PREPARE and LOAD steps.

    Solution: Once the source is reachable again, rerun the Extract,
    Prepare, and Load steps by hand (see the section above), or wait for
    the next scheduled run.

Adjusting max-actions per day in the ROLES_PARAMETERS table

From rolesweb.mit.edu, click on "System Administrator tools", then click on "Update Roles DB parameters for data feeds and other processes".

Other Roles system administrators' documentation

See rolesweb.mit.edu/sys_admin_tasks.html for documentation on various system administrator tasks.