Mentors: Bill Rideout and Bob Schaefer

Developments in computers and electronics are revolutionizing the field of radio science. In the past, big parabolic dishes would point in a single direction in space, recording data from only a very narrow frequency band, because of the limitations of electronics. Recent advances have allowed radio science instruments to become vastly more powerful, both in computing power and in the ability of electronics to convert large sections of the radio frequency band directly to digital data. At MIT, we are involved with a new radio science instrument called RAPID, where a single parabolic dish is replaced by a large number of inexpensive antennas, each independently recording a large swath of the radio frequency band. With this approach, the antenna array is in effect pointing everywhere at once and listening to many frequencies at once. But with this large amount of data comes a challenging problem - how do we efficiently process all this data to find the science it contains?

In this project, two students will investigate a software tool to help us deal with this data. Apache OODT (Object Oriented Data Technology) is an open source project originally developed by engineers at NASA's Jet Propulsion Laboratory. Apache OODT is tied to the python programming language, which the students will also learn in order to fully test its features. In the first part of this project, the students will work together to install and understand how to use OODT in combination with python. In the second part, each student will test using OODT to solve a particular data processing problem. The goal of this project is to understand to what degree Apache OODT can help us solve the big data problem associated with RAPID, and perhaps in the process to also provide feedback to the developers of Apache OODT on any issues discovered.

Task list

  • Learn python

  • Learn Eclipse

  • Learn the basics of XML. Try this tutorial or search for an XML tutorial.

  • Learn about the RAPID project (see overview)

  • Learn the OODT python module

  • Solve a test problem using OODT

  • Possibly compare to solving the same problem without OODT

Learning OODT

Joining the OODT mailing list

Sign up to the user@ and dev@ mailing lists. To do that, send a blank email to each list's subscribe address:

  • user-subscribe@oodt.apache.org

  • dev-subscribe@oodt.apache.org

The general approach will be to ask me about a problem first - if I can't answer it, send an email to the user@oodt.apache.org mailing list. We will only use the dev@oodt.apache.org mailing list if we have suggestions for changing the code.

Installation of OODT on your local machines

  • I installed OODT as best I could on your local machines under /home/pre-col[1,3]/brideout/OODT/OODT-0.5. You should have write permission for that entire directory, because you will ultimately be testing your code by adding it there.

To test the installation, go to http://localhost:8080/my-curator on your browser.

Starting up OODT if you reboot

  • cd /home/pre-col[1,3]/brideout/OODT/OODT-0.5/apache-tomcat-6.0.37/bin

  • ./startup.sh

Projects

Your projects will be to add new metadata parsers to the web example. Basically you will be writing python scripts that get metadata from files, and convert it into an xml file, just like in the CAS Curation Guide. You can test it outside OODT by giving your script one argument - the filename to parse. But ultimately the goal is to make your code run inside OODT, just like the mp3 example.

Your python script will always take one argument: the full path to the input file. It will always write an output xml file named the same as the input file plus the extension .met .

Here's the xml produced by the example for an mp3 file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <keyval type="vector">
    <key>FileLocation</key>
    <val>%2Fhome%2Fpre-col1%2Fbrideout%2FOODT%2FOODT-0.5%2Fstaging%2Fproducts%2Fmp3%2F</val>
  </keyval>
  <keyval type="vector">
    <key>ProductType</key>
    <val>MP3</val>
  </keyval>
  <keyval type="vector">
    <key>Filename</key>
    <val>Bach-SuiteNo2.mp3</val>
  </keyval>
</cas:metadata>

 

Your xml file will follow the same pattern, but will have more than just the keys in the example above. The three keys above (FileLocation, ProductType, and Filename) are required.

Note that the mp3 file was: /home/pre-col1/brideout/OODT/OODT-0.5/staging/products/mp3/Bach-SuiteNo2.mp3

FileLocation is the directory the file is stored in, and Filename is the base name of the file. Use the python module os.path to get the directory name and base name of a file.

Note also that the character / was replaced with %2F in the xml file above. This is URL (percent) encoding - %2F is the encoded form of / - which OODT uses for the FileLocation value. Use the python string replace method to do this in your code.

Here's a simplified version of the code used in that demo:

 

"""This is a simple demo script that follows the rules for extracting OODT metadata.

It is based on the MP3 example in 
http://oodt.apache.org/components/maven/curator/user/basic.html
but hard-coded to be even simpler
"""

import os, os.path, sys

usage = 'oodtDemo.py <filename>'

if len(sys.argv) != 2:
    raise ValueError('wrong number of arguments; usage is <%s>' % (usage))

if not os.access(sys.argv[1], os.R_OK):
    raise IOError('file %s not found' % (sys.argv[1]))

# get some basic info about the file
fullPath = sys.argv[1]
fileName = os.path.basename(fullPath)
fileLocation = os.path.dirname(fullPath)
productType = "MP3"

# replace / with %2F
fileLocation = fileLocation.replace('/', '%2F')

template = """<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <keyval type="vector">
    <key>FileLocation</key>
    <val>%s</val>
  </keyval>
  <keyval type="vector">
    <key>ProductType</key>
    <val>%s</val>
  </keyval>
  <keyval type="vector">
    <key>Filename</key>
    <val>%s</val>
  </keyval>
</cas:metadata>"""

with open(fileName + '.met', 'w') as f:
    f.write(template % (fileLocation, productType, fileName))

 

Project 1: Analyzing configuration files: Kevin Matos Salgado

In this project you will parse a configuration file. You will dynamically determine the fields to create, and generate the proper xml file.

A configuration file has the form:

[Madrigal]

# top-level Madrigal documents and scripts are copied
# to the following absolute directories
MADSERVERDOCABS = /usr/local/apache2/htdocs/madrigal
MADSERVERCGIABS = /usr/local/apache2/cgi-bin/madrigal

 

Lines either begin with a #, in which case they are ignored, or with a name such as MADSERVERCGIABS. That name becomes a new key, and the part after the = is the val. You can use the python configparser module (named ConfigParser in python 2) to parse the file.

For the output xml file, the key will be the section name and the option name separated by a colon, for example: Madrigal:madserverdocabs. Here's the expected xml file this example should produce if it was called test.ini:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>  
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <keyval type="vector">
    <key>FileLocation</key>
    <val>%2FUsers%2Fbrideout%2FDocuments%2Fworkspace%2FexamplePython</val>
  </keyval>
  <keyval type="vector">
    <key>ProductType</key>
    <val>Config file</val>
  </keyval>
  <keyval type="vector">
    <key>Filename</key>
    <val>test.ini</val>
  </keyval>
  <keyval type="vector">
    <key>Madrigal:madserverdocabs</key>
    <val>/usr/local/apache2/htdocs/madrigal</val>
  </keyval>
  <keyval type="vector">
    <key>Madrigal:madservercgiabs</key>
    <val>/usr/local/apache2/cgi-bin/madrigal</val>
  </keyval>

</cas:metadata>
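The parsing step above can be sketched as follows, using Python 3's standard-library configparser module (named ConfigParser in Python 2). The function name and the colon-separated key format follow the example on this page; note that configparser lower-cases option names, which matches the Madrigal:madserverdocabs style shown above.

```python
# Sketch of the config-file parsing step, assuming Python 3's
# standard-library configparser module.
import configparser

def config_metadata(fullPath):
    """Return a list of (key, value) pairs like ('Madrigal:madserverdocabs', ...)."""
    parser = configparser.ConfigParser()
    parser.read(fullPath)
    pairs = []
    for section in parser.sections():
        for option in parser.options(section):
            # configparser lower-cases option names, so MADSERVERDOCABS
            # becomes madserverdocabs, matching the expected xml above
            pairs.append(('%s:%s' % (section, option),
                          parser.get(section, option)))
    return pairs
```

You would then feed these pairs into the same xml template used in the demo script, after the three required keys.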
Optional addition

In the python page we ended by analyzing a file with columns of data, where we came up with medians, averages, minimums and maximums. In this project, we will do the same analysis to a data file with column data. You will have keys like the following:

  • column_1_average

  • column_1_minimum

  • column_1_median

  • column_1_maximum

  • column_2_average

  • column_2_minimum

  • column_2_median

  • column_2_maximum

and so on with all the columns.
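The per-column statistics could be computed with a sketch like the one below. It assumes a plain text file of whitespace-separated numeric columns (adapt the parsing to your actual data file) and uses the standard-library statistics module for the median.

```python
# Sketch of per-column statistics for a whitespace-separated text file
# of numeric columns.  The file layout is an assumption - adapt the
# line parsing to your real data file.
import statistics

def column_stats(fullPath):
    """Return a dict with keys like column_1_average, column_1_minimum, ..."""
    columns = []
    with open(fullPath) as f:
        for line in f:
            values = [float(v) for v in line.split()]
            if not values:
                continue
            # grow the list of columns on the first data line
            while len(columns) < len(values):
                columns.append([])
            for i, v in enumerate(values):
                columns[i].append(v)
    stats = {}
    for i, col in enumerate(columns, start=1):
        stats['column_%d_average' % i] = sum(col) / len(col)
        stats['column_%d_minimum' % i] = min(col)
        stats['column_%d_median' % i] = statistics.median(col)
        stats['column_%d_maximum' % i] = max(col)
    return stats
```

Each key/value pair in the returned dict then becomes one keyval block in the output xml file.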

Project 2: Madrigal Hdf5 data file: Ricardo J. Rodriguez Garcia

In this project you will open a Madrigal Hdf5 data file. You will dynamically determine the parameters it contains, and the date range of the data. You will use the python h5py module to get information from the binary Hdf5 file. You can download an example Hdf5 file here. If you want to browse inside the Hdf5 file, run /home/pre-col[1,3]/brideout/bin/hdfview, and then open the Hdf5 file.

The key/value pairs you will be using are:

  • parameters: a comma separated list of the parameters in the Hdf5 file

  • startDate: a time in the form "2013-01-12 00:05:59"

  • endDate: a time in the same form
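These three values can be extracted with a sketch like the one below. The dataset path 'Data/Table Layout' and the UT1_UNIX/UT2_UNIX column names are assumptions based on the parameter list below - check the real layout of your file with hdfview first and adjust as needed.

```python
# Sketch using the h5py module.  The dataset path 'Data/Table Layout'
# and the UT1_UNIX/UT2_UNIX column names are assumptions - verify them
# against the real file with hdfview.
import datetime
import h5py

def _format_time(ts):
    """Convert a unix timestamp to the '2013-01-12 00:05:59' form."""
    dt = datetime.datetime.fromtimestamp(float(ts), datetime.timezone.utc)
    return dt.strftime('%Y-%m-%d %H:%M:%S')

def hdf5_metadata(fullPath):
    with h5py.File(fullPath, 'r') as f:
        table = f['Data/Table Layout']
        # the column names of the table are the parameters
        parameters = ','.join(table.dtype.names)
        startDate = _format_time(table['UT1_UNIX'][0])
        endDate = _format_time(table['UT2_UNIX'][-1])
    return {'parameters': parameters,
            'startDate': startDate,
            'endDate': endDate}
```
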

Here's the expected xml output from the example Hdf5 file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>  
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <keyval type="vector">
    <key>FileLocation</key>
    <val>%2FUsers%2Fbrideout%2FDocuments%2Fworkspace%2FexamplePython</val>
  </keyval>
  <keyval type="vector">
    <key>ProductType</key>
    <val>Hdf5 file</val>
  </keyval>
  <keyval type="vector">
    <key>Filename</key>
    <val>mlh130112g.001</val>
  </keyval>
  <keyval type="vector">
    <key>parameters</key>
<val>YEAR,MONTH,DAY,HOUR,MIN,SEC,UT1_UNIX,UT2_UNIX,RECNO,RANGE,AZ1,AZ2,EL1,EL2,PL,SYSTMP,PNRMD,POWER,MDTYP,PULF,DTAU,IPP,TFREQ,VTX,DVTX,SCNTYP,CYCN,POSN,MRESL,SNP3,WCHSQ,GFIT,FPI_DATAQUAL,TI,DTI,TR,DTR,POPL,DPOPL,PH+,DPH+,FA,DFA,CO,DCO,PM,DPM,VO,DVO,VDOPP,DVDOPP,TIBF,DTIBF,TRBF,DTRBF,FIT_TYPE,CCTITR,CCTIPH,CCTICO,CCTRPH,GDLAT,GLON,GDALT,NE,DNE</val>
  </keyval>
  <keyval type="vector">
    <key>startDate</key>
    <val>2013-01-12 00:05:59</val>
  </keyval>
<keyval type="vector">
    <key>endDate</key>
    <val>2013-01-12 23:17:49</val>
  </keyval>

</cas:metadata>

 

Other optional metadata that can be added:

  • For each parameter, give the maximum. The key would be <parameter name>_maximum. Here's an example (the number is just a guess):

    <keyval type="vector"> 
        <key>range_maximum</key>
        <val>1200.0</val>
      </keyval>
  • For each parameter, give the minimum. The key would be <parameter name>_minimum. Here's an example (the number is just a guess):

    <keyval type="vector"> 
        <key>range_minimum</key>
        <val>80.0</val>
      </keyval>
  • For each parameter, give the average. The key would be <parameter name>_average. Here's an example (the number is just a guess):

    <keyval type="vector"> 
        <key>range_average</key>
        <val>500.0</val>
      </keyval>
  • For each parameter, give the median. The key would be <parameter name>_median. Here's an example (the number is just a guess):

    <keyval type="vector"> 
        <key>range_median</key>
        <val>520.0</val>
      </keyval>
  • Make every row in Metadata/Experiment Parameters a metadata field, instead of just startDate and endDate. The key is the "name" column, and the value is the "value" column.
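The per-parameter keyval blocks above could be generated with a small loop like this sketch; the stats dictionary contents here are made up for illustration.

```python
# Sketch of turning a dict of metadata (key -> value) into the keyval
# blocks shown above.  The example keys and numbers are illustrative.
keyvalTemplate = """  <keyval type="vector">
    <key>%s</key>
    <val>%s</val>
  </keyval>"""

def keyval_blocks(stats):
    """Return one keyval xml block per (key, value) pair, sorted by key."""
    return '\n'.join(keyvalTemplate % (key, value)
                     for key, value in sorted(stats.items()))
```

The resulting string is pasted between the three required keyval blocks and the closing </cas:metadata> tag.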

Optional python code analysis

If you have time, you can write an entirely separate python script that will create metadata from looking at python files themselves. I will give you only the high level outline of how to do this, and you can fill in the details.

The first step is to find a tool that analyzes python code. One example is pylint, but you can look at a few and decide which you like best. You will need to download these tools and install them on your computer to test them. Ask Ching if you need help.

Once you have chosen a tool, you will need to decide what metadata keys you want to produce, based on what your tool outputs.

Finally, you will need to write your python script to read an input python file and produce the correct .met file.
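As a minimal stdlib-only sketch of this idea, the standard-library ast module can pull simple facts out of a python file without installing anything; a real tool like pylint would give you far richer metrics. The metadata keys below are just illustrative choices.

```python
# Sketch of extracting simple metadata from a python source file using
# the standard-library ast module (a stand-in for a fuller tool such as
# pylint).  The metadata keys chosen here are illustrative.
import ast

def python_metadata(fullPath):
    with open(fullPath) as f:
        tree = ast.parse(f.read())
    functions = [n.name for n in ast.walk(tree)
                 if isinstance(n, ast.FunctionDef)]
    classes = [n.name for n in ast.walk(tree)
               if isinstance(n, ast.ClassDef)]
    return {'function_count': str(len(functions)),
            'class_count': str(len(classes)),
            'function_names': ','.join(functions)}
```
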
