IMW Overview
What is a Dataset?
A dataset is information about some topic organized in a tabular
fashion (rows and columns).[1] The dataset might be stored in a variety
of formats which don’t appear to be tabular (XML, YAML, etc.) but it
can always be “thought of” as tabular.
Datasets come to a group or an institution from a variety of sources, and the format and structure of the datasets from these various sources are seldom the same. Processing a dataset to extract only what’s needed and transform it into a usable format is not a difficult task, but it is an annoying one, fit for a chimp and not a real researcher. Dataset processing should be an automated task, and that’s where IMW comes in.
The IMW tool can process many different datasets and requires only
that each dataset have a unique identifier (a descriptive name works
quite well). Datasets at a particular installation can be placed into
an ontology of categories if desired though this is completely
optional.
The IMW Workflow
The processing of a dataset can be split into four distinct stages which must be followed in order. These stages define a workflow for data processing and IMW can be viewed as a tool for automating this workflow across a collection of datasets. The stages, in order, are:
- Obtaining the dataset
- Parsing the dataset into some intermediate format
- Cleaning and normalizing the dataset
- Packaging the dataset into some final format
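The strict ordering of these stages can be sketched in a few lines of Ruby. This is only an illustration, not IMW's actual API: the stage names (obtain, parse, clean, package), the run_workflow helper, and the example URL are all hypothetical, and the "download" is a hard-coded string so the sketch runs offline.

```ruby
# Hypothetical sketch of the four-stage workflow; none of these names are IMW's API.
STAGES = %i[obtain parse clean package].freeze

# Run every stage in order, feeding each stage's output into the next one.
def run_workflow(handlers, input)
  missing = STAGES - handlers.keys
  raise ArgumentError, "missing stages: #{missing.join(', ')}" unless missing.empty?

  STAGES.reduce(input) { |data, stage| handlers.fetch(stage).call(data) }
end

handlers = {
  # Stand-in for a real download: returns fixed CSV text regardless of the URL.
  obtain:  ->(_url) { "id,name\n1, Ada \n2, Grace \n" },
  # Intermediate format: an array of string fields per row, header dropped.
  parse:   ->(text) { text.lines.drop(1).map { |line| line.chomp.split(',') } },
  # Normalize types and strip stray whitespace.
  clean:   ->(rows) { rows.map { |id, name| [Integer(id), name.strip] } },
  # Final format: one hash per record.
  package: ->(rows) { rows.map { |id, name| { id: id, name: name } } }
}

result = run_workflow(handlers, 'http://example.com/people.csv')
```

Because the stages are folded in a fixed order, a dataset cannot be cleaned before it has been parsed, and a missing stage is reported before any work starts.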
IMW works by reading simple directives for each one of these steps and performing the necessary actions. We follow Ruby’s convention-over-configuration principle: the user specifies only the directives which are peculiar to the particular dataset being processed, and IMW tries to guess sensible defaults everywhere else.
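One way to picture this convention-over-configuration behaviour is a table of per-stage defaults that user directives override key by key. The sketch below rests on assumptions: the directive keys, the default values, and the resolve_directives helper are invented for illustration and are not IMW's real directive names.

```ruby
# Hypothetical defaults; these keys are illustrative, not IMW's real directives.
DEFAULT_DIRECTIVES = {
  'obtain'  => { 'method' => 'http' },
  'parse'   => { 'format' => 'csv', 'delimiter' => ',' },
  'clean'   => { 'strip_whitespace' => true },
  'package' => { 'format' => 'yaml' }
}.freeze

# Merge user directives over the defaults, one stage at a time, so that
# overriding a single key leaves the stage's other defaults in place.
def resolve_directives(user_directives)
  DEFAULT_DIRECTIVES.merge(user_directives) do |_stage, defaults, overrides|
    defaults.merge(overrides)
  end
end

# The user states only what is peculiar to this dataset: it is tab-delimited.
resolved = resolve_directives({ 'parse' => { 'delimiter' => "\t" } })
```

Here the user supplies a single directive and every other setting falls back to its default.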
During execution of the workflow, intermediate copies of the dataset are stored in directories reserved for the dataset. In stage order, these directories default to ripd/dataset_source_url, rawd/category/subcat/dataset, fixd/category/subcat/dataset, and pkgd/category/subcat/dataset. The category hierarchy is (optionally) defined by the user, and all paths are relative to the imw_data_root directory, which the user may also specify (see directories for more).
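The default layout might be computed along the following lines. This is a sketch under stated assumptions: the dataset_directories helper and its parameter names are hypothetical, and the way a source URL is turned into a ripd path (host followed by URL path) is a guess, since the text above only gives the pattern ripd/dataset_source_url.

```ruby
require 'pathname'
require 'uri'

# Hypothetical helper: build the four default directories for one dataset.
# Pathname only manipulates strings here; nothing is created on disk.
def dataset_directories(data_root:, dataset:, source_url:, category: nil)
  root = Pathname.new(data_root)
  uri  = URI.parse(source_url)

  # Assumption: the ripped copy mirrors the source URL's host and path.
  url_parts = [uri.host, uri.path.to_s.sub(%r{\A/}, '')].compact.reject(&:empty?)

  # The category hierarchy is optional, so drop it when it is not given.
  rest = [category, dataset].compact

  {
    ripd: root.join('ripd', *url_parts).to_s,
    rawd: root.join('rawd', *rest).to_s,
    fixd: root.join('fixd', *rest).to_s,
    pkgd: root.join('pkgd', *rest).to_s
  }
end

dirs = dataset_directories(
  data_root:  '/data/imw',                           # stands in for imw_data_root
  category:   'people/historical',                   # category/subcat
  dataset:    'computing_pioneers',
  source_url: 'http://example.com/data/people.csv'   # made-up URL
)
```

Leaving out the category simply places the dataset directly under rawd, fixd, and pkgd.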
Using IMW
In executing the above workflow for a dataset, IMW will (possibly) need to venture out onto the Internet to collect data, create (local or remote) directories to store ripped, raw, fixed, and packaged data, and write files in these directories (and possibly others) which contain the data in various stages of processing as well as output from IMW itself.
IMW will also need information about the particulars of the dataset it is being asked to process. Where is the data coming from? What format is it in? Which fields are important to keep? What is the nature of those fields? Does the original data source ever change, requiring periodic incremental updates of the final data output? Are there security concerns in the workflow that require authentication or encryption?
IMW will do its best to guess sensible defaults for much of the information it needs, but some information will always have to be supplied by the user. This is done through YAML-formatted configuration files.[2] Each dataset will have two such YAML files: one describing the workflow required to process the data, and one describing the structure (schema) of the dataset itself.[3] The IMW tool will be fed these YAML configuration files and will churn away, processing data while we noble Infochimps who created it sit back and pluck mites out of each other’s coats.
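To give a feel for what such a file could look like, here is a small workflow description with the schema incorporated inline, read with Ruby's standard YAML library. Every key in it (dataset, obtain, parse, package, schema, fields) and the example URL are assumptions made for this sketch; IMW's real configuration vocabulary may differ.

```ruby
require 'yaml'

# Hypothetical workflow file with an inline schema; all keys are illustrative.
WORKFLOW_CONFIG = <<~CONFIG
  dataset: computing_pioneers
  obtain:
    url: http://example.com/data/people.csv
  parse:
    format: csv
  package:
    format: yaml
  schema:
    fields:
      - name: id
        type: integer
      - name: name
        type: string
CONFIG

# safe_load only builds plain hashes, arrays, and scalars.
workflow = YAML.safe_load(WORKFLOW_CONFIG)

# The schema is inline here; a separate schema file would be loaded the same way.
schema      = workflow.fetch('schema')
field_names = schema.fetch('fields').map { |field| field.fetch('name') }
field_types = schema.fetch('fields').to_h { |field| [field['name'], field['type']] }
```

A larger dataset would keep the schema section in its own YAML file and leave only the workflow keys in this one.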
[1] While we intend the notion of a “dataset” to be quite general,
there are many simple structures that cannot be thought of in a
tabular way: family trees, filesystems, etc.
[2] Is it ironic that a data manipulation tool like IMW requires configuration files in only one particular data format? Yes, it is ironic. Now shush.
[3] For simple datasets, the YAML schema file can be incorporated inline in the YAML workflow file.







