Description: Infinite Monkeywrench - A framework for collecting, peeling, and sharing delicious bananas of data.

Rip

Ripping is the process of obtaining data from an online source (or
sources) and aggregating it together in an archive. This is the first
step in the IMW Workflow and is required before the data
can be processed any further.

The format of a dataset is independent of its source: an RSS feed
most likely comes from the Web over HTTP, but XML files could equally
well arrive via HTTP, FTP, or a local disk.

Dataset sources

The sources from which IMW can¹ obtain datasets are listed below
along with the parameters required for each as well as IMW’s default
guesses in parentheses. [dataset] means the name of the particular
dataset that IMW is processing. Empty parentheses imply that IMW has
no way of guessing the value of that parameter and the user will be
prompted for action if appropriate. Required parameters are in bold.

  • Local disk
    • directory (./[dataset]/*)
  • SCP
    • server ( )
    • username ([current user])
    • password ()
    • port (22)
    • directory (~/[dataset]/*)
  • FTP
    • server ( )
    • username (anonymous)
    • password ()
    • port (21)
    • directory (/[dataset]/*)
  • HTTP
    • domain ( )
    • directory (/*)
  • Database
    • type (MySQL)
    • username ([current user])
    • password ()
    • database ([dataset])
    • tables ([all])
    • query ()
    • output format (CSV)
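
The defaults listed above can be sketched as a lookup table plus a small resolver. This is only an illustration in Python, not IMW's actual interface: the names SOURCE_DEFAULTS and resolve_parameters, the source keys, and the use of None for a parameter that cannot be guessed are all assumptions made for the example.

```python
import getpass

# Hypothetical defaults table mirroring the list above. None marks a parameter
# with no usable guess (empty parentheses in the list).
SOURCE_DEFAULTS = {
    "local": {"directory": "./{dataset}/*"},
    "scp": {
        "server": None,
        "username": "{user}",
        "password": None,
        "port": 22,
        "directory": "~/{dataset}/*",
    },
    "ftp": {
        "server": None,
        "username": "anonymous",
        "password": None,
        "port": 21,
        "directory": "/{dataset}/*",
    },
    "http": {"domain": None, "directory": "/*"},
    "database": {
        "type": "MySQL",
        "username": "{user}",
        "password": None,
        "database": "{dataset}",
        "tables": "all",
        "query": None,
        "output_format": "CSV",
    },
}


def resolve_parameters(source, dataset, user=None, **overrides):
    """Return (resolved parameters, names still missing) for one source."""
    if source not in SOURCE_DEFAULTS:
        raise ValueError(f"unknown source: {source}")
    defaults = SOURCE_DEFAULTS[source]
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if user is None:
        user = getpass.getuser()  # stands in for [current user]

    resolved = {}
    for name, default in defaults.items():
        if name in overrides:
            # Values supplied by the caller are used verbatim.
            resolved[name] = overrides[name]
        elif isinstance(default, str):
            # Fill the [dataset] and [current user] placeholders.
            resolved[name] = default.format(dataset=dataset, user=user)
        else:
            resolved[name] = default

    # Anything still None would have to be requested from the user.
    missing = [name for name, value in resolved.items() if value is None]
    return resolved, missing
```

Calling resolve_parameters("ftp", "voter_turnout") would fill in the anonymous username, port 21, and the /voter_turnout/* directory, and report server and password as still missing.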

Once a source is specified, IMW will download files and place them in
the dataset’s ripd directory.
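
For the simplest source, the local disk, the rip step might look roughly like the sketch below. It is written in Python for illustration and assumes that ripd is a subdirectory of a per-dataset root directory; that layout and the function name rip_local are assumptions, not something this page specifies.

```python
import glob
import shutil
from pathlib import Path


def rip_local(pattern, dataset_root):
    """Copy every file matching `pattern` into <dataset_root>/ripd."""
    ripd = Path(dataset_root) / "ripd"  # assumed location of the ripd directory
    ripd.mkdir(parents=True, exist_ok=True)

    copied = []
    for source in sorted(glob.glob(pattern)):
        source_path = Path(source)
        if not source_path.is_file():
            continue  # this sketch skips subdirectories
        target = ripd / source_path.name
        shutil.copy2(source_path, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

With the default local directory from the list above, a call such as rip_local("./voter_turnout/*", "datasets/voter_turnout") would copy the matching files and return their new paths.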

Downloading options

Some datasets are static and need only be downloaded once (example:
2004 voter turnout by age). Other datasets are dynamic and need to be
continuously updated as they change (example: daily closing price of
NYSE by stock).
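
One way to picture the difference between static and dynamic datasets is as a check that decides whether a dataset should be ripped again. The Python sketch below is hypothetical: the function needs_download, the dynamic flag, and the one-day default interval are assumptions chosen for illustration and are not options defined by IMW.

```python
from datetime import datetime, timedelta


def needs_download(last_ripped, dynamic, interval=timedelta(days=1), now=None):
    """Decide whether a dataset should be ripped (again).

    last_ripped: datetime of the previous rip, or None if never ripped.
    dynamic:     True for datasets that change over time.
    interval:    assumed minimum time between rips of a dynamic dataset.
    """
    if last_ripped is None:
        return True  # never ripped: always download once
    if not dynamic:
        return False  # static datasets are downloaded only once
    if now is None:
        now = datetime.now()
    return now - last_ripped >= interval
```

Under these assumptions, the 2004 voter turnout data would be fetched once and never again, while the daily closing prices would be fetched again once a day has passed since the last rip.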

Dataset types

The formats that IMW recognizes are listed below.

  • HTML: a website or a collection of websites (HTML tables are a good example)
  • Feed: an RSS or Atom feed
  • CSV: comma-separated values
  • XML: XML or XHTML²
  • YAML
  • JSON
  • Excel
  • SQL: a file containing either SQL to populate a database or SQL queries to
    run on a database to produce results which are the dataset
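
As a rough illustration of how a ripped file could be matched to one of the types above, the Python sketch below guesses the type from the file extension. The extension table and the function guess_dataset_type are assumptions made for the example; the page does not say how IMW detects types. XHTML is filed under XML here only because the list above does so.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the dataset types listed above.
EXTENSION_TYPES = {
    ".html": "HTML",
    ".htm": "HTML",
    ".rss": "Feed",
    ".atom": "Feed",
    ".csv": "CSV",
    ".xml": "XML",
    ".xhtml": "XML",
    ".yaml": "YAML",
    ".yml": "YAML",
    ".json": "JSON",
    ".xls": "Excel",
    ".xlsx": "Excel",
    ".sql": "SQL",
}


def guess_dataset_type(filename):
    """Return the dataset type for `filename`, or None if it is not recognized."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the last extension
    return EXTENSION_TYPES.get(suffix)
```

A real implementation would probably also need to inspect file contents, since an .xml file may turn out to be a feed.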

¹ “can” should be read as “will”.

² Should XHTML more properly belong under HTML rather than XML?

Last edited by dhruvbansal, Wed Jun 11 15:29:43 -0700 2008