Rip
Ripping is the process of obtaining data from one or more sources
(usually online) and aggregating it in an archive. This is the first
step in the IMW Workflow and is required before the data
can be processed any further.
A dataset's format is independent of its source: an RSS feed
most likely arrives from the Web over HTTP, but XML
files could equally well come via HTTP, FTP, or a local disk.
Dataset sources
The sources from which IMW can[1] obtain datasets are listed below
along with the parameters required for each as well as IMW’s default
guesses in parentheses. [dataset] means the name of the particular
dataset that IMW is processing. Empty parentheses imply that IMW has
no way of guessing the value of that parameter and the user will be
prompted for action if appropriate. Required parameters are in bold.
- Local disk
  - directory (./[dataset]/*)
- SCP
  - server ()
  - username ([current user])
  - password ()
  - port (22)
  - directory (~/[dataset]/*)
- FTP
  - server ()
  - username (anonymous)
  - password ()
  - port (21)
  - directory (/[dataset]/*)
- HTTP
  - domain ()
  - directory (/*)
- Database
  - type (MySQL)
  - username ([current user])
  - password ()
  - database ([dataset])
  - tables ([all])
  - query ()
  - output format (CSV)
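The default-guessing rules above can be sketched in a few lines of Python. This is only an illustration, not IMW's actual code: the names `SOURCE_DEFAULTS` and `resolve_parameters` are hypothetical, the source keys are shorthand for the list entries, and `None` stands in for the empty parentheses (a value IMW cannot guess).

```python
import getpass

# Defaults transcribed from the list above. None marks a parameter with
# empty parentheses, i.e. one that cannot be guessed (hypothetical encoding).
SOURCE_DEFAULTS = {
    "local": {"directory": "./[dataset]/*"},
    "scp": {"server": None, "username": "[current user]", "password": None,
            "port": 22, "directory": "~/[dataset]/*"},
    "ftp": {"server": None, "username": "anonymous", "password": None,
            "port": 21, "directory": "/[dataset]/*"},
    "http": {"domain": None, "directory": "/*"},
    "database": {"type": "MySQL", "username": "[current user]",
                 "password": None, "database": "[dataset]",
                 "tables": "[all]", "query": None, "output format": "CSV"},
}


def resolve_parameters(source, dataset, overrides=None, current_user=None):
    """Merge user overrides into the defaults for one source.

    Returns (params, missing): params holds every parameter with a value,
    with the [dataset] and [current user] placeholders filled in; missing
    lists the parameters that still have no value and would need a prompt.
    """
    if current_user is None:
        current_user = getpass.getuser()
    merged = {**SOURCE_DEFAULTS[source], **(overrides or {})}
    params, missing = {}, []
    for name, value in merged.items():
        if value is None:
            missing.append(name)
            continue
        if isinstance(value, str):
            value = (value.replace("[dataset]", dataset)
                          .replace("[current user]", current_user))
        params[name] = value
    return params, missing


# Usage: an FTP source for a dataset named "turnout", no overrides given.
params, missing = resolve_parameters("ftp", "turnout", current_user="alice")
print(params)   # guessed values, with placeholders substituted
print(missing)  # parameters the user would be prompted for
```

Keeping the unguessable parameters in a separate list mirrors the behaviour described above, where the user is prompted only for values that have no default.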
Once a source is specified, IMW will download files and place them in
the dataset’s ripd directory.
Downloading options
Some datasets are static and need only be downloaded once (example:
2004 voter turnout by age). Other datasets are dynamic and need to be
continuously updated as they change (example: daily closing price of
NYSE by stock).
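One way to picture the static/dynamic distinction is as a check that decides whether a dataset needs to be ripped again. The sketch below is an assumption about how such a check could look, not a description of IMW itself; `needs_download`, `is_dynamic`, and the one-day `refresh_interval` are all hypothetical.

```python
from datetime import datetime, timedelta


def needs_download(is_dynamic, last_ripped, now,
                   refresh_interval=timedelta(days=1)):
    """Decide whether a dataset should be (re-)downloaded.

    last_ripped is the datetime of the previous rip, or None if the
    dataset has never been ripped.
    """
    if last_ripped is None:
        return True   # nothing on disk yet: always download
    if not is_dynamic:
        return False  # static datasets are downloaded only once
    # Dynamic datasets are refreshed once the interval has elapsed.
    return now - last_ripped >= refresh_interval


# Usage: a static dataset ripped yesterday versus a dynamic one.
now = datetime(2024, 1, 2, 12, 0)
yesterday = datetime(2024, 1, 1, 0, 0)
print(needs_download(False, yesterday, now))
print(needs_download(True, yesterday, now))
```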
Dataset types
The formats that IMW can handle are:
- HTML: a website or a collection of websites (HTML tables are a good example)
- Feed: an RSS or Atom feed
- CSV: comma-separated values
- XML: XML or XHTML[2]
- YAML
- JSON
- Excel
- SQL: a file containing either SQL to populate a database or SQL queries to
run on a database to produce results which are the dataset
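As a rough illustration of how a ripped file might be matched to one of the types above, the sketch below guesses the type from the file extension. The mapping and the function name `guess_dataset_type` are assumptions made for this example; nothing here is taken from IMW. XHTML is filed under XML only because the list above does so.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-type mapping using the type names listed above.
EXTENSION_TO_TYPE = {
    ".html": "HTML", ".htm": "HTML",
    ".rss": "Feed", ".atom": "Feed",
    ".csv": "CSV",
    ".xml": "XML", ".xhtml": "XML",
    ".yaml": "YAML", ".yml": "YAML",
    ".json": "JSON",
    ".xls": "Excel", ".xlsx": "Excel",
    ".sql": "SQL",
}


def guess_dataset_type(filename):
    """Return the dataset type for a filename, or None if unrecognized."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix)


# Usage
print(guess_dataset_type("closing_prices.csv"))
print(guess_dataset_type("notes.txt"))
```

An extension is only a heuristic: a feed is often served as a plain `.xml` file, so a real implementation would probably also need to inspect the content.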
[1] “can” should be read as: will
[2] Should XHTML more properly belong to HTML instead of XML?







