public
Description: Ruby application for linking databases that don't have consistent identifiers
Home | Edit | New

Configuration

The configuration for Coupler is YAML-based. It has three main sections:
resources, transformations, and
scenarios. Once you have a configuration setup, you can run Coupler via its Scripts.

Resources

Here’s where you tell Coupler how to connect to your databases. The
configuration is similar to Rails’ config/database.yml file. Here’s the
layout:

resources:
  - name: <string>
    connection:
      adapter: <string>     # mysql or sqlite3
      database: <string>
      host: <string>        # only necessary for mysql
      password: <string>    # only necessary for mysql
      username: <string>    # only necessary for mysql
      table:
        name: <string>        # name of the database table
        primary key: <string> # primary key field for this table

The name attribute must be unique across resources. The database attribute
is a database name for mysql and a filename for sqlite3, just like in Rails.

It’s important to note that there must be a resource specification for scores
and scratch. The scratch resource is used as an intermediate step and
should be treated as a temporary database. The scores resource is used to
keep scores of pairs in. These two resources should probably point to two
different databases, and they should be local for performance reasons. The
table section is not used when specifying the scores and scratch resources.

You also need a resource for each database table you want to link together.

Here’s an example of what a resources section might look like:

resources:
  - name: my_data
    connection:
      adapter: sqlite3
      database: my_data.sqlite3
      table:
        name: records
        primary key: id
  - name: scratch
    connection:
      adapter: mysql
      host: localhost
      username: scratch
      password: secret
      database: scratch
  - name: scores
    connection:
      adapter: mysql
      host: localhost
      username: scores
      password: secret
      database: scores

Transformations

Transformers are methods used to change your data. If you want to convert a
date to a string, for example, you would use a transformer. You can also
create new fields through transformers. There are two sections to this:
functions and resources.

Here’s the specification:


transformations:
functions:
– name:
parameters: [ param1, param2, … ]
sql: <string|map>
ruby:
type:
resources: :
– field:
function:
arguments:
:
:

Functions

You can use this section to define your own transformer methods. There are
also stock transformer methods, so if those fill your needs you don’t have to
include this section (see Stock Transformers).

Each transformer function you create must have a unique name (across
transformer functions). There also must be one or more parameters, which is
a sequence of strings. Each parameter must be unique (across parameters).
Parameters are used in formulas (see next paragraph). type is the database
type that this transformer will return.

The two ways you can specify formulas are through ruby and sql. ruby is
Ruby code you can use to change your data, where you can use parameters like
regular variables. sql can either be a string or a mapping. If sql is a
string, it is treated as a SQL expression regardless of the database adapter
you’re using (MySQL or SQLite3). If you want a different formula for each
database adapter, you can make sql a mapping, like so:

sql:
  mysql: <some mysql-specific formula>
  sqlite3: <some sqlite3-specific formula>
  default: <use this formula for all other adapters>

If there is no SQL formula listed for an adapter, Coupler will use ruby
instead and transform the fields via Ruby instead. Coupler will always try to
use a SQL formula first. Don’t use ruby when a simple SQL formula will
suffice.

I know that all might sound confusing, so here’s an example. Let’s say I have
a date field in one of my tables that I want to convert to a string before
matching. I would create a transformer function that looks like this:

- name: date_to_string
  parameters: [ date ]
  sql: DATE_FORMAT(date, "%Y%m%d")
  type: varchar(8)

This transformer has one parameter, date. The sql formula in this case
will convert a date like June 19, 2008 to “20080619”.

Let’s say I instead wanted to filter out certain values based on a regular
expression. I would create a transformer like so:

- name: filter_bar
  parameters: [ some_string ]
  ruby: "some_string =~ /\d+.*bar/ ? some_string : nil"
  sql:
    mysql: "IF(some_string REGEXP '[0-9]+.*bar', some_string, NULL)"

This transformer takes a string as an argument. If the resource being
transformed is using MySQL, then Coupler will use the mysql formula above.
If the resource is using something other than MySQL (such as SQLite3), Coupler
will use the ruby formula.

Resources

In this section you connect your resources to your transformer functions. The
field attribute here is the resulting field name you wish to create. It can
be the same as a field you want to transform. For example, if you had this:

transformations:
  functions:
    - name: date_to_string
      parameters: [ date ]
      sql: DATE_FORMAT(date, "%Y%m%d")
      type: varchar(8)
  resources:
    foo:
      - field: SomeDateField
        transformer: date_to_string
        arguments:
          date: SomeDateField

It would convert the field SomeDateField in the foo resource to a string,
and then store that result in a field with the same name in the scratch
resource. It’s important to note that your original resource is never
changed
. Note that you can also specify stock transformers here.

Stock Transformers

Here’s the list of stock transformers (which is currently short):

  • trimmer – trims whitespace from a column
  • renamer – renames a column to another name (basically a no-op)
  • downcaser – lowercases a string column

Each of these transformers takes one argument, called from. For example:

transformations:
  resources:
    foo:
      - field: trimmed_bar
        transformer: trimmer
        arguments:
          from: bar
      - field: used_to_be_blah
        transformer: renamer
        arguments:
          from: blah
      - field: lowercase_test
        transformer: downcaser
        arguments:
          from: uppercase_test

Scenarios

Scenarios are the glue of Coupler’s configuration. Here is where resources and
transformations come together. The layout looks like this:

 scenarios:
  - name: <string>
    type: <type>                # self-join or dual-join
    resource: <name>            # name of resource (use in self-join mode)
    resources: [<name>, <name>] # names of resources (use in dual-join mode)
    matchers:
      - field: <string>         # name of database field or transformer field
        type: <type>            # only 'exact' is accepted at present (optional)
        scores: [<low>, <high>] # used for 'exact' scoring; low is the
                                #   non-matching score (by default, [0, 100])
        formula: <string>       # used for non-exact matches; should return a score
    scoring:
      range: <range>            # a range, e.g. 50..100
      combining method: <type>  # 'mean' or 'sum'

Coupler can either produce a CSV file of scores or leave the scores in a
database table. The name of each scenario determines the name of either the
CSV file or the database table, whichever mode you happen to be using.
resource or resources is the name(s) of the resource(s) you want to join.

The matchers portion is where you specify which fields you want to compare.
field is the name of the field. If you want an exact match, you should use
the type attribute. Otherwise, you can use your own formula, where you can
use a and b to refer to values like so: a + b < 50 ? 30 : 70. The
formula is Ruby code and should return a score. Please note that using your
own scoring method is currently very slow. If you’re using exact matching
and wish to have custom scores, you can use the scores attribute to specify
that.

Finally, in the scoring portion you tell Coupler what range of scores you
consider to be a match. All other scores will be excluded. The
combining method attribute tells Coupler how to combine scores (either ‘mean’
or ‘sum’ for now).

This configuration format is very much still in development.

Related pages: Terminology, Scripts

Last edited by viking, Fri Sep 05 13:43:57 -0700 2008
Home | Edit | New
Versions: