Every repository with this icon (
Every repository with this icon (
Configuration
The configuration for Coupler is YAML-based. It has three main sections:
resources, transformations, and
scenarios. Once you have a configuration setup, you can run Coupler via its Scripts.
Resources
Here’s where you tell Coupler how to connect to your databases. The
configuration is similar to Rails’ config/database.yml file. Here’s the
layout:
resources:
- name: <string>
connection:
adapter: <string> # mysql or sqlite3
database: <string>
host: <string> # only necessary for mysql
password: <string> # only necessary for mysql
username: <string> # only necessary for mysql
table:
name: <string> # name of the database table
primary key: <string> # primary key field for this table
The name attribute must be unique across resources. The database attribute
is a database name for mysql and a filename for sqlite3, just like in Rails.
It’s important to note that there must be a resource specification for scores
and scratch. The scratch resource is used as an intermediate step and
should be treated as a temporary database. The scores resource is used to
keep scores of pairs in. These two resources should probably point to two
different databases, and they should be local for performance reasons. The
table section is not used when specifying the scores and scratch resources.
You also need a resource for each database table you want to link together.
Here’s an example of what a resources section might look like:
resources:
- name: my_data
connection:
adapter: sqlite3
database: my_data.sqlite3
table:
name: records
primary key: id
- name: scratch
connection:
adapter: mysql
host: localhost
username: scratch
password: secret
database: scratch
- name: scores
connection:
adapter: mysql
host: localhost
username: scores
password: secret
database: scores
Transformations
Transformers are methods used to change your data. If you want to convert a
date to a string, for example, you would use a transformer. You can also
create new fields through transformers. There are two sections to this:
functions and resources.
Here’s the specification:
transformations:
functions:
– name:
parameters: [ param1, param2, … ]
sql: <string|map>
ruby:
type:
resources::
– field:
function:
arguments:
:
:
…
Functions
You can use this section to define your own transformer methods. There are
also stock transformer methods, so if those fill your needs you don’t have to
include this section (see Stock Transformers).
Each transformer function you create must have a unique name (across
transformer functions). There also must be one or more parameters, which is
a sequence of strings. Each parameter must be unique (across parameters).
Parameters are used in formulas (see next paragraph). type is the database
type that this transformer will return.
The two ways you can specify formulas are through ruby and sql. ruby is
Ruby code you can use to change your data, where you can use parameters like
regular variables. sql can either be a string or a mapping. If sql is a
string, it is treated as a SQL expression regardless of the database adapter
you’re using (MySQL or SQLite3). If you want a different formula for each
database adapter, you can make sql a mapping, like so:
sql: mysql: <some mysql-specific formula> sqlite3: <some sqlite3-specific formula> default: <use this formula for all other adapters>
If there is no SQL formula listed for an adapter, Coupler will use ruby
instead and transform the fields via Ruby instead. Coupler will always try to
use a SQL formula first. Don’t use ruby when a simple SQL formula will
suffice.
I know that all might sound confusing, so here’s an example. Let’s say I have
a date field in one of my tables that I want to convert to a string before
matching. I would create a transformer function that looks like this:
- name: date_to_string parameters: [ date ] sql: DATE_FORMAT(date, "%Y%m%d") type: varchar(8)
This transformer has one parameter, date. The sql formula in this case
will convert a date like June 19, 2008 to “20080619”.
Let’s say I instead wanted to filter out certain values based on a regular
expression. I would create a transformer like so:
- name: filter_bar
parameters: [ some_string ]
ruby: "some_string =~ /\d+.*bar/ ? some_string : nil"
sql:
mysql: "IF(some_string REGEXP '[0-9]+.*bar', some_string, NULL)"
This transformer takes a string as an argument. If the resource being
transformed is using MySQL, then Coupler will use the mysql formula above.
If the resource is using something other than MySQL (such as SQLite3), Coupler
will use the ruby formula.
Resources
In this section you connect your resources to your transformer functions. The
field attribute here is the resulting field name you wish to create. It can
be the same as a field you want to transform. For example, if you had this:
transformations:
functions:
- name: date_to_string
parameters: [ date ]
sql: DATE_FORMAT(date, "%Y%m%d")
type: varchar(8)
resources:
foo:
- field: SomeDateField
transformer: date_to_string
arguments:
date: SomeDateField
It would convert the field SomeDateField in the foo resource to a string,
and then store that result in a field with the same name in the scratch
resource. It’s important to note that your original resource is never
changed. Note that you can also specify stock transformers here.
Stock Transformers
Here’s the list of stock transformers (which is currently short):
- trimmer – trims whitespace from a column
- renamer – renames a column to another name (basically a no-op)
- downcaser – lowercases a string column
Each of these transformers takes one argument, called from. For example:
transformations:
resources:
foo:
- field: trimmed_bar
transformer: trimmer
arguments:
from: bar
- field: used_to_be_blah
transformer: renamer
arguments:
from: blah
- field: lowercase_test
transformer: downcaser
arguments:
from: uppercase_test
Scenarios
Scenarios are the glue of Coupler’s configuration. Here is where resources and
transformations come together. The layout looks like this:
scenarios:
- name: <string>
type: <type> # self-join or dual-join
resource: <name> # name of resource (use in self-join mode)
resources: [<name>, <name>] # names of resources (use in dual-join mode)
matchers:
- field: <string> # name of database field or transformer field
type: <type> # only 'exact' is accepted at present (optional)
scores: [<low>, <high>] # used for 'exact' scoring; low is the
# non-matching score (by default, [0, 100])
formula: <string> # used for non-exact matches; should return a score
scoring:
range: <range> # a range, e.g. 50..100
combining method: <type> # 'mean' or 'sum'
Coupler can either produce a CSV file of scores or leave the scores in a
database table. The name of each scenario determines the name of either the
CSV file or the database table, whichever mode you happen to be using.
resource or resources is the name(s) of the resource(s) you want to join.
The matchers portion is where you specify which fields you want to compare.
field is the name of the field. If you want an exact match, you should use
the type attribute. Otherwise, you can use your own formula, where you can
use a and b to refer to values like so: a + b < 50 ? 30 : 70. The
formula is Ruby code and should return a score. Please note that using your
own scoring method is currently very slow. If you’re using exact matching
and wish to have custom scores, you can use the scores attribute to specify
that.
Finally, in the scoring portion you tell Coupler what range of scores you
consider to be a match. All other scores will be excluded. The
combining method attribute tells Coupler how to combine scores (either ‘mean’
or ‘sum’ for now).
This configuration format is very much still in development.
Related pages: Terminology, Scripts






