public
Description: Python module that allows you to easily write and run Hadoop programs.
Home | Edit | New

Running programs

The dumbo command mentioned below is the one installed by the optional installation step described in Building and installing. If you only installed Dumbo as part of Hadoop, then you:

  • cannot run programs locally on UNIX,
  • have to omit the -hadoop option when you execute contrib/dumbo/bin/dumbo to start a distributed run of a program.

Locally on UNIX

If you completed both the mandatory and optional installation steps described in Building and installing, then you can run a Dumbo program program.py locally as follows:

$ dumbo start program.py -input input.txt -output output.txt
$ dumbo cat output.txt | more

Other useful options for local runs are:

  • -inputformat <name of inputformat> (“text” by default)
  • -input <additional input path>
  • -python <python command to use> (“python” by default)
  • -libegg <path to egg> (this egg gets put in the Python path)
  • -cmdenv <env var name>=<value>
  • -pv yes (use “pv” to display progress info)
  • -addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
  • -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
  • -memlimit <number of bytes> (set an upper limit on the amount of memory that can be used)

Distributed on Hadoop

To run a program on Hadoop, you basically just have to add the -hadoop option:

$ dumbo start program.py -hadoop <path to local hadoop> \
-input <DFS input path> -output <DFS output path>
$ dumbo cat <DFS output path> -hadoop <path to local hadoop> | more
Other useful options are (see also the Hadoop streaming page and wiki):

  • -inputformat <name of inputformat (class)> (“auto” by default)
  • -input <additional DFS input path>
  • -python <python command to use on nodes> (“python” by default)
  • -name <job name> (“program.py” by default)
  • -numMapTasks <number>
  • -numReduceTasks <number> (no sorting or reducing will take place if this is 0)
  • -priority <priority value> (“NORMAL” by default)
  • -libjar <path to jar> (this jar gets put in the class path)
  • -libegg <path to egg> (this egg gets put in the Python path)
  • -file <local file> (this file will be put in the dir where the python program gets executed)
  • -cacheFile hdfs://<host>:<fs_port>/<path to file>#<link name> (a link ”<link name>” to the given file will be in the dir)
  • -cacheArchive hdfs://<host>:<fs_port>/<path to jar>#<link name> (link points to dir that contains files from given jar)
  • -cmdenv <env var name>=<value>
  • -hadoopconf <property name>=<value>
  • -addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
  • -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
  • -memlimit <number of bytes> (set an upper limit on the amount of memory that can be used)
Last edited by klbostee, Wed May 27 09:03:51 -0700 2009
Home | Edit | New
Versions: