public
Description: Python module that allows you to easily write and run Hadoop programs.
Home | Edit | New

Running programs

Important remark for people who are still using Dumbo 0.20: The dumbo command mentioned below is the one installed by the optional installation step described in Building and installing. If you only installed Dumbo as part of Hadoop, then you:

  • cannot run programs locally on UNIX,
  • have to omit the -hadoop option when you execute contrib/dumbo/bin/dumbo to start a distributed run of a program.

Locally on UNIX

If you completed both the mandatory and optional installation steps described in Building and installing, then you can run a Dumbo program program.py locally as follows:

$ dumbo start program.py -input input.txt -output output.txt
$ dumbo cat output.txt | more

Other useful options for local runs are:

  • -inputformat <name of inputformat> (“text” by default)
  • -input <additional input path>
  • -python <python command to use> (“python” by default)
  • -libegg <path to egg> (this egg gets put in the Python path)
  • -cmdenv <env var name>=<value>
  • -pv yes (use “pv” to display progress info)
  • -addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
  • -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
  • -memlimit <number of bytes> (set an upper limit on the amount of memory that can be used)

Distributed on Hadoop

To run a program on Hadoop, you basically just have to add the -hadoop option:

$ dumbo start program.py -hadoop <path to local hadoop> \
-input <DFS input path> -output <DFS output path>
$ dumbo cat <DFS output path> -hadoop <path to local hadoop> | more

Other useful options are (see also the Hadoop streaming page and wiki):

  • -inputformat <name of inputformat (class)> (“auto” by default)
  • -input <additional DFS input path>
  • -python <python command to use on nodes> (“python” by default)
  • -name <job name> (“program.py” by default)
  • -numMapTasks <number>
  • -numReduceTasks <number> (no sorting or reducing will take place if this is 0)
  • -priority <priority value> (“NORMAL” by default)
  • -libjar <path to jar> (this jar gets put in the class path)
  • -libegg <path to egg> (this egg gets put in the Python path)
  • -file <local file> (this file will be put in the dir where the python program gets executed)
  • -cacheFile hdfs://<host>:<fs_port>/<path to file>#<link name> (a link “<link name>” to the given file will be in the dir)
  • -cacheArchive hdfs://<host>:<fs_port>/<path to jar>#<link name> (link points to dir that contains files from given jar)
  • -cmdenv <env var name>=<value>
  • -hadoopconf <property name>=<value>
  • -addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
  • -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
  • -memlimit <number of bytes> (set an upper limit on the amount of memory that can be used)
  • -preoutputs yes (don’t delete intermediate output results – can be useful for debugging when running multiple map/red iterations)
Last edited by e1i45, Thu Aug 20 09:00:44 -0700 2009
Home | Edit | New
Versions: