This repository is private.
All pages are served over SSL and all pushing and pulling is done over SSH.
No one may fork, clone, or view it unless they are added as a member.
Every repository with this icon (
) is private.
Every repository with this icon (
This repository is public.
Anyone may fork, clone, or view it.
Every repository with this icon (
) is public.
Every repository with this icon (
Running programs
Important remark for people who are still using Dumbo 0.20: The dumbo command mentioned below is the one installed by the optional installation step described in Building and installing. If you only installed Dumbo as part of Hadoop, then you:
- cannot run programs locally on UNIX,
- have to omit the -hadoop option when you execute contrib/dumbo/bin/dumbo to start a distributed run of a program.
Locally on UNIX
If you completed both the mandatory and optional installation steps described in Building and installing, then you can run a Dumbo program program.py locally as follows:
$ dumbo start program.py -input input.txt -output output.txt
$ dumbo cat output.txt | more
Other useful options for local runs are:
- -inputformat <name of inputformat> (“text” by default)
- -input <additional input path>
- -python <python command to use> (“python” by default)
- -libegg <path to egg> (this egg gets put in the Python path)
- -cmdenv <env var name>=<value>
- -pv yes (use “pv” to display progress info)
- -addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
- -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
- -memlimit <number of bytes> (set an upper limit on the amount of memory that can be used)
Distributed on Hadoop
To run a program on Hadoop, you basically just have to add the -hadoop option:
$ dumbo start program.py -hadoop <path to local hadoop> \
-input <DFS input path> -output <DFS output path>
$ dumbo cat <DFS output path> -hadoop <path to local hadoop> | more
Other useful options are (see also the Hadoop streaming page and wiki):
- -inputformat <name of inputformat (class)> (“auto” by default)
- -input <additional DFS input path>
- -python <python command to use on nodes> (“python” by default)
- -name <job name> (“program.py” by default)
- -numMapTasks <number>
- -numReduceTasks <number> (no sorting or reducing will take place if this is 0)
- -priority <priority value> (“NORMAL” by default)
- -libjar <path to jar> (this jar gets put in the class path)
- -libegg <path to egg> (this egg gets put in the Python path)
- -file <local file> (this file will be put in the dir where the python program gets executed)
- -cacheFile hdfs://<host>:<fs_port>/<path to file>#<link name> (a link “<link name>” to the given file will be in the dir)
- -cacheArchive hdfs://<host>:<fs_port>/<path to jar>#<link name> (link points to dir that contains files from given jar)
- -cmdenv <env var name>=<value>
- -hadoopconf <property name>=<value>
- -addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
- -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
- -memlimit <number of bytes> (set an upper limit on the amount of memory that can be used)
- -preoutputs yes (don’t delete intermediate output results – can be useful for debugging when running multiple map/red iterations)






