Building and installing
Dumbo Version 0.21
Patching Hadoop
If you use a version of Hadoop earlier than 0.21 (which hasn’t been released yet as of this writing), you first have to patch Hadoop. Download the patches for HADOOP-1722, HADOOP-5450, and MAPREDUCE-764, apply them in the order shown below (the order is important!), and then rebuild Hadoop:
$ cd /path/to/hadoop
$ patch -p0 < /path/to/HADOOP-1722.patch
$ patch -p0 < /path/to/HADOOP-5450.patch
$ patch -p0 < /path/to/MAPREDUCE-764.patch
$ ant package
If you want to use Dumbo’s convenient joining abstraction, you need to apply HADOOP-5528 as well.
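Because the patches must go on in a fixed order, it can help to script the step so that a failed patch stops the run before the next one is attempted. The following is only a sketch: the helper names `patch_commands` and `apply_patches` are hypothetical, the patch files are assumed to be named after their issue IDs as in the commands above, and the order is taken from those commands.

```python
import os
import subprocess

# Order taken from the manual commands above; it must not be changed.
PATCH_ORDER = ["HADOOP-1722", "HADOOP-5450", "MAPREDUCE-764"]


def patch_commands(patch_dir):
    """Build one `patch -p0 -i <file>` argument list per issue, in order."""
    return [
        ["patch", "-p0", "-i", os.path.join(patch_dir, issue + ".patch")]
        for issue in PATCH_ORDER
    ]


def apply_patches(hadoop_dir, patch_dir):
    """Apply the patches inside hadoop_dir, then rebuild with `ant package`."""
    for argv in patch_commands(patch_dir):
        # check=True raises CalledProcessError, so a failing patch
        # prevents the later ones from being applied out of order.
        subprocess.run(argv, cwd=hadoop_dir, check=True)
    subprocess.run(["ant", "package"], cwd=hadoop_dir, check=True)
```

Calling `apply_patches("/path/to/hadoop", "/path/to")` would then mirror the manual steps; add HADOOP-5528 to the list only if you need the joining abstraction and know where it belongs in the order.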
Installing Dumbo
On the machine from which you want to run your Dumbo programs, do:
$ wget http://peak.telecommunity.com/dist/ez_setup.py
$ python ez_setup.py dumbo
Alternatively, you can also install Dumbo in a virtual Python environment:
$ virtualenv env
$ env/bin/easy_install dumbo
Dumbo Version 0.20
As part of Hadoop (mandatory)
To build Dumbo, you just have to add it to the src/contrib directory of Hadoop (version 0.18) and build Hadoop:
$ wget http://github.com/klbostee/dumbo/tarball/release-0.20.28 -O dumbo.tar.gz
$ tar zxvf dumbo.tar.gz
$ mv klbostee-dumbo* $HADOOP_HOME/src/contrib/dumbo
$ cd $HADOOP_HOME
$ ant package
This should generate a Hadoop build in build/ that contains a contrib/dumbo directory:
$ ls build/hadoop-*/contrib/dumbo
bin examples lib
The shell script example in the subdirectory bin/ runs the wordcount.py example on Hadoop.
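For orientation, a word-count program in the spirit of wordcount.py might look roughly like the sketch below. This is not a copy of the file shipped in examples/; it assumes the usual Dumbo convention of a generator-style mapper and reducer handed to `dumbo.run`, so check the bundled example for the exact form your version expects.

```python
def mapper(key, value):
    # `value` is one line of input text; emit (word, 1) for every word.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # `values` iterates over all counts emitted for the word `key`.
    yield key, sum(values)


if __name__ == "__main__":
    try:
        import dumbo
    except ImportError:
        print("dumbo is not installed; see the installation steps above")
    else:
        # Assumed entry point: hands the mapper and reducer to Dumbo.
        dumbo.run(mapper, reducer)
```

Keeping the `dumbo` import inside the main guard means the mapper and reducer can be imported and exercised on their own, without Dumbo or Hadoop present.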
As a Python module (optional)
You can also install Dumbo as a Python module on your system:
$ cd $HADOOP_HOME/src/contrib/dumbo
$ sudo ant install_pymod
This additional installation step is not required, but we do recommend it because it allows you to run programs locally using UNIX pipes, which can be very useful for debugging. The dumbo command that this step adds to /usr/bin can be used in the same way as $HADOOP_HOME/build/hadoop-*/contrib/dumbo/bin/dumbo. The only difference is that it needs an additional -hadoop <path_to_hadoop_dir> option to run a program on Hadoop. The same command can therefore be used to run programs on different Hadoop clusters, and if you omit the -hadoop option, the Dumbo program runs locally using UNIX pipes.
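To make the idea of a local run concrete: a pipe-based run amounts to a map step, a sort, and a reduce step chained together. The sketch below imitates that flow in a single Python process. It illustrates the concept only and is not how Dumbo implements local runs; `run_locally` is a hypothetical helper, and the mapper and reducer are a generic word count.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    for word in value.split():
        yield word, 1


def reducer(key, values):
    yield key, sum(values)


def run_locally(lines, mapper, reducer):
    """Imitate `map | sort | reduce` over an iterable of text lines."""
    # Map: the line number stands in for the input key.
    mapped = []
    for lineno, line in enumerate(lines):
        mapped.extend(mapper(lineno, line))
    # Sort: bring equal keys together, as `sort` does between the pipes.
    mapped.sort(key=itemgetter(0))
    # Reduce: call the reducer once per distinct key.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (count for _, count in group)))
    return results
```

Running a program through a harness like this on a few lines of sample input is a quick way to catch logic errors before submitting the job to a cluster.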






