Mapducer
Background
Mapducer is a MapReduce implementation for (actual) clusters. There are lots of implementations out there (Hadoop, SkyNet, etc.) for grid computing, but none for a truly clustered environment. When you already have a job scheduler and the rest of the cluster infrastructure, all you need is a simple program that creates some child processes and starts mapping and reducing; that's all mapducer does. You can also run mapducer on a single machine (logically, a cluster is just one giant machine).
Since this is a reduction of the full, general implementations of MapReduce, the "reduce" has been reduced to "duce", hence "mapducer".
Use It
First, read the MapReduce paper from Google. Next, learn Python. Then, check out the repository:
svn co https://projects.dbbe.musc.edu/public/mapducer/trunk mapducer
Take a look at the example in the trunk directory. The only function you need is the "mapducer" function:
mapducer(mapfunc, reducefunc, inputdir, outputdir, numkids, outputformat = "%s %s", usedb = False)
Where:
- mapfunc is the mapping function. It accepts a key (usually the path of the file being mapped) and a file handle, and should yield (key, value) tuples of strings (note: yield, not return).
- reducefunc is the reduce function. It accepts a string key and an array of string values, and returns the reduced value as a string.
- inputdir is the directory containing input files. They will be assigned evenly to the processes.
- outputdir is the directory for the output files (one per child process).
- numkids is the number of child processes to create (based on the size of your cluster).
- outputformat is an optional parameter that specifies the format of the reduced key/value pairs that should be written to the output files.
- usedb is an optional parameter that tells mapducer to use a single-file BDB database as a cache of mapped key/value array pairs rather than storing them in memory. This is only a good idea if your files are huge, since it decreases speed by a few orders of magnitude.
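As a rough illustration of the shapes described above, here is a word-count sketch: a mapfunc that yields (word, "1") string tuples and a reducefunc that returns the total as a string. The run_locally helper is hypothetical and is not part of mapducer; it is only a single-process stand-in so the two functions can be tried without a checkout. It also assumes that outputformat is applied to each (key, value) pair with the % operator, which the default "%s %s" suggests but the text above does not confirm.

```python
import io
from collections import defaultdict


def mapfunc(key, handle):
    """Yield (word, "1") for every whitespace-separated word in the file.

    key is the path of the file being mapped (unused here);
    handle is an open, line-iterable file handle.
    """
    for line in handle:
        for word in line.split():
            yield word.lower(), "1"


def reducefunc(key, values):
    """Sum the string counts collected for one word; return a string."""
    return str(sum(int(v) for v in values))


def run_locally(mapfunc, reducefunc, inputs, outputformat="%s %s"):
    """Hypothetical stand-in for illustration only, NOT mapducer itself.

    inputs maps a file name to its text. Values are grouped by key in
    memory, reduced, and formatted with outputformat % (key, value),
    which is an assumption about how mapducer uses outputformat.
    """
    grouped = defaultdict(list)
    for name, text in inputs.items():
        for k, v in mapfunc(name, io.StringIO(text)):
            grouped[k].append(v)
    return [outputformat % (k, reducefunc(k, grouped[k])) for k in sorted(grouped)]


lines = run_locally(
    mapfunc,
    reducefunc,
    {"a.txt": "the cat\nthe dog", "b.txt": "The end"},
)
print("\n".join(lines))
```

With mapducer itself, the same two functions would be passed as the first two arguments of the call shown earlier, followed by inputdir, outputdir and numkids. How the module is imported depends on the checkout and is not shown here.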
