Map and reduce have made a comeback since Google published their paper on how they do distributed computing.
Can we do it in Java?
For starters, there are no standard map and reduce implementations in Java as there are in ML. Note that map here is not the Map interface, but the equivalent of the Lisp mapcar function.
Here is my attempt to implement map and reduce in Java:
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
// A one-argument function object: the "block" passed to map.
interface Applicable
{
    Object apply( Object ob );
}

// A two-argument function object: the combining operation passed to reduce.
interface Reductible
{
    Object oper( Object ob1, Object ob2 );
}

interface Mapping
{
    List map( List list, Applicable app );
    Object reduce( List list, Object val, Reductible redux );
    Map map( Map map, Applicable app );       // applies app to both keys and values
    Map mapKeys( Map map, Applicable app );
    Map mapValues( Map map, Applicable app );
}
class SimpleMapping implements Mapping
{
    public List map( List list, Applicable app )
    {
        List ret = new ArrayList();
        for ( Iterator it = list.iterator(); it.hasNext(); )
        {
            Object value = it.next();
            Object newVal = app.apply( value );
            ret.add( newVal );
        }
        return ret;
    }

    public Object reduce( List list, Object val, Reductible redux )
    {
        Object ret = val;
        for ( Iterator it = list.iterator(); it.hasNext(); )
        {
            Object value = it.next();
            ret = redux.oper( ret, value );
        }
        return ret;
    }

    public Map map( Map map, Applicable app )
    {
        Map ret = new HashMap();
        for ( Iterator it = map.keySet().iterator(); it.hasNext(); )
        {
            Object key = it.next();
            Object value = map.get( key );
            Object newKey = app.apply( key );
            Object newVal = app.apply( value );
            ret.put( newKey, newVal );
        }
        return ret;
    }

    public Map mapKeys( Map map, Applicable app )
    {
        Map ret = new HashMap();
        for ( Iterator it = map.keySet().iterator(); it.hasNext(); )
        {
            Object key = it.next();
            Object value = map.get( key );
            Object newKey = app.apply( key );
            ret.put( newKey, value );
        }
        return ret;
    }

    public Map mapValues( Map map, Applicable app )
    {
        Map ret = new HashMap();
        for ( Iterator it = map.keySet().iterator(); it.hasNext(); )
        {
            Object key = it.next();
            Object value = map.get( key );
            Object newVal = app.apply( value );
            ret.put( key, newVal );
        }
        return ret;
    }
}
// Example
public class Example
{
    public void testMapping()
    {
        Mapping mapping = new SimpleMapping();
        List list = new ArrayList();
        list.add( new Integer( 1 ) );
        list.add( new Integer( 2 ) );
        list.add( new Integer( 3 ) );
        printList( list );
        list = mapping.map( list, new Applicable() {
            public Object apply( Object ob )
            {
                Integer val = (Integer) ob;
                val = new Integer( val.intValue() + 1 );
                return val;
            }
        } );
        printList( list );
    }

    public void testReduce()
    {
        Mapping mapping = new SimpleMapping();
        List list = new ArrayList();
        list.add( new Integer( 1 ) );
        list.add( new Integer( 2 ) );
        list.add( new Integer( 3 ) );
        printList( list );
        list = mapping.map( list, new Applicable() {
            public Object apply( Object ob )
            {
                Integer val = (Integer) ob;
                val = new Integer( val.intValue() + 1 );
                return val;
            }
        } );
        Integer zero = new Integer( 0 );
        Integer result = (Integer) mapping.reduce( list, zero, new Reductible() {
            public Object oper( Object ob1, Object ob2 )
            {
                Integer val1 = (Integer) ob1;
                Integer val2 = (Integer) ob2;
                Integer val3 = new Integer( val1.intValue() + val2.intValue() );
                return val3;
            }
        } );
        System.out.println( "sum = " + result );
        printList( list );
    }

    public void printList( List list )
    {
        Mapping mapping = new SimpleMapping();
        System.out.print( "[ " );
        mapping.map( list, new Applicable() {
            public Object apply( Object ob )
            {
                System.out.print( ob + " " );
                return ob;
            }
        } );
        System.out.println( "]" );
    }

    public static void main( String [] args )
    {
        Example example = new Example();
        example.testMapping();
        example.testReduce();
    }
}
As you can see, this is pretty... verbose. I doubt anyone would want to use this implementation because it is so cumbersome. First-class blocks in Java would help (which is what Java 7's closures are supposed to add); by now you can probably see why the word "closure" is a misnomer and "first-class block" would be the correct wording.
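To make the verbosity concrete, here is a minimal sketch of what the same map call could look like with first-class blocks, using the lambda syntax and java.util.function.UnaryOperator that eventually shipped in Java 8 (a stand-in for Applicable; the map helper below is my own, equivalent to SimpleMapping.map):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class LambdaSketch {
    // Hypothetical helper equivalent to Mapping.map, using a standard
    // functional interface instead of the hand-rolled Applicable.
    static <T> List<T> map(List<T> list, UnaryOperator<T> app) {
        List<T> ret = new ArrayList<>();
        for (T value : list) {
            ret.add(app.apply(value));
        }
        return ret;
    }

    public static void main(String[] args) {
        List<Integer> list = List.of(1, 2, 3);
        // The whole anonymous Applicable class shrinks to "val -> val + 1".
        List<Integer> incremented = map(list, val -> val + 1);
        System.out.println(incremented); // prints [2, 3, 4]
    }
}
```

The five-line anonymous class from the example above collapses into a single expression, which is the whole point of first-class blocks.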
Once that philosophical and etymological feature has been defined (and probably corrected), the most important thing about map/reduce is this: you want to write code in the map/reduce style and have it automatically scale into a distributed algorithm, with thousands of computers each doing a part of the calculation.
Let's start the design.
First, how do you code against an interface so that the caller can use either this SimpleMapping or a DistributedMapping (yet to be written)?
I've written an EjbProxy, which implements a "jump" from one JVM to the next using the EJB standard. EJB calls are rather expensive, so you only want to pay for one of those jumps when the data is too large to fit on one computer, or when the processing is so expensive (like finding all divisors of n, that is, every p and q such that p * q = n) that the search space is worth dividing among several computers.
What implementation would the EjbProxy call for the DistributedMapping?
It would call itself, but with a different range of values to search. Let us call that range the SearchContext. A DistributedMapping would then span several machines, each with its own search context.
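As a concrete sketch of the idea (the SearchContext class and divisorsIn method here are my own illustration, not a finished API): every worker runs the same trial-division loop over its own disjoint range of candidates, and the partial results are reduced by concatenation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical SearchContext: a half-open range [from, to) of candidate
// divisors assigned to one machine.
class SearchContext {
    final long from, to;
    SearchContext(long from, long to) { this.from = from; this.to = to; }
}

class DivisorSearch {
    // Each machine runs the same code over its own context: find every p
    // in [from, to) with n % p == 0, i.e. p * q = n for some integer q.
    static List<Long> divisorsIn(long n, SearchContext ctx) {
        List<Long> found = new ArrayList<>();
        for (long p = Math.max(1, ctx.from); p < ctx.to; p++) {
            if (n % p == 0) found.add(p);
        }
        return found;
    }

    public static void main(String[] args) {
        long n = 36;
        // Split the search space 1..n between two contexts, as two machines would.
        SearchContext left = new SearchContext(1, 19);
        SearchContext right = new SearchContext(19, 37);
        List<Long> all = new ArrayList<>(divisorsIn(n, left));
        all.addAll(divisorsIn(n, right)); // reduce step: concatenate partial results
        System.out.println(all); // divisors of 36: [1, 2, 3, 4, 6, 9, 12, 18, 36]
    }
}
```

Because the contexts are disjoint, no coordination is needed during the search; only the final concatenation brings the machines' results together.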
In other words we would have a HashMap on each computer saying:
SearchContext -> machine
And also each machine needs to know which SearchContext is its own:
machine -> SearchContext
So it would really be handy if we could have a bidirectional mapping:
SearchContext <-> machine
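Such a bidirectional mapping can be sketched with two HashMaps kept in sync; BiMap is my name for it here, not a standard class, and it assumes both keys and values are unique:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a bidirectional map: SearchContext -> machine and back.
// Assumes both keys and values are unique.
class BiMap<K, V> {
    private final Map<K, V> forward = new HashMap<>();
    private final Map<V, K> backward = new HashMap<>();

    // Insert the pair in both directions so the two lookups stay consistent.
    public void put(K key, V value) {
        forward.put(key, value);
        backward.put(value, key);
    }

    public V get(K key) { return forward.get(key); }         // SearchContext -> machine
    public K getKey(V value) { return backward.get(value); } // machine -> SearchContext

    public static void main(String[] args) {
        BiMap<String, String> assignment = new BiMap<>();
        assignment.put("range[1,19)", "machine-A");
        assignment.put("range[19,37)", "machine-B");
        System.out.println(assignment.get("range[1,19)"));  // machine-A
        System.out.println(assignment.getKey("machine-B")); // range[19,37)
    }
}
```

Each machine holds the same BiMap: it looks up its own SearchContext via getKey, and routes work to other machines via get.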
Friday, July 27, 2007