com.intel.hadoop.graphbuilder.job
Class AbstractPreprocessJob<VidType extends org.apache.hadoop.io.WritableComparable<VidType>,VertexData extends org.apache.hadoop.io.Writable,EdgeData extends org.apache.hadoop.io.Writable>

java.lang.Object
  extended by com.intel.hadoop.graphbuilder.job.AbstractPreprocessJob<VidType,VertexData,EdgeData>
Type Parameters:
VidType - the vertex id type
VertexData - the vertex data type
EdgeData - the edge data type
Direct Known Subclasses:
CreateLinkGraph.Job, CreateWordCountGraph.Job, PreprocessJobTest.Job

public abstract class AbstractPreprocessJob<VidType extends org.apache.hadoop.io.WritableComparable<VidType>,VertexData extends org.apache.hadoop.io.Writable,EdgeData extends org.apache.hadoop.io.Writable>
extends java.lang.Object

An abstract wrapper class for running the Preprocessing Job, which creates a graph from the raw input data. See an example in PreprocessJobTest.

The user first needs a GraphTokenizer and an InputFormat specific to the input data. The InputFormat generates a single input record from the raw data, and the GraphTokenizer extracts a list of vertices and edges from each record produced by the InputFormat. For example, to create a link graph from a Wikipedia XML dump, WikiPageInputFormat splits the file at the opening and closing "page" tags and passes the string in between, as one "page", to the LinkGraphTokenizer, which then extracts the title of the page as the vertex and its links as the edges.
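
The division of labor between the InputFormat and the GraphTokenizer can be illustrated with a minimal, library-free sketch. It is not the WikiPageInputFormat or LinkGraphTokenizer implementation: the class LinkGraphSketch, its method names, and the regular expressions are hypothetical, and it assumes that page titles sit in a "title" element and that links use the [[Target]] or [[Target|label]] wiki syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in; none of these names belong to GraphBuilder. */
final class LinkGraphSketch {
    // Assumed markup: <page>...</page> records, <title>...</title>, [[wiki links]].
    private static final Pattern PAGE = Pattern.compile("<page>(.*?)</page>", Pattern.DOTALL);
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL);
    private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|#]+)");

    /** Role of the InputFormat: emit one record per "page" element. */
    static List<String> splitPages(String dump) {
        List<String> pages = new ArrayList<>();
        Matcher m = PAGE.matcher(dump);
        while (m.find()) {
            pages.add(m.group(1));
        }
        return pages;
    }

    /** Role of the tokenizer, part 1: the page title becomes the vertex. */
    static String vertexOf(String page) {
        Matcher m = TITLE.matcher(page);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Role of the tokenizer, part 2: each link becomes a (source, target) edge. */
    static List<String[]> edgesOf(String page) {
        List<String[]> edges = new ArrayList<>();
        String source = vertexOf(page);
        if (source == null) {
            return edges; // a page without a title contributes no edges
        }
        Matcher m = LINK.matcher(page);
        while (m.find()) {
            edges.add(new String[] {source, m.group(1).trim()});
        }
        return edges;
    }

    public static void main(String[] args) {
        String dump = "<page><title>A</title><text>see [[B]] and [[C|c]]</text></page>"
                    + "<page><title>B</title><text>[[A]]</text></page>";
        for (String page : splitPages(dump)) {
            for (String[] e : edgesOf(page)) {
                System.out.println(e[0] + " -> " + e[1]);
            }
        }
        // prints:
        // A -> B
        // A -> C
        // B -> A
    }
}
```

In the real job the records are produced on the Hadoop side and the vertices and edges are Writable objects; plain strings are used here only to keep the sketch self-contained.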

Next, the user needs to override three functions. The first two, vertexReducer() and edgeReducer(), are applied to duplicate vertices and duplicate edges respectively. Either can return null, in which case only the first instance of each duplicate object is kept. The third, cleanBidirectionalEdge(), chooses whether to keep or discard the bidirectional edges in the graph.
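
The effect of a reducer on duplicates, including the null case, can be sketched without Hadoop as follows. This is an illustration under stated assumptions, not the job's implementation: java.util.function.BinaryOperator stands in for Functional, and the class DuplicateReduceSketch, its Entry type, and the collapse method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical stand-in; BinaryOperator plays the role of a Functional. */
final class DuplicateReduceSketch {

    /** A keyed record, e.g. a vertex id together with its vertex data. */
    static final class Entry<K, V> {
        final K key;
        final V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Collapses records that share a key. With a null reducer only the first
     * instance of each duplicate is kept; otherwise duplicates are folded
     * into the value seen so far.
     */
    static <K, V> Map<K, V> collapse(List<Entry<K, V>> records, BinaryOperator<V> reducer) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Entry<K, V> r : records) {
            if (!out.containsKey(r.key)) {
                out.put(r.key, r.value);
            } else if (reducer != null) {
                out.put(r.key, reducer.apply(out.get(r.key), r.value));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry<String, Integer>> vertices = new ArrayList<>();
        vertices.add(new Entry<>("A", 1));
        vertices.add(new Entry<>("B", 5));
        vertices.add(new Entry<>("A", 2));

        System.out.println(collapse(vertices, null));         // prints {A=1, B=5}
        System.out.println(collapse(vertices, Integer::sum)); // prints {A=3, B=5}
    }
}
```

A summing reducer of this kind suits count-like data such as word counts; returning null is enough when duplicates carry no extra information.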

Additional options can be added to the job configuration (JobConf) by calling addUserOpt. Functionals can read these options in configure(JobConf).
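
The round trip of a user option, set once on the job and read later by a functional when it is configured, might look like the following sketch. Everything in it is a stand-in: Conf replaces JobConf with a plain map, CappedSum is a hypothetical reducer, and the option key "sketch.cap" is invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in; Conf replaces JobConf so the sketch runs without Hadoop. */
final class UserOptSketch {

    /** Minimal key/value configuration, the role JobConf plays in the real job. */
    static final class Conf {
        private final Map<String, String> values = new HashMap<>();

        void set(String key, String value) {
            values.put(key, value);
        }

        String get(String key, String defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    /** A reducer-like object that reads its option when it is configured. */
    static final class CappedSum {
        private int cap = Integer.MAX_VALUE; // no cap until configured

        void configure(Conf conf) {
            cap = Integer.parseInt(conf.get("sketch.cap", String.valueOf(Integer.MAX_VALUE)));
        }

        int reduce(int a, int b) {
            return Math.min(cap, a + b);
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("sketch.cap", "10"); // the step addUserOpt(key, value) performs on the job

        CappedSum reducer = new CappedSum();
        reducer.configure(conf);
        System.out.println(reducer.reduce(7, 8)); // prints 10
    }
}
```

Options travel as strings in both directions, so a functional has to parse numeric values itself and should supply a default for keys that were never set.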

Input directories contain any raw input data. Output directories:

See Also:
CreateGraphMR, PreprocessJobTest, XMLInputFormat, WikiPageInputFormat, LinkGraphTokenizer, Functional

Constructor Summary
AbstractPreprocessJob()
           
 
Method Summary
 void addUserOpt(java.lang.String key, java.lang.String value)
           
abstract  boolean cleanBidirectionalEdge()
           
abstract  Functional<EdgeData,EdgeData> edgeReducer()
           
 boolean run(GraphTokenizer<VidType,VertexData,EdgeData> tokenizer, org.apache.hadoop.mapred.InputFormat inputformat, java.lang.String[] inputs, java.lang.String output)
           
abstract  Functional<VertexData,VertexData> vertexReducer()
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AbstractPreprocessJob

public AbstractPreprocessJob()
Method Detail

vertexReducer

public abstract Functional<VertexData,VertexData> vertexReducer()

edgeReducer

public abstract Functional<EdgeData,EdgeData> edgeReducer()

cleanBidirectionalEdge

public abstract boolean cleanBidirectionalEdge()

addUserOpt

public void addUserOpt(java.lang.String key,
                       java.lang.String value)

run

public boolean run(GraphTokenizer<VidType,VertexData,EdgeData> tokenizer,
                   org.apache.hadoop.mapred.InputFormat inputformat,
                   java.lang.String[] inputs,
                   java.lang.String output)
            throws javassist.CannotCompileException,
                   javassist.NotFoundException,
                   java.io.IOException
Throws:
javassist.CannotCompileException
javassist.NotFoundException
java.io.IOException