com.intel.hadoop.graphbuilder.job
Class AbstractPreprocessJob<VidType extends org.apache.hadoop.io.WritableComparable<VidType>,VertexData extends org.apache.hadoop.io.Writable,EdgeData extends org.apache.hadoop.io.Writable>
java.lang.Object
com.intel.hadoop.graphbuilder.job.AbstractPreprocessJob<VidType,VertexData,EdgeData>
- Type Parameters:
- VidType, VertexData, EdgeData
- Direct Known Subclasses:
- CreateLinkGraph.Job, CreateWordCountGraph.Job, PreprocessJobTest.Job
public abstract class AbstractPreprocessJob<VidType extends org.apache.hadoop.io.WritableComparable<VidType>,VertexData extends org.apache.hadoop.io.Writable,EdgeData extends org.apache.hadoop.io.Writable>
- extends java.lang.Object
An abstract wrapper class for running the Preprocessing Job, which creates a
graph from the raw input data. See PreprocessJobTest for an example.
The user first needs a GraphTokenizer and an InputFormat specific to the
input data. The InputFormat generates a single input record from the raw
data, and the GraphTokenizer extracts a list of vertices and Edge objects
from each record supplied by the InputFormat. For example, to create a link
graph from a Wikipedia XML dump, WikiPageInputFormat splits the file at the
opening and closing "page" tags and passes the string in between as a "page"
to the LinkGraphTokenizer, which then extracts the title of the page as the
vertex and the links as the edges.
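The tokenizing step in the Wikipedia example can be illustrated in plain Java. This is a minimal sketch and not the actual LinkGraphTokenizer: the class name PageTokenizerSketch, the regular expressions, and the assumed markup (a <title> element and [[Target]] style links) are all hypothetical, and no Hadoop or GraphBuilder types are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a tokenizer; not part of GraphBuilder. */
final class PageTokenizerSketch {
    // Assumed markup: the title sits in a <title> element.
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL);
    // Assumed link syntax: [[Target]] or [[Target|label]]; only Target is kept.
    private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|#]+)");

    /** Returns the page title (the vertex), or null if the page has none. */
    static String vertex(String page) {
        Matcher m = TITLE.matcher(page);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Returns one {source, target} pair (an edge) per link on the page. */
    static List<String[]> edges(String page) {
        List<String[]> out = new ArrayList<>();
        String source = vertex(page);
        if (source == null) {
            return out; // without a title there is no source vertex
        }
        Matcher m = LINK.matcher(page);
        while (m.find()) {
            out.add(new String[] {source, m.group(1).trim()});
        }
        return out;
    }
}
```

Here the "page" string plays the role of the single record that the InputFormat hands to the tokenizer.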
Next, the user needs to override three functions. The first two,
vertexReducer() and edgeReducer(), are applied to duplicate vertices and
edges. Either can return null, in which case only the first instance of the
duplicate objects remains. The third function to override is
cleanBidirectionalEdge(), which is the option to keep or discard the
bi-directional edges in the graph.
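The duplicate-handling rule described above can be sketched outside Hadoop as follows. This is only an illustration under stated assumptions: DuplicateReducerSketch is a hypothetical name, java.util.function.BinaryOperator stands in for the Functional returned by vertexReducer() or edgeReducer(), and an in-memory list stands in for the job's records.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical illustration of merging duplicates; not GraphBuilder code. */
final class DuplicateReducerSketch {
    /**
     * Collapses records that share a key. With a null reducer only the first
     * value seen for each key is kept; otherwise duplicates are folded into
     * the stored value with the reducer.
     */
    static <K, V> Map<K, V> reduce(List<? extends Map.Entry<K, V>> records,
                                   BinaryOperator<V> reducer) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> r : records) {
            if (!out.containsKey(r.getKey())) {
                out.put(r.getKey(), r.getValue());      // first instance
            } else if (reducer != null) {
                V merged = reducer.apply(out.get(r.getKey()), r.getValue());
                out.put(r.getKey(), merged);            // fold the duplicate
            }
            // reducer == null: the duplicate is dropped
        }
        return out;
    }
}
```

For example, passing Integer::sum as the reducer adds up the values of duplicate keys, while passing null keeps the first value for each key.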
Additional options can be added to the JobConf by calling addUserOpt. A
Functional can then read an option in its configure(JobConf) method.
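The flow of a user option from addUserOpt to a reducer might look like the sketch below. It is not the library's implementation: UserOptSketch, CappedSum, newReducer and the option key "sketch.cap" are hypothetical, a plain Map<String, String> stands in for JobConf, and BinaryOperator stands in for Functional.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical illustration of passing options to a reducer. */
final class UserOptSketch {
    // Stand-in for the job configuration: string key/value pairs.
    private final Map<String, String> conf = new HashMap<>();

    /** Same shape as addUserOpt(String key, String value). */
    void addUserOpt(String key, String value) {
        conf.put(key, value);
    }

    /** A reducer that reads an optional cap from the configuration. */
    static final class CappedSum implements BinaryOperator<Integer> {
        private int cap = Integer.MAX_VALUE;

        /** Stand-in for configure(JobConf): picks up the user option. */
        void configure(Map<String, String> conf) {
            String raw = conf.get("sketch.cap"); // hypothetical option key
            if (raw != null) {
                cap = Integer.parseInt(raw);
            }
        }

        @Override
        public Integer apply(Integer a, Integer b) {
            return Math.min(cap, a + b);
        }
    }

    /** Creates the reducer and configures it from the stored options. */
    CappedSum newReducer() {
        CappedSum reducer = new CappedSum();
        reducer.configure(conf);
        return reducer;
    }
}
```

The point of the pattern is that options are set once on the job and each function object picks up what it needs when it is configured.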
Input directories contain any raw input data. Output directories:
- $outputdir/edata: list of edges
- $outputdir/vdata: list of vertices
- See Also:
- CreateGraphMR, PreprocessJobTest, XMLInputFormat, WikiPageInputFormat, LinkGraphTokenizer, Functional
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
AbstractPreprocessJob
public AbstractPreprocessJob()
vertexReducer
public abstract Functional<VertexData,VertexData> vertexReducer()
edgeReducer
public abstract Functional<EdgeData,EdgeData> edgeReducer()
cleanBidirectionalEdge
public abstract boolean cleanBidirectionalEdge()
addUserOpt
public void addUserOpt(java.lang.String key, java.lang.String value)
run
public boolean run(GraphTokenizer<VidType,VertexData,EdgeData> tokenizer,
org.apache.hadoop.mapred.InputFormat inputformat,
java.lang.String[] inputs,
java.lang.String output)
throws javassist.CannotCompileException,
javassist.NotFoundException,
java.io.IOException
- Throws:
javassist.CannotCompileException
javassist.NotFoundException
java.io.IOException