java.lang.Object
  org.apache.hadoop.conf.Configured
    com.ebay.erl.mobius.core.MobiusJob
public abstract class MobiusJob
Main class of the Mobius API. Extend this class to create a Mobius data-processing flow.

This product is licensed under the Apache License, Version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0. This product contains portions derived from Apache Hadoop, which is licensed under the Apache License, Version 2.0, available at http://hadoop.apache.org.

© 2007 – 2012 eBay Inc., Evan Chiu, Woody Zhou, Neel Sundaresan
Constructor Summary

  MobiusJob()
Method Summary

  protected void addToExecQueue(org.apache.hadoop.conf.Configuration aNewJobConf)
      Add a job, represented by the aNewJobConf object, to the execution queue.

  org.apache.hadoop.conf.Configuration getConf()
      Return the Hadoop job configuration.

  protected org.apache.hadoop.fs.FileSystem getFS()

  protected java.lang.String getPathOnly(java.lang.String uriStr)
      Return only the "path" part of the input URI.

  GroupByConfigure group(Dataset aDataset)
      Start a group-by job.

  JoinOnConfigure innerJoin(Dataset... datasets)
      Perform an inner join on the given datasets.

  boolean isOutputOfAnotherJob(org.apache.hadoop.fs.Path input)
      Test whether the given input is the output of another job.

  boolean isOutputOfAnotherJob(java.lang.String input)
      Test whether the given input is the output of another job.

  JoinOnConfigure leftOuterJoin(Dataset left, Dataset right)
      Perform a "left outer join": the result contains all records of the left Dataset (the 1st argument), with or without a match in the right Dataset.

  JoinOnConfigure leftOuterJoin(Dataset left, Dataset right, java.lang.Object nullReplacement)
      Perform a "left outer join": the result contains all records of the left Dataset (the 1st argument), with or without a match in the right Dataset.

  Dataset list(Dataset dataset, Column... columns)
      Select the columns from the dataset.

  Dataset list(Dataset dataset, org.apache.hadoop.fs.Path outputFolder, java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> outputFormat, Column... columns)
      Select the columns from the dataset and store the result in outputFolder using the given outputFormat.

  Dataset list(Dataset dataset, org.apache.hadoop.fs.Path outputFolder, Column... columns)
      Select the columns from the dataset and store the result in outputFolder.

  org.apache.hadoop.fs.Path newTempPath()
      Create an empty folder under hadoop.tmp.dir.

  JoinOnConfigure rightOuterJoin(Dataset left, Dataset right)
      Perform a "right outer join": the result contains all records of the right Dataset (the 2nd argument), with or without a match in the left Dataset.

  JoinOnConfigure rightOuterJoin(Dataset left, Dataset right, java.lang.Object nullReplacement)
      Perform a "right outer join": the result contains all records of the right Dataset (the 2nd argument), with or without a match in the left Dataset.

  SortProjectionConfigure sort(Dataset aDataset)
      Perform a total sort on aDataset.
Methods inherited from class org.apache.hadoop.conf.Configured
  setConf

Methods inherited from class java.lang.Object
  clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.util.Tool
  run

Methods inherited from interface org.apache.hadoop.conf.Configurable
  setConf
Constructor Detail

public MobiusJob()

Method Detail
public org.apache.hadoop.conf.Configuration getConf()
  Return the Hadoop job configuration. Note that this method creates a new Configuration from the default one on every call, so changes made to the returned Configuration will not affect the Configuration returned by the next call to getConf().
  Specified by:
    getConf in interface org.apache.hadoop.conf.Configurable
  Overrides:
    getConf in class org.apache.hadoop.conf.Configured
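The copy-on-read behavior described above can be illustrated with a short sketch (assumes a MobiusJob subclass; the property name is arbitrary):

```java
public class ConfDemo extends MobiusJob
{
	public void run(String[] args) throws Exception
	{
		// Each call to getConf() creates a fresh Configuration from the defaults.
		org.apache.hadoop.conf.Configuration conf1 = this.getConf();
		conf1.set("mapred.reduce.tasks", "10"); // affects only conf1

		org.apache.hadoop.conf.Configuration conf2 = this.getConf();
		// conf2 was built from the defaults again, so it does not see
		// the "mapred.reduce.tasks" value that was set on conf1.
	}
}
```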
public boolean isOutputOfAnotherJob(org.apache.hadoop.fs.Path input)
  Test whether the given input is the output of another job.
  Parameters:
    input - input path of a job.
  Returns:
    true if input is the output path of another job, false otherwise.

public boolean isOutputOfAnotherJob(java.lang.String input)
  Test whether the given input is the output of another job.
  Parameters:
    input - input path of a job.
  Returns:
    true if input is the output path of another job, false otherwise.

public Dataset list(Dataset dataset, org.apache.hadoop.fs.Path outputFolder, java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> outputFormat, Column... columns) throws java.io.IOException
  Select the columns from the dataset and store the result in outputFolder using the given outputFormat.
Here is an example:
public class MyJob extends MobiusJob
{
	public void run(String[] args) throws Exception
	{
		Dataset students = ...;

		// save the result to $OUTPUT in SequenceFileOutputFormat,
		// the key will be NullWritable, and the value is a Tuple
		// which contains 3 columns: id, f_name and l_name.
		this.list(students,
			new Path("$OUTPUT"),
			SequenceFileOutputFormat.class,
			new Column(students, "id"),
			new Column(students, "f_name"),
			new Column(students, "l_name")
		);
	}

	public static void main(String[] args) throws Exception
	{
		System.exit(MobiusJobRunner.run(new MyJob(), args));
	}
}
  Throws:
    java.io.IOException
public Dataset list(Dataset dataset, org.apache.hadoop.fs.Path outputFolder, Column... columns) throws java.io.IOException
  Select the columns from the dataset and store the result in outputFolder. The output format is TextOutputFormat.
Here is an example:
public class MyJob extends MobiusJob
{
	public void run(String[] args) throws Exception
	{
		Dataset students = ...;

		// save the result to $OUTPUT in TextOutputFormat,
		// output will be tab-delimited files with 3 columns:
		// id, f_name and l_name.
		//
		// To change the delimiter, pass -Dmobius.tuple.tostring.delimiter=YOUR_DELIMITER
		// when submitting the job on the command line.
		this.list(students,
			new Path("$OUTPUT"),
			new Column(students, "id"),
			new Column(students, "f_name"),
			new Column(students, "l_name")
		);
	}

	public static void main(String[] args) throws Exception
	{
		System.exit(MobiusJobRunner.run(new MyJob(), args));
	}
}
  Throws:
    java.io.IOException
public Dataset list(Dataset dataset, Column... columns) throws java.io.IOException
  Select the columns from the dataset. The output path is a temporary path under hadoop.tmp.dir, and the output format is SequenceFileOutputFormat.
Here is an example:
public class MyJob extends MobiusJob
{
	public void run(String[] args) throws Exception
	{
		Dataset students = ...;

		this.list(students,
			new Column(students, "id"),
			new Column(students, "f_name"),
			new Column(students, "l_name")
		);
	}

	public static void main(String[] args) throws Exception
	{
		System.exit(MobiusJobRunner.run(new MyJob(), args));
	}
}
  Throws:
    java.io.IOException
public JoinOnConfigure leftOuterJoin(Dataset left, Dataset right, java.lang.Object nullReplacement) throws java.io.IOException
  Perform a "left outer join": the result contains all records of the left Dataset (the 1st argument), with or without a match in the right Dataset.
  If a join group contains no records from the right Dataset (the 2nd argument), then by default null (if the output format is SequenceFileOutputFormat) or an empty string (if the output format is TextOutputFormat) is written for the selected columns from the right Dataset. If nullReplacement is not null, it is used instead as the value for the columns from the right Dataset whenever a join group has no match.
  Composing a leftOuterJoin job is almost the same as composing an innerJoin(Dataset...) job; simply call leftOuterJoin(Dataset, Dataset, Object) instead of innerJoin.
  Parameters:
    left - left-hand side Dataset
    right - right-hand side Dataset
    nullReplacement - the value used for columns that would otherwise be null; it must be a type supported by Tuple
  Throws:
    java.io.IOException
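Composing a left outer join follows the same pattern as the innerJoin example elsewhere on this page; here is a sketch (the datasets, columns, and the "N/A" replacement value are illustrative, not part of the Mobius distribution):

```java
public class MyLeftJoinJob extends MobiusJob
{
	public void run(String[] args) throws Exception
	{
		Dataset students = ...; // left: all students appear in the result
		Dataset courses  = ...; // right: matched where possible

		this
			.leftOuterJoin(students, courses, "N/A") // "N/A" fills unmatched right-side columns
			.on( new EQ(new Column(students, "student_id"), new Column(courses, "student_id")) )
			.save(this, new Path("$OUTPUT"),
				new Column(students, "student_id"),
				new Column(courses, "c_title") // "N/A" when a student has no matching course
			);
	}

	public static void main(String[] args) throws Exception
	{
		System.exit(MobiusJobRunner.run(new MyLeftJoinJob(), args));
	}
}
```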
public JoinOnConfigure leftOuterJoin(Dataset left, Dataset right) throws java.io.IOException
  Perform a "left outer join": the result contains all records of the left Dataset (the 1st argument), with or without a match in the right Dataset.
  If a join group contains no records from the right Dataset (the 2nd argument), then by default null (if the output format is SequenceFileOutputFormat) or an empty string (if the output format is TextOutputFormat) is written for the selected columns from the right Dataset.
  Composing a leftOuterJoin job is almost the same as composing an innerJoin(Dataset...) job; simply call leftOuterJoin(Dataset, Dataset) instead of innerJoin.
  Parameters:
    left - left-hand side Dataset
    right - right-hand side Dataset
  Throws:
    java.io.IOException
public JoinOnConfigure rightOuterJoin(Dataset left, Dataset right, java.lang.Object nullReplacement) throws java.io.IOException
  Perform a "right outer join": the result contains all records of the right Dataset (the 2nd argument), with or without a match in the left Dataset.
  If a join group contains no records from the left Dataset (the 1st argument), then by default null (if the output format is SequenceFileOutputFormat) or an empty string (if the output format is TextOutputFormat) is written for the selected columns from the left Dataset. If nullReplacement is not null, it is used instead as the value for the columns from the left Dataset whenever a join group has no match.
  Composing a rightOuterJoin job is almost the same as composing an innerJoin(Dataset...) job; simply call rightOuterJoin(Dataset, Dataset, Object) instead of innerJoin.
  Parameters:
    left - left-hand side Dataset
    right - right-hand side Dataset
    nullReplacement - the value used for columns that would otherwise be null; it must be a type supported by Tuple
  Throws:
    java.io.IOException
public JoinOnConfigure rightOuterJoin(Dataset left, Dataset right) throws java.io.IOException
  Perform a "right outer join": the result contains all records of the right Dataset (the 2nd argument), with or without a match in the left Dataset.
  If a join group contains no records from the left Dataset (the 1st argument), then by default null (if the output format is SequenceFileOutputFormat) or an empty string (if the output format is TextOutputFormat) is written for the selected columns from the left Dataset.
  Composing a rightOuterJoin job is almost the same as composing an innerJoin(Dataset...) job; simply call rightOuterJoin(Dataset, Dataset) instead of innerJoin.
  Parameters:
    left - left-hand side Dataset
    right - right-hand side Dataset
  Throws:
    java.io.IOException
public JoinOnConfigure innerJoin(Dataset... datasets)
  Perform an inner join on the given datasets. The number of datasets must be >= 2.
  More than two Datasets can be joined at once only if the datasets share a join key, i.e., they have columns with the same meaning; the names of the columns do not have to match, but the contents (values) of the columns must.
  From the performance perspective, the biggest dataset should be placed rightmost, where size is measured by the number of values per join key, NOT by the total number of records in a dataset.
  Here is an example of how to create an inner join job:
public class MyJob extends MobiusJob
{
public void run(String[] args) throws Exception
{
Dataset students = ...;
Dataset courses = ...;
this
.innerJoin(students, courses)
.on( new EQ(new Column(students, "student_id"), new Column(courses, "student_id")) )
.save(this, new Path("$OUTPUT"),
new Column(students, "student_id"),
new Column(students, "f_name"),
new Column(students, "l_name"),
new Column(courses, "c_title")
);
}
public static void main(String[] args) throws Exception
{
System.exit(MobiusJobRunner.run(new MyJob(), args));
}
}
public GroupByConfigure group(Dataset aDataset)
  Start a group-by job: group the given aDataset by certain column(s), to be specified via the returned GroupByConfigure.
  Here is an example of a group-by job:
public class MyJob extends MobiusJob
{
public void run(String[] args) throws Exception
{
.....
this
.group(order)
.by(new Column(order, "order_person_id"))
.save(this,
new Path("$OUTPUT_PATH"),
new Column(order, "order_person_id"),
new Max(new Column(order, "order_id")));
}
public static void main(String[] args) throws Exception
{
System.exit(MobiusJobRunner.run(new MyJob(), args));
}
}
public SortProjectionConfigure sort(Dataset aDataset) throws java.io.IOException
  Perform a total sort on aDataset. After the job has finished, concatenating the output files yields values sorted according to the given Sorter.
  Here is an example of how to start a sort job:
public class MyJob extends MobiusJob
{
	public void run(String[] args) throws Exception
	{
		.....

		this
			.sort(person)
			.select(
				new Column(person, "age"),
				new Column(person, "gender"),
				new Column(person, "fname"),
				new Column(person, "lname"))
			.orderBy(
				new Sorter(new Column(person, "age"), Ordering.ASC, true),
				new Sorter(new Column(person, "gender"), Ordering.DESC, true))
			.save(
				this,
				new Path("$OUTPUT")
			);
	}

	public static void main(String[] args) throws Exception
	{
		System.exit(MobiusJobRunner.run(new MyJob(), args));
	}
}
  Throws:
    java.io.IOException
protected org.apache.hadoop.fs.FileSystem getFS()
public org.apache.hadoop.fs.Path newTempPath() throws java.io.IOException
  Create an empty folder under hadoop.tmp.dir.
  Throws:
    java.io.IOException
protected void addToExecQueue(org.apache.hadoop.conf.Configuration aNewJobConf) throws java.io.IOException
  Add a job, represented by the aNewJobConf object, to the execution queue.
  Users can use this method to add one or more job configurations to the job queue; the Mobius engine analyzes the queued configurations to determine the dependencies between jobs. For example, if job B's input is the output of job A, then job B will not be submitted until A completes successfully, and if A fails, B will not be submitted at all.
  Parameters:
    aNewJobConf - a Configuration object representing a Hadoop job.
  Throws:
    java.io.IOException
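A sketch of queueing a hand-built Hadoop job from inside a MobiusJob subclass (the paths and job details are illustrative; this assumes the classic org.apache.hadoop.mapred API already referenced on this page):

```java
public class MyChainedJob extends MobiusJob
{
	public void run(String[] args) throws Exception
	{
		// Suppose an earlier job in this flow writes to /tmp/stage1.
		// Build a plain Hadoop job that consumes that output:
		org.apache.hadoop.mapred.JobConf jobB = new org.apache.hadoop.mapred.JobConf(this.getConf());
		jobB.setJobName("post-process");
		org.apache.hadoop.mapred.FileInputFormat.addInputPath(jobB, new org.apache.hadoop.fs.Path("/tmp/stage1"));
		org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(jobB, new org.apache.hadoop.fs.Path("/tmp/stage2"));

		// Mobius analyzes queued configurations for dependencies: jobB reads
		// /tmp/stage1, so it is submitted only after its producer succeeds.
		this.addToExecQueue(jobB);
	}
}
```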
protected java.lang.String getPathOnly(java.lang.String uriStr)
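This page does not show getPathOnly's implementation, but its documented contract (return only the "path" part of the input URI) can be sketched with the JDK's java.net.URI; the helper below is an illustration, not the actual Mobius code:

```java
import java.net.URI;

public class PathOnlyDemo
{
	// Illustrative re-implementation of the documented contract:
	// strip the scheme and authority, keeping only the path.
	static String pathOnly(String uriStr)
	{
		return URI.create(uriStr).getPath();
	}

	public static void main(String[] args)
	{
		System.out.println(pathOnly("hdfs://namenode:9000/user/data/input")); // prints /user/data/input
	}
}
```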