com.ebay.erl.mobius.core
Class MobiusJob

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by com.ebay.erl.mobius.core.MobiusJob
All Implemented Interfaces:
java.io.Serializable, org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public abstract class MobiusJob
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool, java.io.Serializable

Main class of the Mobius API. Extend this class to create a Mobius data processing flow. This product is licensed under the Apache License, Version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0. This product contains portions derived from Apache Hadoop, which is licensed under the Apache License, Version 2.0, available at http://hadoop.apache.org. © 2007 – 2012 eBay Inc., Evan Chiu, Woody Zhou, Neel Sundaresan

See Also:
Serialized Form

Constructor Summary
MobiusJob()
           
 
Method Summary
protected  void addToExecQueue(org.apache.hadoop.conf.Configuration aNewJobConf)
          Add a job, represented by the aNewJobConf object, to the execution queue.
 org.apache.hadoop.conf.Configuration getConf()
          Return the Hadoop job configuration.
protected  org.apache.hadoop.fs.FileSystem getFS()
           
protected  java.lang.String getPathOnly(java.lang.String uriStr)
          Return only the "path" part of the input URI.
 GroupByConfigure group(Dataset aDataset)
          Start a group-by job.
 JoinOnConfigure innerJoin(Dataset... datasets)
          Perform inner join on the given datasets.
 boolean isOutputOfAnotherJob(org.apache.hadoop.fs.Path input)
          Test whether the given input is the output of another job.
 boolean isOutputOfAnotherJob(java.lang.String input)
          Test whether the given input is the output of another job.
 JoinOnConfigure leftOuterJoin(Dataset left, Dataset right)
          Perform a "Left Outer Join"; the result contains all records of the left Dataset (the 1st argument), with or without a match in the right Dataset.
 JoinOnConfigure leftOuterJoin(Dataset left, Dataset right, java.lang.Object nullReplacement)
          Perform a "Left Outer Join"; the result contains all records of the left Dataset (the 1st argument), with or without a match in the right Dataset.
 Dataset list(Dataset dataset, Column... columns)
          Select the columns from the dataset.
 Dataset list(Dataset dataset, org.apache.hadoop.fs.Path outputFolder, java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> outputFormat, Column... columns)
          Select the columns from the dataset and store them in outputFolder using the given outputFormat.
 Dataset list(Dataset dataset, org.apache.hadoop.fs.Path outputFolder, Column... columns)
          Select the columns from the dataset and store them in outputFolder.
 org.apache.hadoop.fs.Path newTempPath()
          Create an empty folder under hadoop.tmp.dir.
 JoinOnConfigure rightOuterJoin(Dataset left, Dataset right)
          Perform a "Right Outer Join"; the result contains all records of the right Dataset (the 2nd argument), with or without a match in the left Dataset.
 JoinOnConfigure rightOuterJoin(Dataset left, Dataset right, java.lang.Object nullReplacement)
          Perform a "Right Outer Join"; the result contains all records of the right Dataset (the 2nd argument), with or without a match in the left Dataset.
 SortProjectionConfigure sort(Dataset aDataset)
          Perform a total sort on aDataset.
 
Methods inherited from class org.apache.hadoop.conf.Configured
setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.util.Tool
run
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
setConf
 

Constructor Detail

MobiusJob

public MobiusJob()
Method Detail

getConf

public org.apache.hadoop.conf.Configuration getConf()
Return the Hadoop job configuration.

Note that this method creates a new Configuration from the default one on every call, so changes made to the returned Configuration do not affect the Configuration returned by the next call to getConf().
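
A minimal sketch of this behavior, assuming it runs inside a MobiusJob subclass; the property name "my.example.flag" is made up purely for illustration:

 
 // e.g., inside run(String[] args) of a MobiusJob subclass
 Configuration first = this.getConf();
 first.set("my.example.flag", "true");   // set on the first copy only
 
 Configuration second = this.getConf();
 // "second" is a fresh copy of the defaults, so the change made on "first"
 // is not visible here: second.get("my.example.flag") returns null.
 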

Specified by:
getConf in interface org.apache.hadoop.conf.Configurable
Overrides:
getConf in class org.apache.hadoop.conf.Configured

isOutputOfAnotherJob

public boolean isOutputOfAnotherJob(org.apache.hadoop.fs.Path input)
Test whether the given input is the output of another job.

Parameters:
input - input path of a job.
Returns:
true if the input is the output path of another job, false otherwise.

isOutputOfAnotherJob

public boolean isOutputOfAnotherJob(java.lang.String input)
Test whether the given input is the output of another job.

Parameters:
input - input path of a job
Returns:
true if the input is the output path of another job, false otherwise.

list

public Dataset list(Dataset dataset,
                    org.apache.hadoop.fs.Path outputFolder,
                    java.lang.Class<? extends org.apache.hadoop.mapred.FileOutputFormat> outputFormat,
                    Column... columns)
             throws java.io.IOException
Select the columns from the dataset and store them in outputFolder using the given outputFormat.

Here is an example:

 
 public class MyJob extends MobiusJob
 {
        public void run(String[] args) throws Exception
        {
                Dataset students = ...;
                
                // save the result to $OUTPUT in SequenceFileOutputFormat,
                // the key will be NullWritable, and the value is a Tuple 
                // which contains 3 columns, id, f_name and l_name.
                this.list(students,
                        new Path("$OUTPUT"),
                        SequenceFileOutputFormat.class,
                        new Column(students, "id"),
                        new Column(students, "f_name"),
                        new Column(students, "l_name")
                ); 
        }
        
        public static void main(String[] args) throws Exception
        {
                System.exit(MobiusJobRunner.run(new MyJob(), args));
        }
 }
 
 

Throws:
java.io.IOException

list

public Dataset list(Dataset dataset,
                    org.apache.hadoop.fs.Path outputFolder,
                    Column... columns)
             throws java.io.IOException
Select the columns from the dataset and store them in outputFolder.

The output format is TextOutputFormat.

Here is an example:

 
 public class MyJob extends MobiusJob
 {
        public void run(String[] args) throws Exception
        {
                Dataset students = ...;
                
                // save the result to $OUTPUT in TextOutputFormat,
                // output will be tab delimited files with 3 columns,
                // id, f_name and l_name.
                //
                // To change the delimiter, put -Dmobius.tuple.tostring.delimiter=YOUR_DELIMITER
                // when submitting a job in command line. 
                this.list(students,
                        new Path("$OUTPUT"),                    
                        new Column(students, "id"),
                        new Column(students, "f_name"),
                        new Column(students, "l_name")
                ); 
        }
        
        public static void main(String[] args) throws Exception
        {
                System.exit(MobiusJobRunner.run(new MyJob(), args));
        }
 }
 
 

Throws:
java.io.IOException

list

public Dataset list(Dataset dataset,
                    Column... columns)
             throws java.io.IOException
Select the columns from the dataset.

The output path is a temporary path under hadoop.tmp.dir, and the output format is SequenceFileOutputFormat.

Here is an example:

 
 public class MyJob extends MobiusJob
 {
        public void run(String[] args) throws Exception
        {
                Dataset students = ...;
                
                this.list(students, 
                        new Column(students, "id"),
                        new Column(students, "f_name"),
                        new Column(students, "l_name")
                ); 
        }
        
        public static void main(String[] args) throws Exception
        {
                System.exit(MobiusJobRunner.run(new MyJob(), args));
        }
 }
 
 

Throws:
java.io.IOException

leftOuterJoin

public JoinOnConfigure leftOuterJoin(Dataset left,
                                     Dataset right,
                                     java.lang.Object nullReplacement)
                              throws java.io.IOException
Perform a "Left Outer Join"; the result contains all records of the left Dataset (the 1st argument), with or without a match in the right Dataset.

If a join group contains no records from the right Dataset (the 2nd argument), then by default null (if the output format is SequenceFileOutputFormat) or an empty string (if the output format is TextOutputFormat) is written for the selected columns from the right Dataset.

If nullReplacement is not null, it is used as the value of the columns from the right Dataset whenever a join group has no match.

Composing a leftOuterJoin job is almost the same as composing an innerJoin(Dataset...) job; simply call leftOuterJoin(Dataset, Dataset, Object) instead of innerJoin, as in the sketch below.
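
Here is a sketch of a left outer join job. It follows the on(...)/save(...) chaining shown in the innerJoin(Dataset...) example; the datasets, column names, and the "N/A" replacement value are purely illustrative:

 
 public class MyJob extends MobiusJob
 {
        public void run(String[] args) throws Exception
        {
                Dataset students = ...;
                Dataset courses = ...;
 
                // students without a matching course record get "N/A" in the
                // selected columns coming from the courses dataset.
                this
                .leftOuterJoin(students, courses, "N/A")
                .on( new EQ(new Column(students, "student_id"), new Column(courses, "student_id")) )
                .save(this, new Path("$OUTPUT"),
                        new Column(students, "student_id"),
                        new Column(students, "f_name"),
                        new Column(students, "l_name"),
                        new Column(courses, "c_title")
                );
        }
        
        public static void main(String[] args) throws Exception
        {
                System.exit(MobiusJobRunner.run(new MyJob(), args));
        }
 }
 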

Parameters:
left - left-hand side Dataset
right - right-hand side Dataset
nullReplacement - the value used for the null columns; it must be of a type supported by Tuple
Throws:
java.io.IOException

leftOuterJoin

public JoinOnConfigure leftOuterJoin(Dataset left,
                                     Dataset right)
                              throws java.io.IOException
Perform a "Left Outer Join"; the result contains all records of the left Dataset (the 1st argument), with or without a match in the right Dataset.

If a join group contains no records from the right Dataset (the 2nd argument), then by default null (if the output format is SequenceFileOutputFormat) or an empty string (if the output format is TextOutputFormat) is written for the selected columns from the right Dataset.

Composing a leftOuterJoin job is almost the same as composing an innerJoin(Dataset...) job; simply call leftOuterJoin(Dataset, Dataset) instead of innerJoin.

Parameters:
left - left-hand side Dataset
right - right-hand side Dataset
Throws:
java.io.IOException

rightOuterJoin

public JoinOnConfigure rightOuterJoin(Dataset left,
                                      Dataset right,
                                      java.lang.Object nullReplacement)
                               throws java.io.IOException
Perform a "Right Outer Join"; the result contains all records of the right Dataset (the 2nd argument), with or without a match in the left Dataset.

If a join group contains no records from the left Dataset (the 1st argument), then by default null (if the output format is SequenceFileOutputFormat) or an empty string (if the output format is TextOutputFormat) is written for the selected columns from the left Dataset.

If nullReplacement is not null, it is used as the value of the columns from the left Dataset whenever a join group has no match.

Composing a rightOuterJoin job is almost the same as composing an innerJoin(Dataset...) job; simply call rightOuterJoin(Dataset, Dataset, Object) instead of innerJoin.

Parameters:
left - left-hand side Dataset
right - right-hand side Dataset
nullReplacement - the value used for the null columns; it must be of a type supported by Tuple
Throws:
java.io.IOException

rightOuterJoin

public JoinOnConfigure rightOuterJoin(Dataset left,
                                      Dataset right)
                               throws java.io.IOException
Perform a "Right Outer Join"; the result contains all records of the right Dataset (the 2nd argument), with or without a match in the left Dataset.

If a join group contains no records from the left Dataset (the 1st argument), then by default null (if the output format is SequenceFileOutputFormat) or an empty string (if the output format is TextOutputFormat) is written for the selected columns from the left Dataset.

Composing a rightOuterJoin job is almost the same as composing an innerJoin(Dataset...) job; simply call rightOuterJoin(Dataset, Dataset) instead of innerJoin.

Parameters:
left - left-hand side Dataset
right - right-hand side Dataset
Throws:
java.io.IOException

innerJoin

public JoinOnConfigure innerJoin(Dataset... datasets)
Perform inner join on the given datasets.

The number of datasets must be at least two. More than two Datasets can be joined at once only if they share a join key, i.e., columns that carry the same meaning; the column names do not have to match, but the column values do.

From the performance perspective, the biggest dataset should be placed in the right-most position. Size here is measured by the number of values per join key, not by the total number of records in a dataset.

Here is an example of how to create an inner join job:

 
 public class MyJob extends MobiusJob
 {
        public void run(String[] args) throws Exception
        {
                Dataset students = ...;
                Dataset courses = ...;
 
                this
                .innerJoin(students, courses)
                .on( new EQ(new Column(students, "student_id"), new Column(courses, "student_id")) )
                .save(this, new Path("$OUTPUT"),
                        new Column(students, "student_id"), 
                        new Column(students, "f_name"),
                        new Column(students, "l_name"),
                        new Column(courses, "c_title")
                );
        }
        
        public static void main(String[] args) throws Exception
        {
                System.exit(MobiusJobRunner.run(new MyJob(), args));
        }
 }
 
 


group

public GroupByConfigure group(Dataset aDataset)
Start a group-by job.

Group the given aDataset by certain column(s) (to be specified in the returned GroupByConfigure).

Here is an example of a group-by job:

 
 public class MyJob extends MobiusJob
 {
        public void run(String[] args) throws Exception
        {
                .....
                this
                .group(order)
                .by(new Column(order, "order_person_id"))
                .save(this,
                        new Path("$OUTPUT_PATH"),
                        new Column(order, "order_person_id"),
                        new Max(new Column(order, "order_id")));
        }
 
        public static void main(String[] args) throws Exception
        {
                System.exit(MobiusJobRunner.run(new MyJob(), args));
        }
 }
 
 


sort

public SortProjectionConfigure sort(Dataset aDataset)
                             throws java.io.IOException
Perform a total sort on aDataset.

After the job finishes, concatenating the output files together yields records sorted according to the given Sorters.

Here is an example of how to start a sort job:

 
 public class MyJob extends MobiusJob
 {
        public void run(String[] args) throws Exception
        {
                .....
                this
                .sort(person)
                .select(
                        new Column(person, "age"),
                        new Column(person, "gender"),
                        new Column(person, "fname"),
                        new Column(person, "lname"))
                .orderBy(
                        new Sorter(new Column(person, "age"), Ordering.ASC, true),
                        new Sorter(new Column(person, "gender"), Ordering.DESC, true))
                .save(
                        this,
                        new Path("$OUTPUT")
                );
        }
 
        public static void main(String[] args) throws Exception
        {
                System.exit(MobiusJobRunner.run(new MyJob(), args));
        }
 }
 
 

Throws:
java.io.IOException

getFS

protected org.apache.hadoop.fs.FileSystem getFS()

newTempPath

public org.apache.hadoop.fs.Path newTempPath()
                                      throws java.io.IOException
Create an empty folder under hadoop.tmp.dir.
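
For example, a small sketch (inside a MobiusJob subclass) that uses such a temporary folder as the output of an intermediate list(Dataset, Path, Column...) job; the students dataset and the "id" column are assumptions for illustration:

 
 // inside run(String[] args) of a MobiusJob subclass
 Path tmp = this.newTempPath();   // scratch folder under hadoop.tmp.dir
 Dataset ids = this.list(students, tmp, new Column(students, "id"));
 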

Throws:
java.io.IOException

addToExecQueue

protected void addToExecQueue(org.apache.hadoop.conf.Configuration aNewJobConf)
                       throws java.io.IOException
Add a job, represented by the aNewJobConf object, to the execution queue.

Use this method to add one or more job configurations to the job queue; the Mobius engine analyzes the configurations in the queue to determine the dependencies between jobs. For example, if job B's input is the output of job A, then B is not submitted until A completes successfully. If A fails, B is not submitted.
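
A minimal sketch of queueing a hand-built Hadoop job configuration from inside a MobiusJob subclass; the job name, mapper class, and paths below are placeholders, not part of the Mobius API:

 
 // build a plain org.apache.hadoop.mapred.JobConf and hand it to the engine;
 // the engine inspects input/output paths to order dependent jobs.
 JobConf conf = new JobConf(this.getConf());
 conf.setJobName("my-custom-step");                            // placeholder job name
 conf.setMapperClass(MyMapper.class);                          // MyMapper is a placeholder mapper
 FileInputFormat.setInputPaths(conf, new Path("/data/in"));    // placeholder input path
 FileOutputFormat.setOutputPath(conf, new Path("/data/out"));  // placeholder output path
 
 this.addToExecQueue(conf);
 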

Parameters:
aNewJobConf - a Configuration object represents a Hadoop job.
Throws:
java.io.IOException

getPathOnly

protected java.lang.String getPathOnly(java.lang.String uriStr)
Return only the "path" part of the input URI (e.g., for an input such as hdfs://namenode:9000/user/data, only /user/data would be returned).