/**
 * Making GZip Splittable for Apache Hadoop
 * Copyright (C) 2011-2014 Niels Basjes
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package nl.basjes.hadoop.io.compress;

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.codec.binary.Hex;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SplitCompressionInputStream;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * For each "split" the gzipped input file is read from the beginning of the
 * file until the point where the split starts, thus reading, decompressing
 * and discarding (wasting!) everything that is before the start of the
 * split.<br>
 * <br>
 * <b>FACT: Files compressed with the Gzip codec are <i>NOT SPLITTABLE</i>.
 * Never have been, never will be.</b><br>
 * <br>
 * This codec offers a trade-off between "spent resources" and "scalability"
 * when reading Gzipped input files by simply always starting at the beginning
 * of the file.<br>
 * So in general this "splittable" Gzip codec will <b>WASTE</b> CPU time,
 * FileSystem IO (HDFS) and probably other system resources (Network) too in
 * order to reduce the "wall clock" time in some real-life situations.<br>
 * <br>
 * <b>When is this useful?</b><br>
 * Assume you have a heavy map phase for which the input is a 1GiB Apache
 * httpd logfile. Now assume this map takes 60 minutes of CPU time to run.
 * Then this task will take 60 minutes to run because all of that CPU time
 * must be spent on a single CPU core ...
 * Gzip is not splittable!<br>
 * <br>
 * This codec will waste CPU power by always starting from the start of the
 * gzipped file and discarding all the decompressed data until the start of
 * the split has been reached.<br>
 * <br>
 * Decompressing a 1GiB Gzip file usually takes only a few (2-4) minutes.<br>
 * So if a "60 minutes" input file is split into 4 equal parts then:
 * <ol>
 * <li>the 1<sup>st</sup> map task will
 * <ul>
 * <li><i>process</i> the 1<sup>st</sup> split (15 minutes).</li>
 * </ul>
 * </li>
 * <li>the 2<sup>nd</sup> map task will
 * <ul>
 * <li><i>discard</i> the 1<sup>st</sup> split ( 1 minute ).</li>
 * <li><i>process</i> the 2<sup>nd</sup> split (15 minutes).</li>
 * </ul>
 * </li>
 * <li>the 3<sup>rd</sup> map task will
 * <ul>
 * <li><i>discard</i> the 1<sup>st</sup> split ( 1 minute ).</li>
 * <li><i>discard</i> the 2<sup>nd</sup> split ( 1 minute ).</li>
 * <li><i>process</i> the 3<sup>rd</sup> split (15 minutes).</li>
 * </ul>
 * </li>
 * <li>the 4<sup>th</sup> map task will
 * <ul>
 * <li><i>discard</i> the 1<sup>st</sup> split ( 1 minute ).</li>
 * <li><i>discard</i> the 2<sup>nd</sup> split ( 1 minute ).</li>
 * <li><i>discard</i> the 3<sup>rd</sup> split ( 1 minute ).</li>
 * <li><i>process</i> the 4<sup>th</sup> split (15 minutes).</li>
 * </ul>
 * </li>
 * </ol>
 * Because all tasks run in parallel the running time in this example would be
 * 18 minutes (i.e. the worst split time) instead of the normal 60 minutes. We
 * have wasted about 6 minutes of CPU time and completed the job in about 30%
 * of the original wall clock time.<br>
 * <br>
 * <b>Using this codec</b>
 * <ol>
 * <li>Enable this codec and <i>make sure the regular GzipCodec is NOT used</i>.
 * This can be done by changing the <b>io.compression.codecs</b> property to
 * something like this:<br>
 * <i>org.apache.hadoop.io.compress.DefaultCodec,
 * nl.basjes.hadoop.io.compress.SplittableGzipCodec,
 * org.apache.hadoop.io.compress.BZip2Codec</i></li>
 * <li>Set the split size to something that works in your situation. This can
 * be done by setting the appropriate values for
 * <b>mapreduce.input.fileinputformat.split.minsize</b> and/or
 * <b>mapreduce.input.fileinputformat.split.maxsize</b> (see the sketch
 * below).</li>
 * </ol>
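 * For illustration, a minimal programmatic sketch of the same settings (the
 * property names are taken from the list above; the surrounding job driver
 * code is assumed):
 * <pre>{@code
 * Configuration conf = new Configuration();
 * // Replace the regular GzipCodec with the splittable variant.
 * conf.set("io.compression.codecs",
 *       "org.apache.hadoop.io.compress.DefaultCodec,"
 *     + "nl.basjes.hadoop.io.compress.SplittableGzipCodec,"
 *     + "org.apache.hadoop.io.compress.BZip2Codec");
 * // Ask for splits of at most 500MiB.
 * conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
 *     500L * 1024L * 1024L);
 * }</pre>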
 * <b>Tuning for optimal performance and scalability.</b><br>
 * The overall advice is to <i>EXPERIMENT</i> with the settings and <i>do
 * benchmarks</i>.<br>
 * Remember that:
 * <ul>
 * <li>Being able to split the input has a positive effect on scalability IFF
 * there is room to scale out to.</li>
 * <li>This codec is only useful if there are fewer Gzipped input files than
 * available map task slots (i.e. some slots are idle during the input/map
 * phase).</li>
 * <li>There is no way of limiting the IO impact. Note that in the above
 * example the 4th map task will read and decompress the ENTIRE input
 * file.</li>
 * <li>Splitting increases the load on (all kinds of) system resources: CPU
 * and HDFS/Network. The additional load on the system resources has a
 * negative effect on the scalability. Splitting a file into 1000 splits will
 * really hammer the datanodes storing the first block of the file 1000
 * times.</li>
 * <li>More splits also affect the number of reduce tasks that follow.</li>
 * <li>If you create more splits than you have map task slots you will
 * certainly have a suboptimal setting and you should increase the split size
 * to reduce the number of splits.</li>
 * </ul>
 * A possible optimum:
 * <ol>
 * <li>Upload the input files into HDFS with a blocksize that is equal to (or
 * a few bytes bigger than) the file size.<br>
 * <i>hadoop fs -Ddfs.block.size=1234567890 -put access.log.gz /logs</i><br>
 * This has the effect that all nodes that have "a piece of the file" always
 * have "the entire file". This ensures that no network IO is needed for a
 * single node to read the file IFF it has it locally available.</li>
 * <li>The replication factor of the HDFS determines on how many nodes the
 * input file is present. So to avoid needless network traffic the number of
 * splits must be limited to AT MOST the replication factor of the underlying
 * HDFS.</li>
 * <li>Try to make sure that all splits of an input file are roughly the same
 * size. Don't be surprised if the optimal setting for the split size turns
 * out to be 500MiB or even 1GiB.</li>
 * </ol>
 * <br>
 * <b>Alternative approaches</b><br>
 * Always remember that there are alternative approaches:
 * <ol>
 * <li>Decompress the original gzipped file, split it into pieces and
 * recompress the pieces before offering them to Hadoop.<br>
 * For example: http://stackoverflow.com/questions/3960651</li>
 * <li>Decompress the original gzipped file and compress it using a different,
 * splittable codec (a sketch follows below).<br>
 * For example {@link org.apache.hadoop.io.compress.BZip2Codec} or not
 * compressing at all.</li>
 * </ol>
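 * As an illustration of that second alternative, a minimal transcoding sketch
 * (the file names are made up; only standard Hadoop codec APIs are used):
 * <pre>{@code
 * Configuration conf = new Configuration();
 * GzipCodec gzip = ReflectionUtils.newInstance(GzipCodec.class, conf);
 * BZip2Codec bzip2 = ReflectionUtils.newInstance(BZip2Codec.class, conf);
 * try (InputStream in = gzip.createInputStream(
 *          new FileInputStream("access.log.gz"));
 *      OutputStream out = bzip2.createOutputStream(
 *          new FileOutputStream("access.log.bz2"))) {
 *   IOUtils.copyBytes(in, out, 64 * 1024); // org.apache.hadoop.io.IOUtils
 * }
 * }</pre>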
 * <hr>
 * <b>Implementation notes</b><br>
 * <br>
 * There were <b>two major hurdles</b> that needed to be solved to make this
 * work:
 * <ol>
 * <li><b>The reported position depends on the read blocksize.</b><br>
 * If you read information in "records" the getBytesRead() will return a value
 * that jumps incrementally. Only <i>after</i> a new disk block has been read
 * will getBytesRead() return a new value. "Read" means: read from disk and
 * loaded into the decompressor, but does NOT yet mean that the uncompressed
 * information was read.<br>
 * The solution employed is that when we get close to the end of the split we
 * switch to a crawling mode. This simply means that the disk reads are
 * reduced to 1 byte, making the position reporting also 1 byte accurate.<br>
 * This was implemented in the {@link ThrottleableDecompressorStream}.</li>
 * <li><b>The input is compressed.</b><br>
 * If you read 1 byte (uncompressed) you do not always get an increase in the
 * reported getBytesRead(). This happens because the value reported by
 * getBytesRead() is all about the filesize on disk (= compressed) and
 * compressed files have fewer bytes than the uncompressed data. This makes it
 * impossible to make two splits meet accurately.<br>
 * The solution is based around the concept that we try to report the position
 * as accurately as possible, but when we get really close to the end we stop
 * reporting the truth and start lying about the position.<br>
 * The lie we use to cross the split boundary is that 1 uncompressed byte read
 * is reported as 1 compressed byte increase in position. This was implemented
 * using a simple state machine with 3 different states on what position is
 * reported through the getPos(). The state is essentially selected on the
 * distance to the end.<br>
 * These states are:
 * <ol>
 * <li><b>REPORT</b><br>
 * Normally read the bytes and report the actual disk position in the
 * getPos().</li>
 * <li><b>HOLD</b><br>
 * When very close to the end we no longer change the reported file position
 * for a while.</li>
 * <li><b>SLOPE</b><br>
 * When we are at the end: start reporting 1 byte increase from the getPos()
 * for every uncompressed byte that was read from the stream.</li>
 * </ol>
 * The overall effect is that the position reporting near the end of the split
 * no longer represents the actual position, and this makes the position
 * usable for reliably splitting the input stream.<br>
 * The actual point where the file is split is shifted a bit to the back of
 * the file (we're talking bytes, not even KiB) where this shift actually
 * depends on the compression levels of the data in the stream. If we start
 * too early the split may happen a byte too early and in the end the last
 * split may lose the last record(s). So that's why we hold for a while and
 * only start the slope at the moment we are certain we are beyond the
 * indicated "end".<br>
 * To ensure the split starts at exactly the same spot as the previous split
 * would end: we find the start of a split by running over the "part that must
 * be discarded" as if it were a split.</li>
 * </ol>
 */
public class SplittableGzipCodec extends GzipCodec implements
    SplittableCompressionCodec {

  private static final Logger LOG =
      LoggerFactory.getLogger(SplittableGzipCodec.class);

  private static final int DEFAULT_FILE_BUFFER_SIZE = 4 * 1024; // 4 KiB

  public SplittableGzipCodec() {
    super();
    LOG.info("Creating instance of SplittableGzipCodec");
  }

  public SplitCompressionInputStream createInputStream(
      final InputStream seekableIn,
      final Decompressor decompressor,
      final long start,
      final long end,
      final READ_MODE readMode) // Ignored by this codec
      throws IOException {
    LOG.info("Creating SplittableGzipInputStream (range = [{},{}])",
        start, end);
    return new SplittableGzipInputStream(
        createInputStream(seekableIn, decompressor),
        start, end,
        getConf().getInt("io.file.buffer.size", DEFAULT_FILE_BUFFER_SIZE));
  }

  // -------------------------------------------

  @Override
  public CompressionInputStream createInputStream(final InputStream in,
      final Decompressor decompressor) throws IOException {
    return new ThrottleableDecompressorStream(in,
        (decompressor == null) ? createDecompressor() : decompressor,
        getConf().getInt("io.file.buffer.size", DEFAULT_FILE_BUFFER_SIZE));
  }

  // ==========================================
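  // Illustrative usage sketch (an assumption, not part of this codec): in a
  // real job the input format calls createInputStream() itself. A direct
  // caller would do something like:
  //
  //   SplittableGzipCodec codec = new SplittableGzipCodec();
  //   codec.setConf(conf);
  //   SplitCompressionInputStream split = codec.createInputStream(
  //       fileSystem.open(path),      // the compressed input
  //       codec.createDecompressor(),
  //       start, end,                 // the split range within the file
  //       SplittableCompressionCodec.READ_MODE.BYBLOCK); // ignored here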
  private static final class SplittableGzipInputStream extends
      SplitCompressionInputStream {

    // We start crawling when within 110% of the blocksize from the split.
    private static final float CRAWL_FACTOR = 1.1F;

    // Just to be sure we always crawl the last part a minimal crawling
    // distance is defined here... 128 bytes works fine.
    private static final int MINIMAL_CRAWL_DISTANCE = 128;

    // At what distance from the target do we HOLD the position reporting.
    // 128 bytes works fine (same as minimal crawl distance).
    private static final int POSITION_HOLD_DISTANCE = 128;

    // When setting log4j into TRACE mode we will report massive amounts
    // of info when this many bytes near the relevant areas.
    private static final int TRACE_REPORTING_DISTANCE = 64;

    private final ThrottleableDecompressorStream in;
    private final int crawlDistance;
    private final int bufferSize;

    // -------------------------------------------

    public SplittableGzipInputStream(final CompressionInputStream inputStream,
        final long start, final long end, final int inputStreamBufferSize)
        throws IOException {
      super(inputStream, start, end);

      bufferSize = inputStreamBufferSize;

      if (getAdjustedEnd() - getAdjustedStart() < bufferSize) {
        throw new IllegalArgumentException("The provided InputSplit "
            + "(" + getAdjustedStart() + ";" + getAdjustedEnd() + "] "
            + "is " + (getAdjustedEnd() - getAdjustedStart())
            + " bytes which is too small. "
            + "(Minimum is " + bufferSize + ")");
      }

      // We MUST have the option of slowing down the reading of data.
      // This check will fail if someone creates a subclass that breaks this.
      if (inputStream instanceof ThrottleableDecompressorStream) {
        this.in = (ThrottleableDecompressorStream) inputStream;
      } else {
        this.in = null; // Permanently cripple this instance ('in' is final).
        throw new IOException("The SplittableGzipCodec relies on"
            + " functionality in the ThrottleableDecompressorStream class.");
      }

      // When this close to the end of the split: crawl (read at most 1 byte
      // at a time) to avoid overshooting the end.
      // This calculates the distance at which we should switch to crawling.
      // Fact is that if the previous buffer is 1 byte further than this value
      // the end of the next block (+ 1 byte) will be the real point where we
      // will start the crawl. --> bufferSize plus either 10% of it or the
      // minimal crawl distance, whichever is larger.
      this.crawlDistance = Math.max(Math.round(CRAWL_FACTOR * bufferSize),
          bufferSize + MINIMAL_CRAWL_DISTANCE);

      // Now we read the stream until we are at the start of this split.
      if (start == 0) {
        return; // That was quick; We're already where we want to be.
      }

      // Set the range we want to run over quickly.
      setStart(0);
      setEnd(start);

      // The target buffer to dump the discarded info to.
      final byte[] skippedBytes = new byte[bufferSize];

      LOG.debug("SKIPPING to position :{}", start);

      while (getPos() < start) {
        // This reads the input and decompresses the data.
        if (-1 == read(skippedBytes, 0, bufferSize)) {
          // An EOF while seeking for the START of the split !?!?
          throw new EOFException("Unexpected end of input stream when"
              + " seeking for the start of the split in"
              + " SplittableGzipCodec:"
              + " start=" + start
              + " adjustedStart=" + start
              + " position=" + getPos());
        }
      }

      LOG.debug("ARRIVED at target location({}): {}", start, getPos());

      // Now we put the real split range values back.
      setStart(start);
      setEnd(end);

      // Set the reporting back to normal.
      posState = POS_STATE.REPORT;
    }

    // -------------------------------------------

    /**
     * Position reporting states.
     */
    enum POS_STATE { REPORT, HOLD, SLOPE }

    private POS_STATE posState = POS_STATE.REPORT;

    /**
     * What do we call this state?
     *
     * @return String with state name useful for logging and debugging.
     */
    private String getStateName() {
      switch (posState) {
        case REPORT:
          return "REPORT";
        case HOLD:
          return "HOLD";
        case SLOPE:
          return "SLOPE";
        default:
          return "ERROR";
      }
    }

    // The reported position used in the HOLD and SLOPE states.
    private long reportedPos = 0;

    @Override
    public long getPos() {
      if (posState == POS_STATE.REPORT) {
        return getRealPos();
      }
      return reportedPos;
    }
    /**
     * The getPos position of the underlying input stream.
     *
     * @return number of bytes that have been read from the compressed input.
     */
    private long getRealPos() {
      return in.getBytesRead();
    }

    // -------------------------------------------

    @Override
    public int read(final byte[] b, final int off, final int len)
        throws IOException {
      final long currentRealPos = getRealPos();
      int maxBytesToRead = Math.min(bufferSize, len);

      final long adjustedEnd = getAdjustedEnd();
      final long adjustedStart = getAdjustedStart();

      if (adjustedStart >= adjustedEnd) {
        return -1; // Nothing to read in this split at all --> indicate EOF
      }

      final long distanceToEnd = adjustedEnd - currentRealPos;
      if (distanceToEnd <= crawlDistance) {
        // We go to a crawl as soon as we are close to the end (or over it).
        maxBytesToRead = 1;

        // We're getting close.
        switch (posState) {
          case REPORT:
            // If we are within 128 bytes of the end we freeze the current
            // value.
            if (distanceToEnd <= POSITION_HOLD_DISTANCE) {
              posState = POS_STATE.HOLD;
              reportedPos = currentRealPos;
              LOG.trace("STATE REPORT --> HOLD @ {}", currentRealPos);
            }
            break;

          case HOLD:
            // When we are ON/AFTER the real "end" then we start the slope.
            // If we start too early the last split may lose the last
            // record(s).
            if (distanceToEnd <= 0) {
              posState = POS_STATE.SLOPE;
              LOG.trace("STATE HOLD --> SLOPE @ {}", currentRealPos);
            }
            break;

          case SLOPE:
            // We are reading 1 byte at a time and reporting 1 byte at a time.
            ++reportedPos;
            break;

          default:
            break;
        }
      } else {
        // At a distance we always do normal reporting.
        // Set the state explicitly: the "end" value can change.
        posState = POS_STATE.REPORT;
      }

      // Debugging facility
      if (LOG.isTraceEnabled()) {
        // When tracing do the first few bytes at crawl speed too.
        final long distanceFromStart = currentRealPos - adjustedStart;
        if (distanceFromStart <= TRACE_REPORTING_DISTANCE) {
          maxBytesToRead = 1;
        }
      }

      // Set the input read step to tune the disk reads to the wanted speed.
      in.setReadStep(maxBytesToRead);

      // Actually read the information.
      final int bytesRead = in.read(b, off, maxBytesToRead);

      // Debugging facility
      if (LOG.isTraceEnabled()) {
        if (bytesRead == -1) {
          LOG.trace("End-of-File");
        } else {
          // Report massive info on the LAST 64 bytes of the split.
          if (getPos() >= getAdjustedEnd() - TRACE_REPORTING_DISTANCE
              && bytesRead < 10) {
            final String bytes = new String(b).substring(0, bytesRead);
            LOG.trace("READ TAIL {} bytes ({} pos = {}/{}): ##{}## HEX:##{}##",
                bytesRead, getStateName(), getPos(), getRealPos(),
                bytes, new String(Hex.encodeHex(bytes.getBytes())));
          }
          // Report massive info on the FIRST 64 bytes of the split.
          if (getPos() <= getAdjustedStart() + TRACE_REPORTING_DISTANCE
              && bytesRead < 10) {
            final String bytes = new String(b).substring(0, bytesRead);
            LOG.trace("READ HEAD {} bytes ({} pos = {}/{}): ##{}## HEX:##{}##",
                bytesRead, getStateName(), getPos(), getRealPos(),
                bytes, new String(Hex.encodeHex(bytes.getBytes())));
          }
        }
      }
      return bytesRead;
    }

    // -------------------------------------------

    @Override
    public void resetState() throws IOException {
      in.resetState();
    }

    // -------------------------------------------

    @Override
    public int read() throws IOException {
      return in.read();
    }

    // -------------------------------------------
  }

  // ===================================================
}