gov.llnl.ontology.text.hbase
Class GzipTarInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
          extended by gov.llnl.ontology.text.hbase.GzipTarInputFormat

public class GzipTarInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>

A FileInputFormat for processing gzipped tarballs in which each internal file holds the data for a single document. It assumes that the input file, or files, are in raw text format, with one path to a gzipped tarball per line. Every entry in each gzipped tarball is treated as a single document. See the usage sketch below.

Author:
Keith Stevens
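
The following is a minimal job-configuration sketch showing how this input format might be wired into a MapReduce job. The listing path "/data/tarball-list.txt", the job name, and the pass-through mapper are illustrative assumptions, not part of this API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

import gov.llnl.ontology.text.hbase.GzipTarInputFormat;

public class GzipTarJobSetup {

    /** Trivial mapper illustrating the record types delivered by GzipTarInputFormat. */
    public static class PassThroughMapper
            extends Mapper<ImmutableBytesWritable, Text, ImmutableBytesWritable, Text> {
        // The inherited identity map() is sufficient for this sketch.
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "process gzipped tarballs");
        job.setJarByClass(GzipTarJobSetup.class);

        // Each tar entry inside each listed tarball becomes one map record.
        job.setInputFormatClass(GzipTarInputFormat.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);

        // The input is a plain text file listing one gzipped tarball path per line;
        // "/data/tarball-list.txt" is an illustrative path, not part of the API.
        FileInputFormat.addInputPath(job, new Path("/data/tarball-list.txt"));

        // Discard output for this sketch; a real job would write somewhere useful.
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}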

Nested Class Summary
 class GzipTarInputFormat.GzipTarRecordReader
          A RecordReader for processing gzipped tarballs of document files.
 
Constructor Summary
GzipTarInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Returns a GzipTarInputFormat.GzipTarRecordReader.
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
          Returns a List of FileSplits.
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GzipTarInputFormat

public GzipTarInputFormat()
Method Detail

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                   org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                            throws IOException,
                                                                   InterruptedException
Returns a GzipTarInputFormat.GzipTarRecordReader. The record reader emits one record for each file stored in the gzipped tarball.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
Throws:
IOException
InterruptedException
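
Because the format is parameterized as <ImmutableBytesWritable, Text>, a mapper consuming its records would look roughly like the sketch below. The assumption that the key bytes identify the tarred file, and the chosen output types, are illustrative only.

import java.io.IOException;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical mapper showing the record types produced by GzipTarInputFormat:
 * one (ImmutableBytesWritable, Text) pair per file inside each gzipped tarball.
 */
public class DocumentLengthMapper
        extends Mapper<ImmutableBytesWritable, Text, Text, IntWritable> {

    @Override
    public void map(ImmutableBytesWritable key, Text document, Context context)
            throws IOException, InterruptedException {
        // The key is assumed to identify the tarred file; the value holds its full text.
        String docId = new String(key.get(), key.getOffset(), key.getLength());
        context.write(new Text(docId), new IntWritable(document.getLength()));
    }
}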

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                       throws IOException
Returns a List of FileSplits. Each FileSplit corresponds to a single gzipped tarball of XML documents, and each tarred file within it should contain a single document.

Overrides:
getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
Throws:
IOException
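
As an illustration of the behavior described above (one split per gzipped tarball listed in the input file), the sketch below builds a FileSplit per listed path. It is not the actual implementation; the class and method names are hypothetical.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Illustrative helper: one FileSplit per tarball path listed in a text file. */
public class TarballSplitSketch {

    public static List<InputSplit> splitsFromListing(Configuration conf, Path listing)
            throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        FileSystem fs = listing.getFileSystem(conf);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(listing)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                Path tarball = new Path(line.trim());
                long length = fs.getFileStatus(tarball).getLen();
                // Each gzipped tarball becomes one split covering the whole file.
                splits.add(new FileSplit(tarball, 0, length, new String[0]));
            }
        } finally {
            reader.close();
        }
        return splits;
    }
}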


Copyright © 2010-2011. All Rights Reserved.