gov.llnl.ontology.text.hbase
Class GzipXmlInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<K,V>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
          extended by gov.llnl.ontology.text.hbase.GzipXmlInputFormat

public class GzipXmlInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>

A FileInputFormat for gzipped XML files. Before starting a job, call setXMLTags(org.apache.hadoop.mapreduce.Job, java.lang.String) to specify the text of the document-delimiting tags.
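
The idea of splitting a gzipped XML stream into documents by a delimiting tag can be illustrated with the standard library alone. The following is a minimal sketch, not the implementation used by GzipXmlRecordReader: the class GzipXmlSketch and its methods gzip and readDocuments are hypothetical names, the delimiter is assumed to be a bare tag name such as "doc", and tar handling is left out entirely.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; it is not part of gov.llnl.ontology.text.hbase.
public class GzipXmlSketch {

    /** Gzips a string in memory so the example needs no files on disk. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /**
     * Decompresses the stream and returns every span that runs from
     * "<tag>" through "</tag>". Assumption: the tag has no attributes
     * and documents are not nested.
     */
    public static List<String> readDocuments(InputStream gzipped, String tag)
            throws IOException {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";

        // For brevity the whole stream is buffered in memory; a real
        // record reader would scan incrementally instead.
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                content.append(buffer, 0, n);
            }
        }

        List<String> documents = new ArrayList<>();
        int start = content.indexOf(open);
        while (start >= 0) {
            int end = content.indexOf(close, start);
            if (end < 0) {
                break; // unterminated document: stop rather than guess
            }
            documents.add(content.substring(start, end + close.length()));
            start = content.indexOf(open, end + close.length());
        }
        return documents;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<corpus><doc>first</doc>\n<doc>second</doc></corpus>";
        List<String> docs =
                readDocuments(new ByteArrayInputStream(gzip(xml)), "doc");
        for (String doc : docs) {
            System.out.println(doc);
        }
        // prints:
        // <doc>first</doc>
        // <doc>second</doc>
    }
}
```

Whether setXMLTags expects the bare tag name or the full tag text is not stated on this page, so check the source before relying on either form.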

Author:
Keith Stevens

Constructor Summary
GzipXmlInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Returns a GzipXmlRecordReader.
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
          Returns a List of FileSplits.
static void setXMLTags(org.apache.hadoop.mapreduce.Job job, String delimiter)
          Sets the text of the document-delimiting tags for the given job.
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GzipXmlInputFormat

public GzipXmlInputFormat()
Method Detail

setXMLTags

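public static void setXMLTags(org.apache.hadoop.mapreduce.Job job,
                              String delimiter)
Sets the text of the document-delimiting tags for the given job. Call this before starting the job.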

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                   org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                            throws IOException,
                                                                   InterruptedException
Returns a GzipXmlRecordReader. The record reader returns each tarred file as one record.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
Throws:
IOException
InterruptedException

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                       throws IOException
Returns a List of FileSplits. Each FileSplit is a gzipped tarball of XML documents. Each tarred file should contain a single document.

Overrides:
getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
Throws:
IOException


Copyright © 2010-2011. All Rights Reserved.