gov.llnl.ontology.text.hbase
Class XMLInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<K,V>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
          extended by gov.llnl.ontology.text.hbase.XMLInputFormat

public class XMLInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>

A FileInputFormat for handling gzipped tarball files with each internal file containing data for a single document.

Author:
Keith Stevens

Constructor Summary
XMLInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Returns a XMLRecordReader.
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job)
          Generate the list of files and make them into FileSplits.
static void setXMLTags(org.apache.hadoop.mapreduce.Job job, String delimiter)
           
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XMLInputFormat

public XMLInputFormat()
Method Detail

setXMLTags

public static void setXMLTags(org.apache.hadoop.mapreduce.Job job,
                              String delimiter)

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                   org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                            throws IOException,
                                                                   InterruptedException
Returns a XMLRecordReader. The record reader will return each tarred file.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
Throws:
IOException
InterruptedException

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job)
                                                       throws IOException
Generate the list of files and make them into FileSplits.

Overrides:
getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.io.Text>
Throws:
IOException


Copyright © 2010-2011. All Rights Reserved.