gov.llnl.ontology.mapreduce.ingest
Class ImportCorpusMR

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by gov.llnl.ontology.mapreduce.ingest.ImportCorpusMR
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class ImportCorpusMR
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool

The ImportCorpusMR.ImportCorpusMapper iterates over text documents on disk and extracts various document details along with the raw text of each document. All of the extracted information is stored in a CorpusTable. Documents can be imported from two formats: a file of file paths, where each path points to a gzipped tarball of documents, or a list of XML files, each of which contains many individual documents.

During processing, a DocumentReader does most of the work. The provided implementation should extract the salient metadata for each document, especially the corpus name, along with the raw document text. All of this information is saved in the specified CorpusTable.
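
The division of labor described above can be illustrated with a minimal, self-contained sketch. It does not use the real DocumentReader or CorpusTable types, whose signatures are not shown on this page. Every name in it (SimpleReader, HeaderLineReader, ParsedDocument, importAll) is hypothetical, the header-line input format is invented for the example, and an in-memory map stands in for the table. The optional corpus override is only a guess at how a non-default corpus name might be applied.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: none of these types belong to the documented package.
final class DocumentReaderSketch {

    /** Hypothetical holder for the details a reader extracts from one document. */
    static final class ParsedDocument {
        final String corpusName;
        final String key;
        final String rawText;

        ParsedDocument(String corpusName, String key, String rawText) {
            this.corpusName = corpusName;
            this.key = key;
            this.rawText = rawText;
        }
    }

    /** Hypothetical reader contract; the real DocumentReader may look different. */
    interface SimpleReader {
        ParsedDocument read(String rawDocument, String corpusOverride);
    }

    /** Toy reader for an assumed format: first line is "corpus<TAB>key", the rest is text. */
    static final class HeaderLineReader implements SimpleReader {
        @Override
        public ParsedDocument read(String rawDocument, String corpusOverride) {
            int newline = rawDocument.indexOf('\n');
            String header = newline < 0 ? rawDocument : rawDocument.substring(0, newline);
            String body = newline < 0 ? "" : rawDocument.substring(newline + 1);
            String[] parts = header.split("\t", 2);
            // A non-null override replaces the corpus name found in the document.
            String corpus = corpusOverride != null ? corpusOverride : parts[0];
            String key = parts.length > 1 ? parts[1] : "";
            return new ParsedDocument(corpus, key, body);
        }
    }

    /** Reads every raw document and stores the result in a map standing in for the table. */
    static Map<String, ParsedDocument> importAll(
            Iterable<String> rawDocuments, SimpleReader reader, String corpusOverride) {
        Map<String, ParsedDocument> table = new LinkedHashMap<>();
        for (String raw : rawDocuments) {
            ParsedDocument doc = reader.read(raw, corpusOverride);
            table.put(doc.corpusName + "/" + doc.key, doc);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, ParsedDocument> table = importAll(
                Arrays.asList("news\tdoc1\nHello world", "news\tdoc2\nSecond document"),
                new HeaderLineReader(),
                null);
        for (Map.Entry<String, ParsedDocument> row : table.entrySet()) {
            System.out.println(row.getKey() + " -> " + row.getValue().rawText);
        }
    }
}
```

Keeping the parsing behind a small reader interface means a new document format only needs a new reader; the import loop and the storage step stay unchanged.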

This class requires that the following types of objects be specified by the command line:

Author:
Keith Stevens

Nested Class Summary
static class ImportCorpusMR.ImportCorpusMapper
          This Mapper iterates over text documents on disk and extracts various document details and the raw document text.
 
Field Summary
static String CONF_PREFIX
          The configuration key prefix.
static String CORP
          The configuration key for setting the non-default corpus name.
static String READER
          The configuration key for setting the DocumentReader.
static String TABLE
          The configuration key for setting the CorpusTable.
 
Constructor Summary
ImportCorpusMR()
           
 
Method Summary
static void main(String[] args)
          Runs the ImportCorpusMR.
 int run(String[] args)
          
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

CONF_PREFIX

public static String CONF_PREFIX
The configuration key prefix.


TABLE

public static String TABLE
The configuration key for setting the CorpusTable.


READER

public static String READER
The configuration key for setting the DocumentReader.


CORP

public static String CORP
The configuration key for setting the non-default corpus name.

Constructor Detail

ImportCorpusMR

public ImportCorpusMR()
Method Detail

main

public static void main(String[] args)
                 throws Exception
Runs the ImportCorpusMR.

Throws:
Exception

run

public int run(String[] args)
        throws Exception,
               InterruptedException,
               ClassNotFoundException

Specified by:
run in interface org.apache.hadoop.util.Tool
Throws:
Exception
InterruptedException
ClassNotFoundException


Copyright © 2010-2011. All Rights Reserved.