java.lang.Object
  org.apache.hadoop.conf.Configured
    gov.llnl.ontology.mapreduce.ingest.ImportCorpusMR
public class ImportCorpusMR extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool
The ImportCorpusMR.ImportCorpusMapper iterates over text documents on disk and extracts various document details and the raw document text. All of the extracted information is stored in a CorpusTable. Documents can be imported from two formats: a file of file paths, with each path pointing to a gzipped tarball of documents, or a list of XML files, each of which contains many individual documents.

The DocumentReader is responsible for most of the work. The provided implementation should extract the salient metadata for each document, especially the corpus name, and the raw document text. All of this information is saved in the specified CorpusTable.
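The reader's role described above can be sketched as follows. This is a minimal illustration, not the library's DocumentReader: the `ParsedDocument` and `XmlReaderSketch` types, the `readAll` method, and the `<doc corpus="...">` markup are all hypothetical names invented for the example. It shows one XML file holding several documents, with the corpus name and raw text pulled out of each.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the details extracted from one document. */
final class ParsedDocument {
    final String corpusName;
    final String rawText;

    ParsedDocument(String corpusName, String rawText) {
        this.corpusName = corpusName;
        this.rawText = rawText;
    }
}

/** Hypothetical reader; the real DocumentReader interface may look different. */
final class XmlReaderSketch {
    // Assumed markup: <doc corpus="name">text</doc>. This layout is invented
    // for the example and is not taken from the library.
    private static final Pattern DOC = Pattern.compile(
            "<doc\\s+corpus=\"([^\"]*)\"\\s*>(.*?)</doc>", Pattern.DOTALL);

    /** Splits the content of one XML file into its individual documents. */
    static List<ParsedDocument> readAll(String xml) {
        List<ParsedDocument> docs = new ArrayList<>();
        Matcher m = DOC.matcher(xml);
        while (m.find()) {
            // Group 1 is the corpus name, group 2 is the raw document text.
            docs.add(new ParsedDocument(m.group(1), m.group(2).trim()));
        }
        return docs;
    }

    public static void main(String[] args) {
        String xml = "<doc corpus=\"nyt\">First text.</doc>\n"
                   + "<doc corpus=\"afp\">Second text.</doc>";
        for (ParsedDocument doc : readAll(xml)) {
            System.out.println(doc.corpusName + ": " + doc.rawText);
        }
    }
}
```

A regular expression keeps the sketch short; a production reader would more likely use a proper XML parser and would hand each parsed document to the table rather than printing it.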
This class requires that the following types of objects be specified on the command line:
CorpusTable: Controls access to the document table.
DocumentReader: Reads metadata for each document.
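One common way to let a command line choose an implementation is to store a class name under a configuration key and instantiate it reflectively. The sketch below assumes that approach; this page does not confirm that ImportCorpusMR works this way. The key values, the `ReflectiveLoaderSketch` class, and the `load` helper are hypothetical, and `java.util.Properties` stands in for the Hadoop configuration object.

```java
import java.util.Properties;

final class ReflectiveLoaderSketch {
    // Hypothetical key values: the real values of CONF_PREFIX and READER
    // are not shown on this page.
    static final String CONF_PREFIX = "example.importCorpus";
    static final String READER = CONF_PREFIX + ".reader";

    /**
     * Instantiates the class named under the given key through its no-argument
     * constructor and checks that it is a subtype of the expected type.
     */
    static <T> T load(Properties conf, String key, Class<T> type)
            throws ReflectiveOperationException {
        String className = conf.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("Missing configuration key: " + key);
        }
        // Class.forName throws ClassNotFoundException for an unknown name.
        return Class.forName(className)
                .asSubclass(type)
                .getDeclaredConstructor()
                .newInstance();
    }

    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        // StringBuilder stands in for a user-supplied reader class.
        conf.setProperty(READER, "java.lang.StringBuilder");
        CharSequence reader = load(conf, READER, CharSequence.class);
        System.out.println(reader.getClass().getName());
    }
}
```

Loading by name keeps the driver independent of any particular table or reader implementation, at the cost of deferring type errors to run time.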
Nested Class Summary | |
---|---|
static class | ImportCorpusMR.ImportCorpusMapper: This Mapper iterates over text documents on disk and extracts various document details and the raw document text. |
Field Summary | |
---|---|
static String | CONF_PREFIX: The configuration key prefix. |
static String | CORP: The configuration key for setting the non-default corpus name. |
static String | READER: The configuration key for setting the DocumentReader. |
static String | TABLE: The configuration key for setting the CorpusTable. |
Constructor Summary | |
---|---|
ImportCorpusMR() | |
Method Summary | |
---|---|
static void | main(String[] args): Runs the IngestCorpusMR. |
int | run(String[] args) |
Methods inherited from class org.apache.hadoop.conf.Configured |
---|
getConf, setConf |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.apache.hadoop.conf.Configurable |
---|
getConf, setConf |
Field Detail |
---|
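public static String CONF_PREFIX
The configuration key prefix.
public static String TABLE
The configuration key for setting the CorpusTable.
public static String READER
The configuration key for setting the DocumentReader.
public static String CORP
The configuration key for setting the non-default corpus name.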
Constructor Detail |
---|
public ImportCorpusMR()
Method Detail |
---|
public static void main(String[] args) throws Exception
Runs the IngestCorpusMR.
Throws: Exception
public int run(String[] args) throws Exception, InterruptedException, ClassNotFoundException
Specified by: run in interface org.apache.hadoop.util.Tool
Throws: Exception, InterruptedException, ClassNotFoundException
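The split between main and run follows a familiar driver pattern: run does the work and returns a status code, and main turns that code into the process exit status. The sketch below illustrates the pattern with plain Java only. `MiniTool` is a stand-in for org.apache.hadoop.util.Tool reduced to a single method, and `ImportDriverSketch` with its two positional arguments is hypothetical; the actual arguments accepted by ImportCorpusMR are not documented on this page.

```java
/** Stand-in for org.apache.hadoop.util.Tool, reduced to the one method used here. */
interface MiniTool {
    int run(String[] args) throws Exception;
}

/** Hypothetical driver; the argument list is an assumption made for the example. */
final class ImportDriverSketch implements MiniTool {
    @Override
    public int run(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: ImportDriverSketch <input-list> <table-name>");
            return 1; // Non-zero signals a usage error to the caller.
        }
        // A real driver would configure and submit the MapReduce job here.
        System.out.println("Would import " + args[0] + " into " + args[1]);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // The status code returned by run becomes the process exit status.
        System.exit(new ImportDriverSketch().run(args));
    }
}
```

In a real Hadoop driver, main would usually delegate to ToolRunner.run so that generic Hadoop options are parsed before run is invoked.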