public interface CorpusReader<D extends Document>
An interface for a CorpusReader, which reads uncleaned text from corpus files
and transforms it into an appropriately cleaned Document instance.

Iterator<D> read(File file)
Returns an Iterator that traverses the documents contained in the given file.

file - A text file holding documents in a format that is readable by a
particular CorpusReader. This text file may have its own unique text structure
or an XML format.
Each CorpusReader should specify the expected text format.

Iterator<D> read(Reader baseReader)
Returns an Iterator that traverses the documents contained in baseReader.

baseReader - A Reader that will extract text from a data source, such as a URL,
a File, a data stream, or any other source accessible via the Reader interface.
Each CorpusReader should specify the expected text format, be it an XML schema
or some other unique format.

Copyright © 2012. All Rights Reserved.
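
As an illustration of the contract described above, here is a minimal sketch of one possible implementation. Everything except the two read signatures is an assumption made for the example: the Document stand-in with a text() method, the LineDocument and LineCorpusReader classes, and the "one document per line" format are hypothetical and not part of the documented library. The CorpusReader interface is redeclared only so the sketch compiles on its own.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Small demo: reads two documents from an in-memory Reader.
class CorpusReaderDemo {
    public static void main(String[] args) {
        CorpusReader<LineDocument> reader = new LineCorpusReader();
        Iterator<LineDocument> docs =
            reader.read(new StringReader("  Hello   World \n\nSecond DOC\n"));
        while (docs.hasNext()) {
            System.out.println(docs.next().text());
        }
        // prints "hello world" then "second doc"
    }
}

// Stand-in for the library's Document type (assumption: the real
// interface is not shown in this section).
interface Document {
    String text();
}

// Redeclared here with the two documented signatures so the sketch
// is self-contained.
interface CorpusReader<D extends Document> {
    Iterator<D> read(File file);
    Iterator<D> read(Reader baseReader);
}

// Hypothetical document holding one cleaned line of text.
final class LineDocument implements Document {
    private final String text;
    LineDocument(String text) { this.text = text; }
    @Override public String text() { return text; }
}

// Hypothetical reader. Expected text format: one document per line;
// blank lines are skipped.
final class LineCorpusReader implements CorpusReader<LineDocument> {
    @Override
    public Iterator<LineDocument> read(File file) {
        // Safe to close here because read(Reader) consumes the input eagerly.
        try (Reader r = new FileReader(file)) {
            return read(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Iterator<LineDocument> read(Reader baseReader) {
        List<LineDocument> docs = new ArrayList<>();
        try {
            BufferedReader br = new BufferedReader(baseReader);
            String line;
            while ((line = br.readLine()) != null) {
                // "Cleaning" here means: trim, lowercase, collapse whitespace.
                String cleaned = line.trim()
                                     .toLowerCase(Locale.ROOT)
                                     .replaceAll("\\s+", " ");
                if (!cleaned.isEmpty()) {
                    docs.add(new LineDocument(cleaned));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs.iterator();
    }
}
```

This sketch loads every document into memory before returning the Iterator, which keeps the example short. A reader for large corpora would more likely parse lazily inside hasNext() and next(), and would state its expected format in its own class documentation, as the interface asks.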