public class SemEvalCorpusReader extends org.xml.sax.helpers.DefaultHandler implements CorpusReader<Document>
A CorpusReader that returns documents in the following format:
word_instance_id text ... ||| *focus_word* text ...
This format is necessary for evaluating against the SemEval testing
framework, which requires both the focus word and the instance id.
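To make the document format above concrete, here is a minimal sketch of how a consumer might split one returned document back into its parts. It assumes the first whitespace-delimited token is the instance id, that `|||` separates the text before the focus word from the focus word and the text after it, and that the asterisks in the format line are emphasis rather than literal characters. The class `SemEvalDocumentLine` is hypothetical and not part of the library.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the library: one parsed document line. */
final class SemEvalDocumentLine {
    final String instanceId;
    final String[] leftContext;
    final String focusWord;
    final String[] rightContext;

    private SemEvalDocumentLine(String instanceId, String[] leftContext,
                                String focusWord, String[] rightContext) {
        this.instanceId = instanceId;
        this.leftContext = leftContext;
        this.focusWord = focusWord;
        this.rightContext = rightContext;
    }

    /** Splits "instance_id text ... ||| focus_word text ..." into its parts. */
    static SemEvalDocumentLine parse(String document) {
        // Split once on the "|||" separator, swallowing surrounding whitespace.
        String[] halves = document.split("\\s*\\|\\|\\|\\s*", 2);
        if (halves.length != 2) {
            throw new IllegalArgumentException("missing ||| separator: " + document);
        }
        String[] left = halves[0].trim().split("\\s+");
        String[] right = halves[1].trim().split("\\s+");
        if (left[0].isEmpty() || right[0].isEmpty()) {
            throw new IllegalArgumentException("missing instance id or focus word: " + document);
        }
        // Assumption: the instance id is the first token, the focus word is
        // the first token after the separator.
        return new SemEvalDocumentLine(
            left[0],
            Arrays.copyOfRange(left, 1, left.length),
            right[0],
            Arrays.copyOfRange(right, 1, right.length));
    }
}
```

Keeping the instance id and focus word as separate fields mirrors what the evaluation framework is described as needing, without committing to how the reader itself builds the string.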
Note that this class is implemented as a DefaultHandler for a SAXParser
because of the difficult nature of the SemEval WSI XML format. Line-based
methods do not work, as the entire XML document is contained on a single line.
Furthermore, the test set has an additional nested tag that specifies the
target sentence. This information is discarded, as it does not specify the
focus word in each context. Instead, this reader lemmatizes each word until it
finds a context word that matches the lemmatized version of the instance id.

Modifier and Type | Class and Description |
---|---|
class | SemEvalCorpusReader.SemEvalHandler |
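The class description explains that parsing goes through a SAX DefaultHandler because the whole XML document sits on one line. The following is a rough sketch of that general approach, not the actual SemEvalHandler. The XML layout it expects (a root element, one element per target word, and one element per instance named after the instance id) is an assumption made for illustration, and `InstanceTextHandler` is a hypothetical name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: collects "instanceId text" for each third-level element. */
class InstanceTextHandler extends DefaultHandler {
    private final List<String> contexts = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
        // Assumed layout: depth 1 = corpus, depth 2 = word, depth 3 = instance.
        if (depth == 3) {
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text inside an instance, including text under nested tags,
        // so the nested markup itself is dropped but its words survive.
        if (depth >= 3) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 3) {
            contexts.add(qName + " " + text.toString().trim());
        }
        depth--;
    }

    /** Parses an XML string and returns one entry per instance element. */
    static List<String> parse(String xml) throws Exception {
        InstanceTextHandler handler = new InstanceTextHandler();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.contexts;
    }
}
```

Because SAX delivers events as the stream is consumed, it does not matter that the input has no line breaks, which is the problem the description raises with line-based reading.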
Constructor and Description |
---|
SemEvalCorpusReader() |
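The class description says the focus word is found by lemmatizing each context word until one matches the lemmatized instance id. A minimal sketch of that search follows. It assumes instance ids look like `add.v.1`, so that the text before the first dot is the target word, and it uses a toy suffix stripper in place of a real lemmatizer. `FocusWordFinder` and `toyLemma` are hypothetical names and say nothing about which lemmatizer the library actually uses.

```java
import java.util.function.UnaryOperator;

/** Hypothetical sketch of locating the focus word by lemma comparison. */
final class FocusWordFinder {

    /** Returns the index of the first token whose lemma matches the id's lemma, or -1. */
    static int findFocusIndex(String instanceId, String[] tokens,
                              UnaryOperator<String> lemmatizer) {
        // Assumption: ids look like "add.v.1"; the part before the first '.'
        // is the target word.
        int dot = instanceId.indexOf('.');
        String target = dot < 0 ? instanceId : instanceId.substring(0, dot);
        String targetLemma = lemmatizer.apply(target.toLowerCase());
        for (int i = 0; i < tokens.length; i++) {
            if (lemmatizer.apply(tokens[i].toLowerCase()).equals(targetLemma)) {
                return i;
            }
        }
        return -1;
    }

    /** Toy stand-in for a real lemmatizer: strips a few common English suffixes. */
    static String toyLemma(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}
```

A call such as `FocusWordFinder.findFocusIndex("add.v.1", tokens, FocusWordFinder::toyLemma)` stops at the first match, which follows the "until it finds" wording of the description.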
Modifier and Type | Method and Description |
---|---|
Iterator<Document> | read(File file) Returns an Iterator that traverses the documents contained in the given file. |
Iterator<Document> | read(Reader reader) Returns an Iterator that traverses the documents contained in the given reader. |
Methods inherited from class org.xml.sax.helpers.DefaultHandler:
characters, endDocument, endElement, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, unparsedEntityDecl, warning
public Iterator<Document> read(Reader reader)
Returns an Iterator that traverses the documents contained in the given reader.
Specified by: read in interface CorpusReader<Document>
Parameters: reader - A Reader that will extract text from a data
source, such as a URL, a File, a data stream, or any other source
accessible via the Reader interface. Each CorpusReader should specify
the expected text format, be it an XML schema or some other unique format.

public Iterator<Document> read(File file)
Returns an Iterator that traverses the documents contained in the given file.
Specified by: read in interface CorpusReader<Document>
Parameters: file - A text file holding documents in a format that is
readable by a particular CorpusReader. This text file may have its own
unique text structure or an XML format. Each CorpusReader should specify
the expected text format.

Copyright © 2012. All Rights Reserved.