SemEvalCorpusReader (S-Space Package 2.0.1 API)

java.lang.Object
- org.xml.sax.helpers.DefaultHandler
- - edu.ucla.sspace.text.corpora.SemEvalCorpusReader

All Implemented Interfaces:

CorpusReader<Document>, ContentHandler, DTDHandler, EntityResolver, ErrorHandler
```
public class SemEvalCorpusReader
extends org.xml.sax.helpers.DefaultHandler
implements CorpusReader<Document>
```
Reads the XML corpus files for the SemEval 2010 Word Sense Induction task. Each file contains all of the contexts for a single word. The XML files should be unchanged from their original format.
This CorpusReader returns documents in the following format:
word_instance_id text ... ||| *focus_word* text ...
This format is necessary for evaluating against the SemEval testing framework, which requires both the focus word and the instance id.
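As an illustration of consuming this format, the following is a minimal sketch that splits one returned document into its parts. It assumes that the first token is the instance id, that `|||` separates the preceding context from the focus word, and that the first token after `|||` is the focus word. The class `SemEvalDocumentLine` and its fields are hypothetical names and are not part of the S-Space API.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical holder for one document in the
 * "instance_id text ... ||| focus_word text ..." layout (assumed layout).
 */
final class SemEvalDocumentLine {
    final String instanceId;
    final List<String> leftContext;
    final String focusWord;
    final List<String> rightContext;

    private SemEvalDocumentLine(String instanceId, List<String> leftContext,
                                String focusWord, List<String> rightContext) {
        this.instanceId = instanceId;
        this.leftContext = leftContext;
        this.focusWord = focusWord;
        this.rightContext = rightContext;
    }

    /** Splits the document text around the first "|||" separator. */
    static SemEvalDocumentLine parse(String text) {
        // Limit of 2 keeps any later "|||" inside the right-hand side.
        String[] halves = text.trim().split("\\s*\\|\\|\\|\\s*", 2);
        if (halves.length != 2) {
            throw new IllegalArgumentException("missing ||| separator");
        }
        String[] left = halves[0].split("\\s+");
        String[] right = halves[1].split("\\s+");
        if (left[0].isEmpty() || right[0].isEmpty()) {
            throw new IllegalArgumentException("missing instance id or focus word");
        }
        // Assumption: left[0] is the instance id, right[0] is the focus word.
        return new SemEvalDocumentLine(
                left[0],
                Arrays.asList(left).subList(1, left.length),
                right[0],
                Arrays.asList(right).subList(1, right.length));
    }
}
```

For example, `SemEvalDocumentLine.parse("add.v.1 we can ||| add more salt")` would yield the instance id `add.v.1` and the focus word `add` under these assumptions.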
Note that this is implemented as a DefaultHandler for a SAXParser because of the difficult nature of the SemEval WSI XML format. Line-based methods do not work, as the entire XML document is contained on a single line. Furthermore, the test set has an additional nested tag that specifies the target sentence. This information is discarded, as it does not specify the focus word in each context. Instead, this reader lemmatizes each word until it finds a context word that matches the lemmatized version of the instance id.
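The SAX-based approach described above can be sketched roughly as follows. This is not the S-Space implementation. It assumes that each direct child of the root element is one instance whose tag name is the instance id, and that any deeper tag (such as a target-sentence tag) contributes only its text. The class `InstanceContextHandler` is a hypothetical name, and the lemmatization step that locates the focus word is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that collects "instance_id context" strings. */
final class InstanceContextHandler extends DefaultHandler {
    private final List<String> contexts = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String instanceId;
    private int depth;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        depth++;
        // Assumption: depth 2 means a direct child of the root, i.e. one instance.
        if (depth == 2) {
            instanceId = qName;
            text.setLength(0);
        }
        // Deeper tags are ignored; only their character data is kept.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth >= 2) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 2) {
            String context = text.toString().trim().replaceAll("\\s+", " ");
            contexts.add(instanceId + " " + context);
        }
        depth--;
    }

    /** Parses the whole stream, so line breaks in the input do not matter. */
    static Iterator<String> read(Reader reader) {
        InstanceContextHandler handler = new InstanceContextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(reader), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse corpus", e);
        }
        return handler.contexts.iterator();
    }
}
```

Because the parser reacts to element events rather than to line boundaries, a corpus stored on a single line is handled the same way as one spread over many lines.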

Author:

Keith Stevens

Nested Class Summary

Nested Classes
Modifier and Type Class and Description

class SemEvalCorpusReader.SemEvalHandler

Constructor Summary

Constructors
Constructor and Description

SemEvalCorpusReader()

Method Summary

Methods
Modifier and Type	Method and Description
`Iterator<Document>`	`read(File file)` Returns an `Iterator` that traverses the documents contained in the given `file`.
`Iterator<Document>`	`read(Reader reader)` Returns an `Iterator` that traverses the documents contained in `reader`.

Methods inherited from class org.xml.sax.helpers.DefaultHandler
characters, endDocument, endElement, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, unparsedEntityDecl, warning

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - SemEvalCorpusReader
```
public SemEvalCorpusReader()
```
- Method Detail
  - read
```
public Iterator<Document> read(Reader reader)
```
    Returns an Iterator that traverses the documents contained in reader.
    
    Specified by:
    
    read in interface CorpusReader<Document>
    
    Parameters:
    reader - A Reader that will extract text from a data source, such as a URL, a File, a data stream, or any other source accessible via the Reader interface. Each CorpusReader should specify the expected text format, be it an XML schema or some other unique format.
  - read
```
public Iterator<Document> read(File file)
```
    Returns an Iterator that traverses the documents contained in the given file.
    
    Specified by:
    
    read in interface CorpusReader<Document>
    
    Parameters:
    file - A text file holding documents in a format that is readable by a particular CorpusReader. This text file may have its own unique text structure or an XML format. Each CorpusReader should specify the expected text format.


Copyright © 2012. All Rights Reserved.