gov.llnl.ontology.text.corpora
Class NYTDocumentReader

java.lang.Object
  extended by gov.llnl.ontology.text.corpora.NYTDocumentReader
All Implemented Interfaces:
DocumentReader

public class NYTDocumentReader
extends Object
implements DocumentReader

NYTDocumentReader
Created: Jun 17, 2008
Author: Evan Sandhaus (sandhes@nytimes.com)

Class for parsing New York Times articles from NITF files.

Author:
Evan Sandhaus

Field Summary
static String DATE_PUBLICATION_ATTRIBUTE
          NITF Constant
 
Constructor Summary
NYTDocumentReader()
           
 
Method Summary
static NYTCorpusDocument parseNYTCorpusDocumentFromDOMDocument(Document document)
           
static NYTCorpusDocument parseNYTCorpusDocumentFromDOMDocument(File file, Document document)
           
static NYTCorpusDocument parseNYTCorpusDocumentFromFile(File file, boolean validating)
          Parse a New York Times document from a file.
static NYTCorpusDocument parseNYTCorpusDocumentFromString(String str, boolean validating)
          Parse a New York Times document from a string.
 NYTCorpusDocument readDocument(String doc)
          Returns a Document represented by the given string.
 NYTCorpusDocument readDocument(String doc, String corpusName)
          Returns a Document represented by the given string and uses corpusName as the corpus name for the returned Document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DATE_PUBLICATION_ATTRIBUTE

public static final String DATE_PUBLICATION_ATTRIBUTE
NITF Constant

See Also:
Constant Field Values
Constructor Detail

NYTDocumentReader

public NYTDocumentReader()
Method Detail

readDocument

public NYTCorpusDocument readDocument(String doc)
Returns a Document represented by the given string.

Specified by:
readDocument in interface DocumentReader

readDocument

public NYTCorpusDocument readDocument(String doc,
                                      String corpusName)
Returns a Document represented by the given string and uses corpusName as the corpus name for the returned Document.

Specified by:
readDocument in interface DocumentReader

parseNYTCorpusDocumentFromFile

public static NYTCorpusDocument parseNYTCorpusDocumentFromFile(File file,
                                                               boolean validating)
Parse a New York Times document from a file.

Parameters:
file - The file from which to parse the document.
validating - True if the file is to be validated against the NITF DTD, false if it is not. Passing false is recommended, as all documents in the corpus have previously been validated against the NITF DTD.
Returns:
The parsed document, or null if an error occurs.

parseNYTCorpusDocumentFromString

public static NYTCorpusDocument parseNYTCorpusDocumentFromString(String str,
                                                                 boolean validating)
Parse a New York Times document from a string.

Parameters:
str - The string from which to parse the document.
validating - True if the string is to be validated against the NITF DTD, false if it is not. Passing false is recommended, as all documents in the corpus have previously been validated against the NITF DTD.
Returns:
The parsed document, or null if an error occurs.
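
The contract described above (a validating flag, and null on error instead of an exception) can be illustrated with the standard JAXP API alone. This is a minimal sketch and not the actual implementation of this class: the class name NitfDomSketch and the helper parseToDom are hypothetical, and it only produces a DOM Document, not an NYTCorpusDocument.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of gov.llnl.ontology.
public class NitfDomSketch {

    /**
     * Hypothetical helper: parses XML text into a DOM Document.
     * Returns null if the text cannot be parsed, mirroring the
     * "null if an error occurs" behaviour documented above.
     */
    static Document parseToDom(String xml, boolean validating) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // When true, the parser checks the input against its declared DTD.
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Malformed XML, I/O problems, or parser configuration errors.
            return null;
        }
    }

    public static void main(String[] args) {
        Document ok = parseToDom("<nitf><head><title>T</title></head></nitf>", false);
        System.out.println(ok != null ? ok.getDocumentElement().getTagName() : "parse failed");

        Document bad = parseToDom("<nitf>", false); // unclosed root element
        System.out.println(bad == null ? "parse failed" : "parsed");
    }
}
```

Because errors are reported as null, callers of a method with this contract need a null check before using the result.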

parseNYTCorpusDocumentFromDOMDocument

public static NYTCorpusDocument parseNYTCorpusDocumentFromDOMDocument(File file,
                                                                      Document document)

parseNYTCorpusDocumentFromDOMDocument

public static NYTCorpusDocument parseNYTCorpusDocumentFromDOMDocument(Document document)
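
The two DOM-based methods above carry no description. As a rough illustration of what reading a field out of an NITF DOM tree can look like, the sketch below pulls the text of the first title element using only the JDK. It assumes the usual NITF layout with a title element under head; the class NitfTitleSketch and its helpers are hypothetical and say nothing about which fields this class actually extracts.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of gov.llnl.ontology.
public class NitfTitleSketch {

    /** Hypothetical helper: builds a DOM Document from XML text without validation. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Hypothetical helper: trimmed text of the first title element, or null if absent. */
    static String firstTitle(Document document) {
        NodeList titles = document.getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return null;
        }
        return titles.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        Document dom = parse("<nitf><head><title>Sample Headline</title></head><body/></nitf>");
        System.out.println(firstTitle(dom));
    }
}
```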


Copyright © 2010-2011. All Rights Reserved.