gov.llnl.ontology.text.corpora
Class PubMedDocumentReader

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by gov.llnl.ontology.text.corpora.PubMedDocumentReader
All Implemented Interfaces:
DocumentReader, ContentHandler, DTDHandler, EntityResolver, ErrorHandler

public class PubMedDocumentReader
extends DefaultHandler
implements DocumentReader

A DocumentReader for the PubMed corpus. PubMed is formatted as a series of documents in a single XML file. This DocumentReader works as a DefaultHandler for the SAXParser and reads one full document per call to readDocument(java.lang.String, java.lang.String). Text in NameOfSubstance tags provides the document labels, text in ArticleTitle is the title, text in PMID serves as the id and key value, and text in Abstract is the raw document text.

This class is not thread safe.
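The tag-to-field mapping described above can be illustrated with a minimal sketch built only on the JDK's SAX classes. It is not the actual implementation of this class: PubMedSketch, its fields, and its parse helper are hypothetical names introduced for illustration, and the sample XML in main is an assumed, heavily trimmed citation. The sketch assumes that the first PMID element holds the id and that everything nested inside Abstract belongs to the raw text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in, NOT the real PubMedDocumentReader.
class PubMedSketch extends DefaultHandler {
    // Values collected while the parser walks one citation.
    String id = "";
    String title = "";
    String text = "";
    final List<String> labels = new ArrayList<>();

    private final StringBuilder buffer = new StringBuilder();
    private boolean inAbstract = false;

    @Override
    public void startElement(String uri, String localName, String name, Attributes atts) {
        if (name.equals("Abstract")) {
            inAbstract = true;
            buffer.setLength(0);
        } else if (!inAbstract) {
            // Outside Abstract, each element starts with a fresh buffer.
            buffer.setLength(0);
        }
        // Inside Abstract, keep accumulating across nested elements.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String name) {
        String value = buffer.toString().trim();
        switch (name) {
            case "PMID":
                if (id.isEmpty()) id = value; // assumption: first PMID wins
                break;
            case "ArticleTitle":
                title = value;
                break;
            case "NameOfSubstance":
                labels.add(value);
                break;
            case "Abstract":
                text = value;
                inAbstract = false;
                break;
            default:
                break;
        }
    }

    // Parses one citation with a fresh handler; a new handler per call
    // avoids sharing mutable state between parses.
    static PubMedSketch parse(String xml) {
        PubMedSketch handler = new PubMedSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse citation", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        // Assumed sample input, trimmed to the four tags of interest.
        String xml = "<MedlineCitation><PMID>12345</PMID>"
            + "<Article><ArticleTitle>A sample title</ArticleTitle>"
            + "<Abstract><AbstractText>Some abstract text.</AbstractText></Abstract></Article>"
            + "<ChemicalList><Chemical><NameOfSubstance>Aspirin</NameOfSubstance>"
            + "</Chemical></ChemicalList></MedlineCitation>";
        PubMedSketch doc = parse(xml);
        System.out.println(doc.id + " | " + doc.title + " | " + doc.labels + " | " + doc.text);
    }
}
```

Because the handler keeps its parsing state in instance fields, sharing one instance across threads would interleave that state. A class built this way is not thread safe for that reason, though whether PubMedDocumentReader is structured the same way is an assumption.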

Author:
Keith Stevens

Constructor Summary
PubMedDocumentReader()
          Creates a new PubMedDocumentReader.
 
Method Summary
 void characters(char[] ch, int start, int length)
           
 void endElement(String uri, String localName, String name)
           
 Document readDocument(String originalText)
          Returns a Document represented by the given string.
 Document readDocument(String originalText, String corpusName)
          Returns a Document represented by the given string and uses corpusName as the corpus name for the returned Document.
 void startElement(String uri, String localName, String name, Attributes atts)
           
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endDocument, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startDocument, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PubMedDocumentReader

public PubMedDocumentReader()
Creates a new PubMedDocumentReader.

Method Detail

readDocument

public Document readDocument(String originalText,
                             String corpusName)
Returns a Document represented by the given string and uses corpusName as the corpus name for the returned Document.

Specified by:
readDocument in interface DocumentReader

readDocument

public Document readDocument(String originalText)
Returns a Document represented by the given string.

Specified by:
readDocument in interface DocumentReader

startElement

public void startElement(String uri,
                         String localName,
                         String name,
                         Attributes atts)
                  throws SAXException
Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class DefaultHandler
Throws:
SAXException

characters

public void characters(char[] ch,
                       int start,
                       int length)
                throws SAXException
Specified by:
characters in interface ContentHandler
Overrides:
characters in class DefaultHandler
Throws:
SAXException

endElement

public void endElement(String uri,
                       String localName,
                       String name)
                throws SAXException
Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class DefaultHandler
Throws:
SAXException


Copyright © 2010-2011. All Rights Reserved.