DocumentExtractor (Marvin API documentation (c) 1998-2014 ChemAxon Ltd.)

java.lang.Object
- chemaxon.naming.DocumentExtractor

Deprecated.
Use chemaxon.formats.MolImporter or chemaxon.naming.DocumentToStructure instead.
```
public class DocumentExtractor
extends Object
```
Extracts chemical names from text documents and converts them to chemical structures. Example usage:
```
// We have a document to process
java.io.Reader document = ...;

DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format

// Iterate through the hits (using a Java 1.5 feature; otherwise use a java.util.Iterator)
for (Hit hit : x.getHits()) {
  System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}
```
The field hit.position contains the position of the first character of the name in the document.
Note that hit.text contains the name exactly as it appears in the source document. A cleaned version, with possible OCR errors, typos and similar defects corrected, can be retrieved with hit.structure.getName().
This class can also be run from the command line. It expects the name of a plain text file as the first argument, or reads from standard input when no argument is given. The list of hits is printed to standard output.
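
The relationship between hit.position and hit.text described above can be illustrated with a short sketch. It does not use the ChemAxon library: SimpleHit is a hypothetical stand-in for DocumentExtractor.Hit that carries only the two fields of interest, and the sketch assumes that the position is a zero-based character offset, which this page does not state explicitly. The helper context is likewise an illustration, not part of the API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class HitContextSketch {

    // Hypothetical stand-in for DocumentExtractor.Hit with only the fields used here.
    public static class SimpleHit {
        public final int position;  // assumed: zero-based offset of the name's first character
        public final String text;   // the name exactly as it appears in the document

        public SimpleHit(int position, String text) {
            this.position = position;
            this.text = text;
        }
    }

    // Returns the name together with up to `margin` characters of context on each side.
    public static String context(String document, SimpleHit hit, int margin) {
        int start = Math.max(0, hit.position - margin);
        int end = Math.min(document.length(), hit.position + hit.text.length() + margin);
        return document.substring(start, end);
    }

    public static void main(String[] args) {
        String document = "Dissolve the aspirin in ethanol and stir.";
        List<SimpleHit> hits = new ArrayList<SimpleHit>();
        hits.add(new SimpleHit(document.indexOf("aspirin"), "aspirin"));
        hits.add(new SimpleHit(document.indexOf("ethanol"), "ethanol"));

        // Explicit java.util.Iterator loop, the alternative to the for-each form
        for (Iterator<SimpleHit> it = hits.iterator(); it.hasNext();) {
            SimpleHit hit = it.next();
            System.out.println(hit.position + ": " + context(document, hit, 4));
        }
    }
}
```

Because hit.text is the uncorrected source text, its length is the right one to use when slicing the original document; the cleaned name may differ in length.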
Author:

Daniel Bonniot

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`class`	`DocumentExtractor.Hit` Deprecated. An occurrence of a chemical name in the processed document.
`static class`	`DocumentExtractor.ProgressInfo` Deprecated.
`static interface`	`DocumentExtractor.ProgressListener` Deprecated.

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`propertyPage` Deprecated.
`static String`	`propertySourceDocument` Deprecated.

Constructor Summary

Constructors
Constructor and Description
`DocumentExtractor()` Deprecated. Creates a new document extractor.
`DocumentExtractor(File document)` Deprecated. If the file name ends with ".gz", the content will be uncompressed automatically.
`DocumentExtractor(File document, String encoding)` Deprecated. If the file name ends with ".gz", the content will be uncompressed automatically.
`DocumentExtractor(Reader r)` Deprecated.
`DocumentExtractor(String text)` Deprecated. Extract structures from a String.
`DocumentExtractor(URL document)` Deprecated.
`DocumentExtractor(URLConnection document)` Deprecated. Create a document extractor for the given URL connection.
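
The ".gz" behaviour of the two File constructors can be approximated with standard Java I/O. The sketch below is not the library's implementation: openDocument is a hypothetical helper that wraps the file in a GZIPInputStream when its name ends with ".gz" and returns a Reader, which could then be passed to DocumentExtractor(Reader). The use of UTF-8 in the demonstration is also an assumption.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDocumentSketch {

    // Hypothetical helper: opens a document, decompressing it when the name ends with ".gz".
    public static Reader openDocument(File file, String encoding) throws IOException {
        InputStream in = new FileInputStream(file);
        try {
            if (file.getName().endsWith(".gz")) {
                in = new GZIPInputStream(in);
            }
            return new InputStreamReader(in, encoding);
        } catch (IOException e) {
            in.close();  // do not leak the file handle if wrapping fails
            throw e;
        }
    }

    // Reads everything from the reader into a String (suitable for small documents only).
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    // Demonstration: writes the text to a temporary .gz file and reads it back.
    public static String roundTrip(String text) {
        try {
            File file = File.createTempFile("document", ".txt.gz");
            file.deleteOnExit();
            try (OutputStream out = new GZIPOutputStream(new FileOutputStream(file))) {
                out.write(text.getBytes("UTF-8"));
            }
            try (Reader reader = openDocument(file, "UTF-8")) {
                return readAll(reader);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("2-acetoxybenzoic acid"));
    }
}
```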

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`acceptElements(boolean on)` Deprecated.
`void`	`acceptGenericNames(boolean on)` Deprecated. Whether to accept generic, frequent names like "water".
`void`	`acceptGroups(boolean on)` Deprecated.
`void`	`acceptIons(boolean on)` Deprecated.
`void`	`clearHits()` Deprecated. Clears the list of hits.
`List<DocumentExtractor.Hit>`	`getHits()` Deprecated. Returns the hits found in the documents processed so far.
`static void`	`main(String[] args)` Deprecated. Expects the name of a plain text file as the first argument, or reads from standard input when no argument is given.
`static void`	`printEncodingError()` Deprecated.
`void`	`processHTML()` Deprecated. Extract names from an HTML document.
`void`	`processHTML(DocumentExtractor.ProgressListener progressListener)` Deprecated. Extract names from an HTML document.
`void`	`processHTML(Reader r)` Deprecated. Extract names from an HTML document.
`void`	`processPlainText()` Deprecated. Extract names from a plain text document.
`void`	`processPlainText(DocumentExtractor.ProgressListener progressListener)` Deprecated. Extract names from a plain text document.
`void`	`processPlainText(Reader r)` Deprecated. Extract names from a plain text document.
`static DocumentExtractor`	`readPDF(File pdf)` Deprecated. Creates a DocumentExtractor to process the given PDF document.
`static DocumentExtractor`	`readPDF(InputStream pdfStream)` Deprecated. Creates a DocumentExtractor to process the given PDF document.
`void`	`setCasNumberLookup(boolean value)` Deprecated. Enable or disable the lookup of CAS numbers (requires network access).

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - propertySourceDocument
```
public static final String propertySourceDocument
```
    Deprecated.
    
    See Also:
    Constant Field Values
  - propertyPage
```
public static final String propertyPage
```
    Deprecated.
    
    See Also:
    Constant Field Values
- Constructor Detail
  - DocumentExtractor
```
public DocumentExtractor()
```
    Deprecated.
    
    Creates a new document extractor.
  - DocumentExtractor
```
public DocumentExtractor(File document)
                  throws IOException
```
    Deprecated.
    
    If the file name ends with ".gz", the content will be uncompressed automatically. In that case, however, getEstimatedTotalCharacters() will return a wrong value (a fix is not yet implemented; request it if needed).
    
    Throws:
    
    IOException
  - DocumentExtractor
```
public DocumentExtractor(File document,
                 String encoding)
                  throws IOException
```
    Deprecated.
    
    If the file name ends with ".gz", the content will be uncompressed automatically. In that case, however, getEstimatedTotalCharacters() will return a wrong value (a fix is not yet implemented; request it if needed).
    
    Throws:
    
    IOException
  - DocumentExtractor
```
public DocumentExtractor(URL document)
                  throws IOException
```
    Deprecated.
    
    Throws:
    
    IOException
  - DocumentExtractor
```
public DocumentExtractor(URLConnection document)
                  throws IOException
```
    Deprecated.
    
    Create a document extractor for the given URL connection. This constructor is also useful when using a java.net.Proxy class, by using the URL.openConnection(java.net.Proxy) method to obtain the URLConnection.
    
    Throws:
    
    IOException
  - DocumentExtractor
```
public DocumentExtractor(Reader r)
```
    Deprecated.
  - DocumentExtractor
```
public DocumentExtractor(String text)
```
    Deprecated.
    
    Extract structures from a String.
    
    Since:
    
    5.8
- Method Detail
  - setCasNumberLookup
```
public void setCasNumberLookup(boolean value)
```
    Deprecated.
    
    Enable or disable the lookup of CAS numbers (requires network access). Disabled by default.
  - acceptElements
```
public void acceptElements(boolean on)
```
    Deprecated.
  - acceptIons
```
public void acceptIons(boolean on)
```
    Deprecated.
  - acceptGenericNames
```
public void acceptGenericNames(boolean on)
```
    Deprecated.
    
    Whether to accept generic, frequent names like "water".
  - acceptGroups
```
public void acceptGroups(boolean on)
```
    Deprecated.
  - main
```
public static void main(String[] args)
```
    Deprecated.
    
    Expects the name of a plain text file as the first argument, or reads from standard input when no argument is given. The list of hits is printed to standard output.
  - printEncodingError
```
public static void printEncodingError()
```
    Deprecated.
  - processPlainText
```
public void processPlainText(Reader r)
                      throws IOException
```
    Deprecated.
    
    Extract names from a plain text document. Buffering is done internally, so passing a BufferedReader is not necessary.
    
    Throws:
    
    IOException
  - processPlainText
```
public void processPlainText()
                      throws IOException
```
    Deprecated.
    
    Extract names from a plain text document.
    
    Throws:
    
    IOException
  - processPlainText
```
public void processPlainText(DocumentExtractor.ProgressListener progressListener)
                      throws IOException
```
    Deprecated.
    
    Extract names from a plain text document. Buffering is done internally, so passing a BufferedReader is not necessary.
    
    Throws:
    
    IOException
  - processHTML
```
public void processHTML(Reader r)
                 throws IOException
```
    Deprecated.
    
    Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.
    
    Throws:
    
    IOException
  - processHTML
```
public void processHTML()
                 throws IOException
```
    Deprecated.
    
    Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.
    
    Throws:
    
    IOException
  - processHTML
```
public void processHTML(DocumentExtractor.ProgressListener progressListener)
                 throws IOException
```
    Deprecated.
    
    Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.
    
    Throws:
    
    IOException
  - getHits
```
public List<DocumentExtractor.Hit> getHits()
```
    Deprecated.
    
    Returns the hits found in the documents processed so far.
  - clearHits
```
public void clearHits()
```
    Deprecated.
    
    Clears the list of hits.
  - readPDF
```
public static DocumentExtractor readPDF(File pdf)
                                 throws IOException
```
    Deprecated.
    
    Creates a DocumentExtractor to process the given PDF document.
    
    Throws:
    
    IOException
  - readPDF
```
public static DocumentExtractor readPDF(InputStream pdfStream)
                                 throws IOException
```
    Deprecated.
    
    Creates a DocumentExtractor to process the given PDF document.
    
    Throws:
    
    IOException
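
For the DocumentExtractor(URLConnection) constructor described in the Constructor Detail, the connection can be obtained through a proxy with the standard URL.openConnection(java.net.Proxy) method. The sketch below covers only that standard-library step. The host names and port are placeholders, viaProxy is a hypothetical helper, and the call to the constructor is left as a comment so that the example runs without the ChemAxon library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

public class ProxyConnectionSketch {

    // Builds a URLConnection routed through an HTTP proxy. Nothing is sent over the
    // network here: the connection is established only when its content is requested.
    public static URLConnection viaProxy(String url, String proxyHost, int proxyPort) {
        try {
            Proxy proxy = new Proxy(Proxy.Type.HTTP,
                    InetSocketAddress.createUnresolved(proxyHost, proxyPort));
            return new URL(url).openConnection(proxy);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder addresses; replace with a real document URL and proxy.
        URLConnection connection =
                viaProxy("http://example.com/paper.html", "proxy.example.com", 8080);
        System.out.println(connection.getURL());

        // With the ChemAxon library on the classpath, the connection would be used as:
        // DocumentExtractor x = new DocumentExtractor(connection);
        // x.processHTML();  // assumed to process the document given to the constructor
    }
}
```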
