The DocumentExtractor class extracts chemical names from text documents and converts them to chemical structures.
Example usage:
// We have a document to process
java.io.Reader document = ...;
DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format

// Iterate through the hits
for (Hit hit : x.getHits()) {
    System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}
The field hit.position contains the position of the first character of the name in the document.
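As an illustration of how a position and a name relate to the document text, here is a minimal sketch that is not part of the library. It assumes the position is a zero-based character offset into a plain text document held in a String, which the description above does not confirm. HitContext and around are hypothetical names introduced only for this example.

```java
public class HitContext {
    /**
     * Returns the name found at the given offset together with up to
     * `margin` characters of surrounding text on each side.
     * Assumption: `position` is a zero-based character offset.
     */
    public static String around(String document, int position, String name, int margin) {
        int start = Math.max(0, position - margin);
        int end = Math.min(document.length(), position + name.length() + margin);
        return document.substring(start, end);
    }

    public static void main(String[] args) {
        String document = "Dissolve the aspirin in warm ethanol.";
        // Stands in for hit.position and hit.text from a real extraction run
        int position = document.indexOf("aspirin");
        System.out.println(around(document, position, "aspirin", 4));
    }
}
```

With real hits, the same helper could be called as around(text, hit.position, hit.text, 20) to show each name in context, provided the offset assumption holds for the input format used.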
Note that hit.text contains the name as it appears in the source document. A cleaned version (with possible OCR errors, typos, and similar defects corrected) can be retrieved with hit.structure.getName().
This class can also be called from the command line. It then expects the name of a plain text file as the first argument; when that argument is absent, the document is read from standard input. The list of hits is printed to standard output.
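The file-or-standard-input behaviour described above can be reproduced in your own code with the standard library alone. The following is a rough sketch, not the class's actual implementation: InputSelection and open are hypothetical names, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class InputSelection {
    /**
     * Opens the file named by the first argument, or wraps the fallback
     * stream when no argument is given. UTF-8 is an assumption.
     */
    public static Reader open(String[] args, InputStream fallback) throws IOException {
        if (args.length > 0) {
            return Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
        }
        return new BufferedReader(new InputStreamReader(fallback, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        try (Reader document = open(args, System.in)) {
            // A real program would pass `document` to processPlainText(document);
            // this sketch only counts the characters it reads.
            int count = 0;
            while (document.read() != -1) {
                count++;
            }
            System.out.println(count + " characters read");
        }
    }
}
```

Passing the fallback stream as a parameter, rather than reading System.in directly inside open, keeps the selection logic easy to exercise without a terminal.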