Document to Structure Developer's Guide

Version 14.9.29.0


Introduction

The DocumentExtractor class extracts chemical names from text documents and converts them to chemical structures.

 

Basic API usage

Example usage:

// We have a document to process
java.io.Reader document = ...;

DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format

// Iterate through the hits
for (Hit hit : x.getHits()) {
  System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}

The field hit.position contains the position of the first character of the name in the document.

Note that hit.text contains the name exactly as it appears in the source document. A cleaned version, with possible OCR errors and typos corrected, can be retrieved with hit.structure.getName().
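As an illustration of how hit.position and hit.text can be combined, the following minimal sketch marks each name in its surrounding text. It does not use the DocumentExtractor API: SimpleHit and HitContextDemo are hypothetical stand-ins that model only the two fields described above, and the sketch assumes that the position is a zero-based character offset into the same string the names were found in, which this guide does not state explicitly.

```java
/** Hypothetical stand-in for a hit; it models only the position and text fields. */
final class SimpleHit {
    final int position; // offset of the first character of the name (assumed zero-based)
    final String text;  // the name as it appears in the document

    SimpleHit(int position, String text) {
        this.position = position;
        this.text = text;
    }
}

class HitContextDemo {
    /** Returns the name in brackets, surrounded by up to radius characters of context. */
    static String context(String document, SimpleHit hit, int radius) {
        int start = hit.position;
        int end = start + hit.text.length();
        if (start < 0 || end > document.length()) {
            throw new IllegalArgumentException("hit lies outside the document");
        }
        int from = Math.max(0, start - radius);
        int to = Math.min(document.length(), end + radius);
        return document.substring(from, start)
                + "[" + document.substring(start, end) + "]"
                + document.substring(end, to);
    }

    public static void main(String[] args) {
        // Hand-written sample data, not the output of a real extraction run
        String document = "Dissolve aspirin in ethanol.";
        SimpleHit[] hits = { new SimpleHit(9, "aspirin"), new SimpleHit(20, "ethanol") };
        for (SimpleHit hit : hits) {
            System.out.println(hit.position + ": " + context(document, hit, 5));
        }
    }
}
```

With real hits, the same idea would apply to hit.position and hit.text directly, provided the offset assumption holds for the input format in use.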

This class can also be run from the command line. It expects the name of a plain text file as its first argument; when no argument is given, it reads the document from standard input. The list of hits is printed to standard output.
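The "file name or standard input" convention described above can be reproduced in your own wrapper code. The sketch below is not the class's actual command-line implementation; InputSelectionDemo and openInput are hypothetical names, and UTF-8 is an assumed encoding. It only shows one way to obtain a java.io.Reader that could then be passed to processPlainText(document).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class InputSelectionDemo {
    /**
     * Opens the file named by the first argument, or wraps the fallback
     * stream (typically System.in) when no argument is given.
     */
    static Reader openInput(String[] args, InputStream fallback) throws IOException {
        if (args.length > 0) {
            return Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
        }
        return new InputStreamReader(fallback, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        try (Reader document = openInput(args, System.in)) {
            // In real use, the reader would be handed to the extractor here.
            // This sketch just counts characters to show that the input is readable.
            int count = 0;
            while (document.read() != -1) {
                count++;
            }
            System.out.println("Read " + count + " characters");
        }
    }
}
```

Passing the fallback stream as a parameter, rather than referring to System.in inside openInput, keeps the method easy to exercise with in-memory data.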

 

Copyright © 1999-2014 ChemAxon Ltd. All rights reserved.