Document to Structure Developer's Guide

Version 14.9.29.0

Introduction
Basic API usage

Introduction

The DocumentExtractor class extracts chemical names from text documents and converts them to chemical structures.

Basic API usage

Example usage:

// We have a document to process
java.io.Reader document = ...;

DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format

// Iterate through the hits
for (Hit hit : x.getHits()) {
  System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}
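The `document` reader in the example can come from any character source. The sketch below shows two common ways to obtain a `java.io.Reader` using only the Java standard library. The class name `ReaderSources`, its method names, and the choice of UTF-8 are assumptions made for illustration and are not part of the Document to Structure API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the library: builds readers that could be
// passed to processHTML(...) or processPlainText(...).
final class ReaderSources {
    private ReaderSources() {}

    /** Wraps text that is already held in memory. */
    static Reader fromString(String text) {
        return new StringReader(text);
    }

    /** Opens a file for reading, decoding it as UTF-8 (an assumed encoding). */
    static Reader fromFile(Path path) throws IOException {
        return Files.newBufferedReader(path, StandardCharsets.UTF_8);
    }
}
```

The caller remains responsible for closing a file-backed reader, for instance with a try-with-resources block around the call that processes the document.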

The field hit.position contains the position of the first character of the name in the document.
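A position is mostly useful for showing a hit in context. The following is a minimal sketch, assuming that the position is a zero-based character offset into the same text that was handed to the extractor; this guide does not state the indexing base or how HTML markup is counted, so check this against your own input. `HitContext` and `around` are hypothetical names introduced here and are not part of the library.

```java
// Hypothetical helper, not part of the library: cuts a window of text
// around a hit so that it can be displayed with its surroundings.
final class HitContext {
    private HitContext() {}

    /**
     * Returns the hit plus up to `radius` characters on each side.
     *
     * @param documentText the full text that was processed
     * @param position     offset of the first character of the hit (assumed zero-based)
     * @param length       number of characters in the hit
     * @param radius       maximum number of context characters on each side
     */
    static String around(String documentText, int position, int length, int radius) {
        if (position < 0 || position > documentText.length() || length < 0 || radius < 0) {
            throw new IllegalArgumentException("position, length or radius out of range");
        }
        // Clamp the window to the bounds of the document.
        int start = Math.max(0, position - radius);
        int end = Math.min(documentText.length(), position + length + radius);
        return documentText.substring(start, end);
    }
}
```

Inside the loop over the hits, a call might look like `HitContext.around(text, hit.position, hit.text.length(), 30)`, assuming `hit.position` is an `int`, `hit.text` is a `String`, and `text` holds the document content as a `String`.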

Note that hit.text contains the name exactly as it appears in the source document. A cleaned-up version, with possible OCR errors, typos, and similar defects corrected, can be retrieved with hit.structure.getName().

This class can also be run from the command line. It then expects the name of a plain text file as the first argument; when no argument is given, it reads the document from standard input. The list of hits is printed to standard output.
