Document to Structure Developer's Guide

Version 14.9.29.0

Introduction

The Document to Structure product finds chemical structures in documents. Chemical names in the text of a document, structures embedded in Office documents, and images of structure drawings are all supported (see the user documentation for more details). The extracted structures can then be exported to any supported molecule format or manipulated in memory.

 

Basic API usage

Document to Structure plugs into the generic IO API of ChemAxon. This means that a document can be used as a source for importing structures in exactly the same way as any other molecular format (sdf, ...).

Example usage:

// We have a document to process
File document = new File("document.pdf");

MolImporter importer = new MolImporter(document, "d2s");

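// Iterate through the hits
for (Molecule m : importer) {
  // Structure of the hit, exported as SMILES
  String smiles = m.toFormat("smiles");
  // Cleaned version of the name
  String name = m.getName();
  // Name exactly as it appears in the source document
  String sourceText = m.getProperty(DocumentToStructure.SOURCE_TEXT);
  //...
}
// Release the underlying file once all hits have been read
importer.close();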

The exact same code can be used to import an XML file, a Microsoft Office document, and other supported document types. The format is detected automatically.

The list of all available properties can be found in the API. Which properties are available depends on the document format. For instance, in text formats like xml, html and txt, the number of characters since the beginning of the file is available as DocumentToStructure.CHARACTER, whereas this property has no value for a binary format.

Note that SOURCE_TEXT contains the name exactly as it appears in the source document. A cleaned version, with possible OCR errors, typos, ... corrected, can be retrieved with m.getName().

Processing text directly

When the text to convert is given as a String object, the MolImporter object can be constructed with:
String text = ...;
MolImporter importer = DocumentToStructure.process(text);

Configuring behavior

Document to Structure accepts options to configure how it behaves. All Name to Structure options can be used with Document to Structure as well, to configure which name conversions are attempted. For instance, by default elements and ions are not converted when using d2s, as they may occur often in documents and are not always useful. However, their conversion can be enabled using:
MolImporter importer = new MolImporter(document, "d2s:+elements,+ions");
Document to Structure has specific options as well. Each option can be preceded by a minus sign - (for instance -smiles) to disable it. Both forms smiles and +smiles are accepted to enable an option.
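
When several options are switched on or off programmatically, the format string can be assembled from a map instead of being written by hand. The following is a minimal sketch only: D2sOptions and its build method are hypothetical helpers, not part of the ChemAxon API, and it assumes that Name to Structure options and Document to Structure options can be combined in a single comma-separated list after the d2s: prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of the ChemAxon API.
final class D2sOptions {
    private D2sOptions() {}

    // Builds a format string such as "d2s:+elements,-smiles".
    // Assumption: all options may share one comma-separated list.
    static String build(Map<String, Boolean> options) {
        if (options.isEmpty()) {
            return "d2s"; // no options: plain format name
        }
        StringJoiner joined = new StringJoiner(",", "d2s:", "");
        for (Map.Entry<String, Boolean> option : options.entrySet()) {
            // '+' enables an option, '-' disables it
            joined.add((option.getValue() ? "+" : "-") + option.getKey());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order
        Map<String, Boolean> options = new LinkedHashMap<>();
        options.put("elements", true);
        options.put("ions", true);
        options.put("smiles", false);
        System.out.println(build(options)); // prints d2s:+elements,+ions,-smiles
    }
}
```

The resulting string would be passed as the second argument of the MolImporter constructor, in place of the literal "d2s:+elements,+ions" used above.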

Monitoring progress

For estimating the progress of converting a document, you can use the standard method MolImporter.estimateNumRecords().
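
Because the record count is only an estimate, a progress display should tolerate the actual number of hits exceeding it. The following is a minimal sketch under that assumption: ImportProgress is a hypothetical helper, not part of the ChemAxon API. It expects the value returned by estimateNumRecords() as estimatedTotal and treats a non-positive estimate as "unknown", which is also an assumption.

```java
// Hypothetical helper for illustration; not part of the ChemAxon API.
final class ImportProgress {
    private ImportProgress() {}

    // Returns the progress as a percentage between 0 and 100,
    // or -1 when no usable estimate is available.
    static int percent(long processed, long estimatedTotal) {
        if (estimatedTotal <= 0) {
            return -1; // assumption: non-positive means "unknown"
        }
        long pct = processed * 100 / estimatedTotal;
        // Clamp, since the real record count may differ from the estimate
        return (int) Math.max(0, Math.min(100, pct));
    }

    public static void main(String[] args) {
        System.out.println(percent(25, 200));  // prints 12
        System.out.println(percent(300, 200)); // prints 100
        System.out.println(percent(5, 0));     // prints -1
    }
}
```

In the import loop, processed would be a counter incremented for each molecule read.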

Command line usage

Document to Structure can be used like any other import file format. For instance, it can be used from the command line by running MolConverter on a document in a format supported by Document to Structure:
molconvert sdf document.doc -o structures.sdf
 

Copyright © 1999-2014 ChemAxon Ltd.    All rights reserved.