pdfx v1.0 (Fully-automated PDF-to-XML conversion of scientific text)

Posted by: bluesyemre | January 2, 2013

PDFX is a fully automated PDF-to-XML converter for scientific articles. It takes a full-text PDF article as input and outputs the hierarchy of its distinct logical elements in XML format.
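As a rough illustration of the PDF-in, XML-out workflow, here is a minimal Python sketch that prepares an HTTP request carrying a PDF. It rests on assumptions this post does not confirm: that the service at the address given at the end of this post accepts the raw PDF bytes as a POST body with an `application/pdf` content type, and that it is still online. The function names `build_pdfx_request` and `convert` are hypothetical.

```python
import urllib.request

# Address taken from this post. That it accepts a raw PDF via POST
# is an assumption, not something the post states.
PDFX_URL = "http://pdfx.cs.man.ac.uk"


def build_pdfx_request(pdf_bytes: bytes, url: str = PDFX_URL) -> urllib.request.Request:
    """Prepare (but do not send) a POST request whose body is the PDF."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def convert(pdf_path: str, url: str = PDFX_URL) -> bytes:
    """Send the PDF and return the raw response body (hoped to be XML)."""
    with open(pdf_path, "rb") as fh:
        request = build_pdfx_request(fh.read(), url)
    # Conversion may be slow for long documents, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read()
```

Building the request separately from sending it makes it possible to inspect the headers and body without touching the network.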

The elements that PDFX can currently extract are:

Front Matter

title, abstract, author, author footnote

Body Matter

body text, h1, h2, h3, image, table, figure/table caption, figure/table reference, bibliographic item, bibliographic reference (citation)

Extras

header, footer, side note, page number, email, URI

Note: This system was designed for processing scientific articles. Virtually any PDF file is acceptable input, but output quality and processing time may suffer for other kinds of documents, e.g. entire books, slide presentations, or spreadsheets and other strictly tabular data.
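To check which of the element types listed above actually turn up in a given output file, a simple tag tally is often enough. The sketch below uses Python's standard `xml.etree.ElementTree`. The tag names in `SAMPLE` are invented for illustration and are not taken from PDFX's real schema, and `count_elements` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for converter output; the real tag names
# and nesting produced by PDFX may differ.
SAMPLE = """<article>
  <front><title>T</title><abstract>A</abstract></front>
  <body><h1>Intro</h1><p>text</p><p>more</p></body>
</article>"""


def count_elements(xml_text: str) -> Counter:
    """Count how often each tag occurs in the document, root included."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


if __name__ == "__main__":
    # Print tags from most to least frequent.
    for tag, n in count_elements(SAMPLE).most_common():
        print(f"{tag}: {n}")
```

A tally like this is also a quick sanity check for the inputs the note above warns about: a book or slide deck would likely show an unusual mix of elements compared with a typical article.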

http://pdfx.cs.man.ac.uk/