PDFX is a fully automated PDF-to-XML converter for scientific articles. It takes a full-text PDF article as input and outputs the hierarchy of its distinct logical elements in XML.
The elements that PDFX can currently extract are:
- Front Matter: title, abstract, author, author footnote
- Body Matter: body text, h1, h2, h3, image, table, figure/table caption, figure/table reference, bibliographic item, bibliographic reference (citation)
- Extras: header, footer, side note, page number, email, URI
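As a rough illustration of how such hierarchical XML output might be consumed downstream, here is a minimal Python sketch. The sample document and its tag names (`article`, `front`, `body`, `text`, and so on) are assumptions made up for this example and are not PDFX's actual schema; only the heading levels h1 to h3 echo the element list above. The helper functions `count_elements` and `outline` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: tag names are illustrative only, NOT PDFX's real schema.
SAMPLE_XML = """\
<article>
  <front>
    <title>An Example Article</title>
    <abstract>A short summary.</abstract>
    <author>A. Author</author>
  </front>
  <body>
    <h1>Introduction</h1>
    <text>First paragraph.</text>
    <h1>Methods</h1>
    <h2>Setup</h2>
    <text>Second paragraph.</text>
  </body>
</article>
"""


def count_elements(xml_text):
    """Return a Counter mapping each tag name to its number of occurrences."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


def outline(xml_text, heading_tags=("h1", "h2", "h3")):
    """Return (level, text) pairs for headings, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (int(el.tag[1]), (el.text or "").strip())
        for el in root.iter()
        if el.tag in heading_tags
    ]


counts = count_elements(SAMPLE_XML)
headings = outline(SAMPLE_XML)

# Print the headings as an indented outline, one indent step per level.
for level, text in headings:
    print("  " * (level - 1) + text)
```

With real converter output, the tag names and nesting would need to be adjusted to whatever the XML actually contains; the traversal pattern itself stays the same.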
Note: This system is designed for processing scientific articles. Virtually any PDF file is accepted as input, but output quality may degrade and processing time may increase for other kinds of documents, such as entire books, slide presentations, or spreadsheets and other strictly tabular data.