Posted by: bluesyemre | January 2, 2013

pdfx v1.0 (Fully-automated PDF-to-XML conversion of scientific text)

PDFX is a fully-automated PDF-to-XML converter for scientific articles. It takes a full-text PDF article as input and outputs the hierarchy of its distinct logical elements as XML.

The elements that PDFX can currently extract are:

  • Front Matter: title, abstract, author, author footnote

  • Body Matter: body text, h1, h2, h3, image, table, figure/table caption, figure/table reference, bibliographic item, bibliographic reference (citation)

  • Extras: header, footer, side note, page number, email, URI

Note: This system has been designed for processing scientific articles. While virtually any PDF file is acceptable input, output quality may degrade and processing time may increase for other kinds of documents, e.g. entire books, slide presentations, or spreadsheets and other strictly tabular data.
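
To give a feel for what working with this kind of hierarchical output could look like, here is a minimal Python sketch that reads a small XML document and pulls out the title and section headings. The sample XML and its tag names (`article`, `front`, `title`, `text`, and so on) are illustrative assumptions loosely based on the element list above, not PDFX's actual output schema, so the paths would need adjusting against a real PDFX result.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag names are assumptions for illustration,
# not the real PDFX output schema.
SAMPLE_XML = """\
<article>
  <front>
    <title>An Example Paper</title>
    <abstract>A short abstract.</abstract>
    <author>A. Author</author>
  </front>
  <body>
    <h1>Introduction</h1>
    <text>First paragraph.</text>
    <h1>Methods</h1>
    <h2>Data</h2>
    <text>Second paragraph.</text>
  </body>
</article>
"""

HEADING_TAGS = {"h1", "h2", "h3"}


def summarize(xml_text):
    """Return the title and the headings (in document order) of an article."""
    root = ET.fromstring(xml_text)
    # findtext takes a path relative to the root element.
    title = root.findtext("front/title")
    # iter() walks the whole tree in document order.
    headings = [(el.tag, el.text) for el in root.iter() if el.tag in HEADING_TAGS]
    return {"title": title, "headings": headings}


summary = summarize(SAMPLE_XML)
print(summary["title"])
for level, text in summary["headings"]:
    print(level, text)
```

The same pattern (parse once, then select elements by tag) would extend to captions, citations, or emails, assuming the real output exposes them as distinct elements.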

http://pdfx.cs.man.ac.uk/

