in reply to PDF Parsing
Use pdftohtml with the -xml option:
pdftohtml -xml file.pdf
In pdftohtml-0.36, this produces invalid XML, but a few regular expressions are enough to fix it up into valid XML. After that, process the result with your favorite XML parser. Mine is Twig.
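For illustration, here is a rough Python sketch of that fix-then-parse workflow, using the standard library's ElementTree in place of Twig. The two defects it repairs (bare ampersands and stray control characters) are assumptions about what might be wrong with the output, not a confirmed list of what pdftohtml-0.36 gets wrong, so check your own file and adjust the patterns. The function names and the sample string are hypothetical, and the `page`/`text` element names assume the usual layout of pdftohtml's XML output.

```python
import re
import xml.etree.ElementTree as ET

# Assumed defect 1: a bare "&" that is not the start of an entity reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")
# Assumed defect 2: control characters that XML 1.0 does not allow.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def repair_xml(raw: str) -> str:
    """Apply the regex fix-ups so the string can be parsed as XML."""
    cleaned = CONTROL_CHARS.sub("", raw)
    return BARE_AMP.sub("&amp;", cleaned)


def extract_text(raw: str) -> list[tuple[int, str]]:
    """Return (page number, text) pairs from repaired pdftohtml-style XML."""
    root = ET.fromstring(repair_xml(raw))
    lines = []
    for page in root.iter("page"):
        number = int(page.get("number", "0"))
        for text in page.iter("text"):
            # itertext() also collects text nested inside <b>/<i> children.
            lines.append((number, "".join(text.itertext())))
    return lines


# Hypothetical sample standing in for the contents of file.xml.
sample = (
    '<pdf2xml><page number="1">'
    '<text top="10" left="20">Q&A <b>notes</b>\x0c</text>'
    "</page></pdf2xml>"
)

if __name__ == "__main__":
    for number, line in extract_text(sample):
        print(number, line)
```

To run it on real output, read the generated XML file into a string and pass that to extract_text instead of the sample.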