Use pdftohtml with the -xml option:
pdftohtml -xml file.pdf
In pdftohtml-0.36 this produces invalid XML, but a few regular expressions are enough to fix it up into valid XML. After that, process the result with your favorite XML parser. Mine is Twig (XML::Twig).
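To make the fix-up step concrete, here is a minimal sketch in Python rather than Perl. The post does not say which defects pdftohtml-0.36 produces, so the two repairs below (escaping bare ampersands and dropping control characters that XML 1.0 forbids) are assumptions about typical breakage, not a list of the actual bugs. The names fix_xml, text_elements and SAMPLE are made up for illustration, and the pdf2xml/page/text element layout in the sample is likewise assumed.

```python
import re
import xml.etree.ElementTree as ET

# Assumed defect 1: a bare "&" that does not start one of the five
# predefined XML entities or a numeric character reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")

# Assumed defect 2: control characters that XML 1.0 does not allow
# (everything below 0x20 except tab, newline and carriage return).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def fix_xml(raw: str) -> str:
    """Apply the regex repairs and return (hopefully) well-formed XML."""
    cleaned = CONTROL_CHARS.sub("", raw)
    return BARE_AMP.sub("&amp;", cleaned)


def text_elements(xml_text: str) -> list[str]:
    """Parse repaired XML and return the text content of every <text> element."""
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()) for el in root.iter("text")]


# Hypothetical fragment in the shape of pdftohtml -xml output, with a bare
# ampersand and a stray form feed that would make a strict parser fail.
SAMPLE = (
    '<pdf2xml><page number="1">'
    '<text top="10" left="20">Smith & Sons\x0c</text>'
    '<text top="30" left="20">a &lt; b</text>'
    "</page></pdf2xml>"
)

if __name__ == "__main__":
    print(text_elements(fix_xml(SAMPLE)))
```

If the parser still rejects the repaired file, its error message gives the line and column of the next problem, which is usually enough to tell what further regular expression is needed.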
In reply to Re: PDF Parsing by toma, in thread PDF Parsing by weismat