Use pdftohtml with the -xml option:
pdftohtml -xml file.pdf
With pdftohtml 0.36, the output is not valid XML, but a few regular expressions are enough to repair it. After that, process the result with your favorite XML parser. Mine is XML::Twig.
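Here is a rough sketch of the fix-then-parse idea in Python, with the standard-library ElementTree standing in for Twig. I did not spell out above which defects need fixing, so this assumes, purely for illustration, that the only problem is bare `&` characters in the text. The `fix_xml` and `page_texts` names and the sample string are made up for the example. Real output may need other substitutions.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the only defect is a bare '&' that does not start an entity
# such as &amp; or &#38; or &#x26;. Adjust the pattern for what you see.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def fix_xml(raw: str) -> str:
    """Escape bare ampersands so the document becomes well-formed."""
    return BARE_AMP.sub("&amp;", raw)


def page_texts(raw: str):
    """Return (page number, text) pairs from repaired pdftohtml -xml output."""
    root = ET.fromstring(fix_xml(raw))
    return [
        (int(page.get("number")), "".join(text.itertext()))
        for page in root.iter("page")
        for text in page.iter("text")
    ]


# Hypothetical sample shaped like pdftohtml -xml output (page/text elements).
sample = (
    '<pdf2xml>'
    '<page number="1">'
    '<text top="10" left="20">Smith & Jones</text>'
    '<text top="30" left="20">a &lt; b</text>'
    '</page>'
    '</pdf2xml>'
)

for number, text in page_texts(sample):
    print(number, text)
```

The negative lookahead leaves entities that are already escaped alone, so running the fix twice does not double-escape anything.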
In reply to Re: PDF Parsing by toma, in thread PDF Parsing by weismat