The strategy is to use pdftotext.exe to convert PDF into text
*yuck*
If that works, more power to you. I have always ended up with inconsistently spaced blobs of text whenever I tried that route. My personal preference is to use pdftohtml.exe. I use the one included in Calibre Portable since it is actively updated. I use the following command line:

pdftohtml.exe -xml -zoom 1.4 [PDF FILE]
This will rip out all the text elements into an XML file, with attributes for the font, the x/y position on the page, and the width and height of each piece of text. (-zoom 1.4 scales the PDF's 72 points per inch to roughly 100 units per inch, which is why a letter-size page comes out as 850 x 1100.) Here is an example I am currently working with:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1100" width="850">
    <fontspec id="0" size="17" family="Times" color="#000000"/>
    <text top="103" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="120" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
    <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
    <text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
    <text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
    <text top="286" left="115" width="260" height="18" font="0">OH_GOD_IT_BURNS 2013-12-20</text>
    <text top="286" left="415" width="71" height="18" font="0">23:53:04</text>
    <text top="286" left="545" width="108" height="18" font="0">FOOBAR</text>
    <text top="320" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="336" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
  </page>
</pdf2xml>
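To make the structure of that output concrete, here is a minimal sketch that walks the XML and pulls out each text element with its position. It is written in Python with the standard library's xml.etree.ElementTree, not in Perl. The function name text_items and the trimmed-down SAMPLE are made up for illustration, and the XML declaration and DOCTYPE are left out of the sample to keep it short.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the shape shown above (hypothetical test data).
SAMPLE = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1100" width="850">
    <fontspec id="0" size="17" family="Times" color="#000000"/>
    <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
    <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
  </page>
</pdf2xml>"""


def text_items(xml_string):
    """Return one dict per <text> element: page number, position, and text."""
    root = ET.fromstring(xml_string)
    items = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            items.append({
                "page": int(page.get("number")),
                "top": int(node.get("top")),
                "left": int(node.get("left")),
                # itertext() also collects text inside nested markup,
                # in case a <text> element contains child tags.
                "text": "".join(node.itertext()),
            })
    return items


if __name__ == "__main__":
    for item in text_items(SAMPLE):
        print(item)
```

For a file on disk, ET.parse(path).getroot() would replace ET.fromstring and lets the parser honor the encoding named in the XML declaration.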
I can then use XML::Simple to slurp each <page> element into a hash and then use Test::More's eq_hash to compare my extracted data with my reference XML hash.
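The comparison step can be sketched roughly like this. It is a Python stand-in for the idea, not the XML::Simple and eq_hash code itself: each page is reduced to a sorted list of (top, left, text) tuples, and pages whose lists differ from the reference are reported. The names page_map and differing_pages and the sample strings are hypothetical, and comparing positions exactly is an assumption; a real test might want a pixel tolerance.

```python
import xml.etree.ElementTree as ET


def page_map(xml_string):
    """Map page number -> sorted list of (top, left, text) tuples."""
    pages = {}
    for page in ET.fromstring(xml_string).iter("page"):
        rows = [
            (int(t.get("top")), int(t.get("left")), "".join(t.itertext()))
            for t in page.iter("text")
        ]
        pages[int(page.get("number"))] = sorted(rows)
    return pages


def differing_pages(extracted_xml, reference_xml):
    """Return the sorted page numbers whose content differs or is missing."""
    got = page_map(extracted_xml)
    want = page_map(reference_xml)
    return sorted(n for n in got.keys() | want.keys() if got.get(n) != want.get(n))


# Hypothetical reference and a copy with one field altered.
REFERENCE = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1100" width="850">
    <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
    <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
  </page>
</pdf2xml>"""
CHANGED = REFERENCE.replace("Audit Billing", "Credit Billing")

if __name__ == "__main__":
    print(differing_pages(REFERENCE, REFERENCE))
    print(differing_pages(CHANGED, REFERENCE))
```

Reporting page numbers instead of a single pass/fail makes it easier to see where a regenerated PDF drifted from the reference.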
In reply to Re: PDF content and visuals testing best practices
by ateague
in thread PDF content and visuals testing best practices
by andreas1234567