in reply to PDF content and visuals testing best practices
The strategy is to use pdftotext.exe to convert PDF into text
*yuck*
If that works, more power to you. I always ended up with inconsistently spaced blobs of text when I tried that route. My personal preference is to use pdftohtml.exe. I use the one included in Calibre Portable since it is actively updated. I use the following command line:

    pdftohtml.exe -xml -zoom 1.4 [PDF FILE]
This will rip out all the text elements into an XML file with attributes for the font, x/y position on the page and text length. (-zoom 1.4 makes the positioning units 100 dpi). Here is an example I am currently working with:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
    <pdf2xml>
    <page number="1" position="absolute" top="0" left="0" height="1100" width="850">
        <fontspec id="0" size="17" family="Times" color="#000000"/>
        <text top="103" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="120" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
        <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
        <text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
        <text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
        <text top="286" left="115" width="260" height="18" font="0">OH_GOD_IT_BURNS 2013-12-20</text>
        <text top="286" left="415" width="71" height="18" font="0">23:53:04</text>
        <text top="286" left="545" width="108" height="18" font="0">FOOBAR</text>
        <text top="320" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="336" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    </page>
    </pdf2xml>
I can then use XML::Simple to slurp each <page> element into a hash and then use Test::More's eq_hash to compare my extracted data with my reference XML hash.
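For anyone who wants to try this, here is a rough sketch of that comparison step. It is not my exact test code: the sub names pages_from_xml and compare_pages are made up for illustration, and I have swapped in is_deeply for eq_hash because it prints a diagnostic showing where the two structures differ. ForceArray and KeyAttr are set so that a one-page document and a one-line page still parse into the same shape as larger ones.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Test::More;

# Parse pdftohtml -xml output (a file name or an XML string) into a
# list of per-page hashes. The sub name is hypothetical.
sub pages_from_xml {
    my ($xml) = @_;
    my $doc = XMLin(
        $xml,
        # Always produce array refs, even for a single page/text element.
        ForceArray => [qw(page fontspec text)],
        # Do not fold elements into hashes keyed on "id" or "name".
        KeyAttr    => [],
    );
    return @{ $doc->{page} || [] };
}

# Compare every page of the extracted XML against the reference XML.
# The sub name is hypothetical.
sub compare_pages {
    my ($got_xml, $want_xml) = @_;
    my @got  = pages_from_xml($got_xml);
    my @want = pages_from_xml($want_xml);

    is(scalar @got, scalar @want, 'same number of pages');
    for my $i (0 .. $#want) {
        is_deeply($got[$i], $want[$i],
            'page ' . ($i + 1) . ' matches reference');
    }
    return;
}
```

In a test script you would call something like compare_pages('output.xml', 'reference.xml') followed by done_testing(); both file names are placeholders. Since every attribute is compared as a string, a one-pixel shift in top or left will fail the test, which may or may not be what you want.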
Re^2: PDF content and visuals testing best practices
by ateague (Monk) on Dec 23, 2013 at 18:10 UTC
by andreas1234567 (Vicar) on Jan 03, 2014 at 09:20 UTC