tosaiju has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Is there a routine or parser that extracts text from a PDF together with its layout information, such as (x, y) coordinates, font size, location, width, etc.?

E.g., let's take a PDF that looks like this:
Name: XYZ  :                                            Date:dd/mm/yyyy
Address: QWER                                      Time:hh MM ss

and I would like to extract something like this:

(x-start:10, x-end:40, y:-100) = Name:
(x-start:45, x-end:100, y:-100) = XYZ
(x-start:300, x-end:340, y:-100) = Date:
(x-start:345, x-end:400, y:-100) = dd/mm/yyyy

(x-start:10, x-end:50, y:-50) = Address:
(x-start:55, x-end:100, y:-50) = QWER
(x-start:300, x-end:340, y:-50) = Time:
(x-start:345, x-end:390, y:-50) = hh MM ss


Many Thanks,

Replies are listed 'Best First'.
Re: PDF Parser
by LanX (Saint) on Mar 18, 2014 at 11:11 UTC
    My best bet is to use pdftohtml -xml and to parse the xml.

    See also Parsing PDFs by text position?

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Thx for the tip.

      Maybe I can solve one of my open problems this way: reconstruct the text of a book in Yiddish (accented Hebrew), where the accents are added by position. With pdftotext the accents appear at the end of the line.

        Well, while learning to read Yiddish is on my to-do list, I never thought about doing it via PDF ;)

        The C sources of pdftohtml are pretty compact calls into something like Ghostscript (IIRC)¹, so porting it to Perl in order to have tighter control shouldn't be a problem.

        HTH :)

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        Update:

        ¹ Nope, it's Xpdf, not Ghostscript! :)

Re: PDF Parser
by ateague (Monk) on Mar 18, 2014 at 14:32 UTC
    Just to comment on what LanX said:

    I use the following command line: pdftohtml.exe -xml -stdout -zoom 1.4 [PDF FILE]

    This will rip out all the text elements into XML, with attributes for the font, the top/left position on the page, and the width and height of each piece of text.
    (-zoom 1.4 makes the positioning units roughly 100 dpi, i.e. 72 dpi x 1.4; -stdout streams the output to STDOUT instead of writing it to a file.)

    Here is an example I am currently working with:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
    <pdf2xml>
    <page number="1" position="absolute" top="0" left="0" height="1100" width="850">
        <fontspec id="0" size="17" family="Times" color="#000000"/>
        <text top="103" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="120" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
        <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
        <text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
        <text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
        <text top="286" left="115" width="260" height="18" font="0">OH_GOD_IT_BURNS 2013-12-20</text>
        <text top="286" left="415" width="71" height="18" font="0">23:53:04</text>
        <text top="286" left="545" width="108" height="18" font="0">FOOBAR</text>
        <text top="320" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="336" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    </page>
    </pdf2xml>

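
    For the format asked for in the original question, the attributes map over fairly directly: x-start is left, x-end is left + width, and y is top. Here is a minimal sketch of that mapping, assuming XML::Twig is installed and using a cut-down copy of the sample (the hash keys and the @boxes array are my own names, not anything pdftohtml produces). Note that top is measured downward from the top of the page, so y comes out positive rather than negative as in the question.

```perl
use strict;
use warnings;
use XML::Twig;

# Cut-down copy of the sample output: two <text> elements only.
my $xml = <<'XML';
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1100" width="850">
<text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
<text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
</page>
</pdf2xml>
XML

my @boxes;    # one hash per <text> element (illustrative structure)

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for every <text> element as it is parsed.
        text => sub {
            my ($t, $elt) = @_;
            my $left = $elt->att('left');
            push @boxes, {
                x_start => $left,
                x_end   => $left + $elt->att('width'),
                y       => $elt->att('top'),
                text    => $elt->text,
            };
        },
    },
);
$twig->parse($xml);

# Print in the layout used in the question.
printf "(x-start:%d, x-end:%d, y:%d) = %s\n",
    @{$_}{qw(x_start x_end y text)} for @boxes;
```

    With the two elements above this prints one line per element, e.g. x-start 115 and x-end 218 for "ROUTE TO:".
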
    I can then use XML::Twig with XPath expressions to pull the exact xml nodes I want:
    use XML::Twig;

    open (my $XML, "-|", "pdftohtml.exe -xml -zoom 1.4 -stdout $PDF_FILE")
        or die "$!\n$^E";

    # We are only interested in the text for the "ROUTE TO:" and "SORT GROUP:" sections.
    # Set the twig_handlers to extract the <text> nodes of interest; all other nodes will be ignored.
    # XPath queries provide an extra 1/20 inch padding on all sides to take font and
    # rendering variations into account.
    # (RouteTo and InvoiceSort are handler subs defined elsewhere.)
    my $t = XML::Twig->new(
        twig_handlers => {
            '//text[(@top >= 180 and @top <= 190) and (@left >= 100 and @left <= 111)]' => \&RouteTo,
            '//text[(@top >= 215 and @top <= 225) and (@left >= 260 and @left <= 270)]' => \&InvoiceSort,
        },
        comments   => 'drop',     # remove any comments
        empty_tags => 'normal',   # empty tags = <tag/>
    );
    $t->parse($XML);
    $t->purge;
    close $XML;
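
    The handler subs themselves are not shown above, so here is a rough sketch of what they might look like. Everything in it is an assumption for illustration: the handler bodies, the %field hash, and the @regions table are hypothetical, and the coordinate ranges were picked to bracket the value column of the sample by 5 units. It is also a variant of the approach rather than the same thing: it uses a single handler on text and does the range checks in plain Perl instead of in the handler expression, and it parses an embedded XML string so that it runs without pdftohtml.

```perl
use strict;
use warnings;
use XML::Twig;

my %field;    # hypothetical store for the extracted values

# Hypothetical handler bodies: remember the text of the matching element.
sub RouteTo     { my ($twig, $elt) = @_; $field{route_to}     = $elt->text }
sub InvoiceSort { my ($twig, $elt) = @_; $field{invoice_sort} = $elt->text }

# Regions of interest: [top_min, top_max, left_min, left_max, handler].
# Assumed values, in the same units as the XML attributes.
my @regions = (
    [ 180, 190, 260, 270, \&RouteTo     ],
    [ 215, 225, 260, 270, \&InvoiceSort ],
);

# Cut-down copy of the sample output.
my $xml = <<'XML';
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1100" width="850">
<text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
<text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
<text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
<text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
</page>
</pdf2xml>
XML

my $t = XML::Twig->new(
    twig_handlers => {
        text => sub {
            my ($twig, $elt) = @_;
            my ($top, $left) = ($elt->att('top'), $elt->att('left'));
            for my $r (@regions) {
                my ($t0, $t1, $l0, $l1, $handler) = @$r;
                # Skip regions that do not contain this element's top/left corner.
                next unless $top  >= $t0 && $top  <= $t1
                         && $left >= $l0 && $left <= $l1;
                $handler->($twig, $elt);
            }
        },
    },
);
$t->parse($xml);

print "$_ = $field{$_}\n" for sort keys %field;
```

    Keeping the regions in a table makes it easy to adjust the padding in one place when a new batch of PDFs renders slightly differently.
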