in reply to How to Extract PDF tables using Perl

Wow 3016 .... have to adjust my clock again.

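The quick and dirty way to do this is to split the text on whitespace: my @array = split /\s+/, $text;

Now loop over the resulting array, taking three fields at a time (one label plus two values).

Like

while (@array) { my $col = shift @array; $date{$col} = [ shift @array, shift @array ]; }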

Untested!

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Re^2: How to Extract PDF tables using Perl
by perlPsycho (Initiate) on May 11, 2016 at 09:39 UTC

    :D :D yeah 3016

    Thanks for the reply.
    The problem here is that the table is dynamic.

    So there may be 3 labels or 30 labels, like Date, Value1 and Value2,
    or there may be many more.

    Some of them might be undefined.

    Are there any modules that might help me parse a PDF table?

    So far, CAM::PDF and PDF::API2 do not seem to have a feature for reading a table inside a PDF, only for creating a new one.

    Main problem: the values get mixed together and printed on a single line.
    1.) Some of these values might not be defined (just empty cells).

    And the labels keep changing, so they are not static at all.


    Any advice or ideas on modules, or on how to do it, please?

      I use Perl, but when trying to do something similar I found that python3 + pdfquery was easier to get working, and it did the column parsing...

      http://www.markhneedham.com/blog/2015/01/22/pythonpdfquery-scraping-the-fifa-world-player-of-the-year-votes-pdf-into-shape/

      I guess in a nutshell: loop over each page in the PDF and search for a matching string. If it is found, get its x,y coordinates, then use them in in_bbox(x, y, x2, y2) to scrape whatever other text is inside that bounding box. Because I wanted a "row", my bbox was x, y, x+500, y+10 (grid origin at bottom left?).
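
      A rough sketch of that workflow, for a single page. The pdfquery calls (PDFQuery, load, pq, attr, and the contains / in_bbox selectors) follow the pdfquery README as I remember it, so check them against the installed version. The names row_bbox and scrape_row and the default page number are made up for illustration, and the 500 x 10 strip is simply the size mentioned above.

```python
def row_bbox(x0, y0, width=500.0, height=10.0):
    """Return (x0, y0, x1, y1) for a thin strip starting at the label's corner."""
    return (x0, y0, x0 + width, y0 + height)


def scrape_row(pdf_path, label_text, page_number=0):
    """Find label_text on one page and return the text sharing its row, or None.

    Hypothetical helper: assumes pdfquery is installed and that the label
    appears as (part of) a horizontal text line on the given page.
    """
    import pdfquery  # third-party; imported here so row_bbox works without it

    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load(page_number)  # load only the requested page

    # Search for the matching string.
    label = pdf.pq('LTTextLineHorizontal:contains("%s")' % label_text)
    if not label:
        return None

    # Coordinates of the label's lower-left corner (attr, not get).
    x0 = float(label.attr('x0'))
    y0 = float(label.attr('y0'))

    # Everything that lies inside the strip to the right of the label.
    bbox = row_bbox(x0, y0)
    return pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % bbox).text()
```

      To cover a whole document you would call scrape_row once per page number. The strip size almost certainly needs tuning per PDF, since row height and table width vary.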

      I don't know how it really works, but I was able to copy/paste enough bits to get what I needed.

      Maybe PDF::API2 or something similar has a comparable in_bbox feature? Is it maybe like collision-detection logic, where given a bounding box you find all the text thingies that collide with it and return an array of those? I'm guessing out my a##.
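
      If it really is collision-detection style logic, it can be sketched without any PDF library at all. The snippet below is only a guess at the idea, not how pdfquery or any Perl module implements it. It assumes each text item has already been extracted somehow as a dict with text, x0, y0, x1, y1 keys (a made-up layout), with the origin at the bottom left.

```python
def inside(item, bbox):
    """True if the item's rectangle lies entirely within bbox."""
    x0, y0, x1, y1 = bbox
    return (item["x0"] >= x0 and item["y0"] >= y0
            and item["x1"] <= x1 and item["y1"] <= y1)


def overlaps(item, bbox):
    """True if the item's rectangle intersects bbox at all."""
    x0, y0, x1, y1 = bbox
    return (item["x0"] < x1 and item["x1"] > x0
            and item["y0"] < y1 and item["y1"] > y0)


def texts_in_row(items, bbox, match=inside):
    """Return the texts of all matching items, ordered left to right."""
    hits = [it for it in items if match(it, bbox)]
    return [it["text"] for it in sorted(hits, key=lambda it: it["x0"])]


# Made-up sample data: one header row and the start of the next row.
items = [
    {"text": "Value1", "x0": 150, "y0": 700, "x1": 190, "y1": 710},
    {"text": "Date", "x0": 50, "y0": 700, "x1": 80, "y1": 710},
    {"text": "Value2", "x0": 480, "y0": 700, "x1": 560, "y1": 710},
    {"text": "2016-05-11", "x0": 50, "y0": 680, "x1": 110, "y1": 690},
]
row = (50, 700, 550, 710)  # x, y, x+500, y+10

print(texts_in_row(items, row))            # containment only
print(texts_in_row(items, row, overlaps))  # anything touching the strip
```

      Containment drops a cell that sticks out past the right edge of the strip, while overlap keeps it. Which of the two in_bbox actually does is worth checking in the pdfquery docs. Because each hit keeps its x0, an empty cell shows up as a gap between columns instead of silently shifting the remaining values, which plain whitespace splitting cannot tell you.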

      sorry if this doesn't help

        doh - one minor note:
        x = float(thing.get('x0')) doesn't work(?) in the latest pdfquery. Per the pdfquery docs, use x = float(thing.attr('x0')); 'get' may have been replaced with 'attr'.

        Thank you so much, Anonymous Monk.

        Your reply is very valuable, and I hope we Perl developers get a piece of the action with a Perl version of the same module.
        Looking forward to it.
        Peace.