in reply to Re^2: How to Extract PDF tables using Perl
in thread How to Extract PDF tables using Perl
I use Perl, but when I tried to do something similar, I found that python3 + pdfquery was easier to get working, and it handled the column parsing...
In a nutshell: loop over each page in the PDF and search for the matching string. If it's found, get its x,y coordinates and feed them into in_bbox(x,y,x2,y2) to scrape whatever other text sits inside that bounding box. Because I wanted a "row", my bbox was x,y,x+500,y+10 (grid origin at bottom left?).
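Roughly, the steps above might look like the sketch below. It is not my original script: `row_bbox` and `scrape_rows` are names made up for illustration, the caller has to supply the page indices to scan, and the 500 x 10 box is just the size that worked for my PDF. The selectors follow the `:contains(...)` and `:in_bbox(...)` pattern shown in the pdfquery README.

```python
def row_bbox(x, y, width=500, height=10):
    """Build (x0, y0, x1, y1) starting at the match's lower-left corner."""
    return (x, y, x + width, y + height)


def scrape_rows(pdf_path, needle, page_indices):
    """Return (page_index, row_text) for every line containing `needle`.

    Hypothetical helper: `page_indices` is supplied by the caller,
    and `needle` is assumed not to contain double quotes.
    """
    # Third-party module (pip install pdfquery); imported here so that
    # row_bbox can be used on its own without it.
    import pdfquery

    rows = []
    pdf = pdfquery.PDFQuery(pdf_path)
    for page_index in page_indices:
        pdf.load(page_index)  # parse just this page
        matches = pdf.pq('LTTextLineHorizontal:contains("%s")' % needle)
        for el in matches:
            # x0/y0 are the lower-left corner of the matched text line
            x, y = float(el.get("x0")), float(el.get("y0"))
            selector = ('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")'
                        % row_bbox(x, y))
            rows.append((page_index, pdf.pq(selector).text()))
    return rows
```

Something like `scrape_rows("report.pdf", "Invoice", range(3))` would scan the first three pages of a hypothetical file. The box height will probably need tuning: as far as I can tell, in_bbox only returns text that fits entirely inside the box, so lines taller than 10 units would be missed.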
I don't know how it really works, but I was able to copy/paste enough bits together to get what I needed.
Maybe PDF::API2 or a similar module has a comparable in_bbox feature? Is it something like collision-detection logic, where given a bounding box you find all the text objects that collide with it and return an array of them? I'm only guessing here.
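To make that guess concrete, here is a small pure-Python sketch of the bounding-box idea. It does not show how pdfquery or any Perl module is implemented; `Box`, `inside`, `overlaps`, and `texts_in_bbox` are hypothetical names, and the coordinates are invented sample data with the origin at the bottom left.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def inside(inner, outer):
    """True if `inner` lies entirely within `outer`."""
    return (inner.x0 >= outer.x0 and inner.y0 >= outer.y0
            and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)


def overlaps(a, b):
    """True if the two boxes share any area (the 'collision' test)."""
    return a.x0 < b.x1 and a.x1 > b.x0 and a.y0 < b.y1 and a.y1 > b.y0


def texts_in_bbox(items, bbox, test=inside):
    """Return the text of every (text, box) item that passes `test`."""
    return [text for text, box in items if test(box, bbox)]


# Invented sample data: three pieces of text near one row, plus a footer.
items = [
    ("Name", Box(50, 700, 90, 710)),
    ("Alice", Box(120, 700, 160, 710)),
    ("Footer", Box(50, 20, 120, 30)),
    ("Tall", Box(200, 695, 240, 715)),
]
row = Box(50, 700, 550, 710)  # x, y, x+500, y+10

contained = texts_in_bbox(items, row)
colliding = texts_in_bbox(items, row, test=overlaps)
```

The two tests differ on "Tall", which sticks out above and below the row: the containment test skips it, the collision test keeps it. A Perl version would just need the coordinates of each text object to do the same comparison.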
Sorry if this doesn't help.
Re^4: How to Extract PDF tables using Perl
by Anonymous Monk on May 25, 2016 at 06:04 UTC

Re^4: How to Extract PDF tables using Perl
by perlPsycho (Initiate) on May 27, 2016 at 06:53 UTC