I use Perl, but when I tried to do something similar, I found that python3 + pdfquery was easier to work with and handled the column parsing...
In a nutshell: loop over each page in the PDF and search for a matching string. If it's found, get its x,y coordinates and plug them into in_bbox(x,y,x2,y2) to scrape whatever other text sits inside that bounding box. Because I wanted a "row", my bbox was x,y,x+500,y+10 (grid origin at the bottom left, I think).
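Roughly what that looks like, as a sketch rather than my exact script. The function names (row_selector, scrape_rows) and the page_numbers argument are made up for illustration; the pdfquery calls (PDFQuery, load, pq, and the :contains / :in_bbox selectors) are the ones from its README as far as I remember them, so double-check against your installed version. I'm also assuming page numbers passed to load() are zero-based.

```python
def row_selector(x, y, width=500, height=10):
    """Build a selector for a thin 'row' box whose bottom-left corner is (x, y)."""
    return 'LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (
        x, y, x + width, y + height)


def scrape_rows(pdf_path, needle, page_numbers):
    """Return the text found in the row box next to each match of `needle`."""
    import pdfquery  # third-party: pip install pdfquery

    pdf = pdfquery.PDFQuery(pdf_path)
    rows = []
    for page_number in page_numbers:  # assumed zero-based
        pdf.load(page_number)         # parse just this page
        matches = pdf.pq('LTTextLineHorizontal:contains("%s")' % needle)
        for match in matches.items():
            # Bottom-left corner of the matched text line.
            x = float(match.attr('x0'))
            y = float(match.attr('y0'))
            rows.append(pdf.pq(row_selector(x, y)).text())
    return rows
```

You would call it as scrape_rows("some.pdf", "Total", range(3)), where the file name and search string are placeholders. One thing to watch: as I understand it, in_bbox only matches elements that fit entirely inside the box, so if the text lines are taller than 10 units the height needs to grow.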
I don't know how it really works, but I was able to copy/paste enough bits to get what I needed.
Maybe pdf::api or something has a feature similar to in_bbox? Is it maybe like collision-detection logic where, given a bounding box, you find all the text thingys that collide with it and return an array of those? I'm guessing out my a##.
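If it is that kind of logic, a pure-Python sketch of the guess would look something like this. None of it is pdfquery's actual code; Box, inside, overlaps and texts_in_bbox are names I made up, and it assumes every piece of text already comes with its own bounding box (bottom-left origin).

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def inside(inner, outer):
    """True if `inner` fits entirely within `outer`."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)


def overlaps(a, b):
    """True if the two boxes share any area (the 'collision' test)."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def texts_in_bbox(items, bbox, test=inside):
    """Return the text of every (text, box) pair whose box passes `test`."""
    return [text for text, box in items if test(box, bbox)]
```

With test=inside you get the strict "fits in the box" behaviour; passing test=overlaps returns anything that merely touches the row, which is closer to collision detection proper.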
Sorry if this doesn't help.
In reply to Re^3: How to Extract PDF tables using Perl by Anonymous Monk, in thread How to Extract PDF tables using Perl by perlPsycho