I have actually had a look at those modules, but all they do is create or manipulate PDFs. For example, PDF::API2 has a method, $string = $pdf->stringify, but this just dumps the file into a string that is still in PDF format, i.e. you get a load of binary rubbish.
As for PDF::Extract ("Extracting sub PDF documents from a multi page PDF document"), the output is again PDF.
I just need the bare ASCII text that pdftotext gives, except that pdftotext has the odd random glitch that corrupts the layout of the output.
If I can't predict the layout, I can't parse it.
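For reference, a minimal sketch of driving pdftotext from Perl, assuming the Xpdf/Poppler pdftotext is on the PATH. The -layout flag asks it to preserve the physical layout, "-" as the output name sends the text to stdout, and pages are separated by form feeds. The sub names pdftotext_cmd and pdf_pages are made up for illustration, and nothing here guarantees that -layout avoids the glitch described above.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext (hypothetical helper).
# -layout keeps the physical layout; "-" writes the text to stdout.
sub pdftotext_cmd {
    my ($pdf) = @_;
    return ('pdftotext', '-layout', $pdf, '-');
}

# Run pdftotext and return one string per page.
# pdftotext ends each page with a form feed (\f).
sub pdf_pages {
    my ($pdf) = @_;
    # List-form pipe open: no shell involved, so odd file names are safe.
    open(my $fh, '-|', pdftotext_cmd($pdf))
        or die "cannot run pdftotext: $!";
    local $/;                       # slurp the whole output at once
    my $text = <$fh> // '';
    close($fh) or die "pdftotext failed (exit status $?)";
    return split /\f/, $text;
}
```

Calling pdf_pages('report.pdf') (file name hypothetical) gives a list of page strings that can then be split into lines and parsed page by page.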
| [reply] |
You could use pdftoall (not Perl) or pdf2txt (also not Perl, but shareware). For another good free utility, try pstotext (on Debian: apt-get install pstotext).
| [reply] |
I would note that the ps2ascii application, which is installed as part of the pstotext package, also handles PDF to ASCII conversion, although its manual page notes that it does not consider font encoding and cannot handle kerning particularly well. If a Perl implementation of this solution is required, the source package for this application may be enlightening.
perl -le "print unpack'N', pack'B32', '00000000000000000000001000000000'"
| [reply] |
It works on most lines, but occasionally gets confused and outputs data in a different layout from the original.
I really need it to be accurate because even the original is fiddly to deal with.
It's basically two sets of columns of variable-sized blocks of data. The data wraps from the bottom of the left-hand column to the top of the right-hand column on each page, then to the top of the left-hand column on the next page, and so on.
E.g. (short example):
name1 1,2,3,4 8,9,10
name3 1,2,3,4,5
name 1,2,34, name4 1,2,3,4
5,6,7,
But what I sometimes get is:
name1 1,2,3,4 8,9,10
name3 1,2,3,4,5
1,2,3,4
name4
name 1,2,34,
5,6,7
There's a lot more of this. Also, the separations between names, numbers, and the left and right columns are variable.
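The reading order described above (whole left-hand column, then whole right-hand column) can be sketched in Perl as follows. This is only a heuristic under stated assumptions: it takes the lines of one page whose layout came through intact, treats the widest run of character positions that are blank on every line as the gutter between the columns, and splits there. The sub names find_gutter and reading_order are made up for illustration, and the sketch does nothing to repair pages where the extracted layout is already scrambled.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Find the widest run of character positions that are blank on every line
# (ignoring leading indentation) and return the index where the
# right-hand column starts, or undef if there is no such run.
sub find_gutter {
    my @lines = @_;
    my $width = max(0, map { length $_ } @lines);
    my ($best_end, $best_len, $run_len) = (undef, 0, 0);
    for my $col (0 .. $width - 1) {
        # A position is blank if no line has a non-space character there.
        my $blank = !grep {
            $col < length($_) && substr($_, $col, 1) ne ' '
        } @lines;
        if ($blank) {
            $run_len++;
            next;
        }
        # A non-blank position closes the current run of blanks.
        if ($run_len > $best_len && $col - $run_len > 0) {
            ($best_end, $best_len) = ($col, $run_len);
        }
        $run_len = 0;
    }
    return $best_end;
}

# Return the page's lines in reading order: left column first, then right.
sub reading_order {
    my @lines = @_;
    s/\s+$// for @lines;            # trim trailing whitespace on our copy
    my $gutter = find_gutter(@lines);
    return @lines unless defined $gutter;
    my (@left, @right);
    for my $line (@lines) {
        my $l = substr($line, 0, $gutter);
        my $r = length($line) > $gutter ? substr($line, $gutter) : '';
        $l =~ s/\s+$//;
        push @left,  $l if length $l;
        push @right, $r if length $r;
    }
    return (@left, @right);
}

# Usage with the short example, right-hand column starting at position 20:
my @page = (
    sprintf('%-20s%s', 'name1 1,2,3,4', '8,9,10'),
    'name3 1,2,3,4,5',
    sprintf('%-20s%s', 'name 1,2,34,', 'name4 1,2,3,4'),
    '5,6,7,',
);
print "$_\n" for reading_order(@page);
```

On this input the four left-hand lines are printed first, followed by "8,9,10" and "name4 1,2,3,4". Because the gap between names and numbers can itself be blank on every line, the widest-run rule may pick the wrong split on some pages; passing in a known split position instead would be a safer variant where the page geometry is fixed.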