PDF decoding in Perl

Arik123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I have a PDF file which contains a filled form. Unfortunately the information (text-only) isn't plain ASCII. I need a Perl script to extract the information and process it, but I can't get anything except gibberish. I figured it's compressed somehow, so I used QPDF to make the file more human-readable.

Now there are multiple objects whose content is something like

feff05e405e805d905d8002e002e002e

which seem to be the content of the fields, in some encoding. There are also some objects that look like:

  /BaseFont /RCZMJK+TimesNewRoman
  /DescendantFonts 13 0 R
  /Encoding /Identity-H
  /Subtype /Type0
  /ToUnicode 93 0 R
  /Type /Font

while the /ToUnicode information refers to objects that look like:

93 0 obj
<<
  /Length 94 0 R
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<02A8> <05D8>
<02A9> <05D9>
<02B4> <05E4>
<02B8> <05E8>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
endstream
endobj
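The bfchar section of that stream is a lookup table from two-byte character codes to UTF-16BE values. Below is a minimal sketch of reading it, assuming the stream text is already in a string. The helper names `parse_bfchar` and `cids_to_text` are made up for illustration, `bfrange` sections are not handled, and the hex string passed to `cids_to_text` is a hypothetical page-content string, not something taken from the file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bfchar section copied from the /ToUnicode stream above.
my $cmap = <<'END_CMAP';
4 beginbfchar
<02A8> <05D8>
<02A9> <05D9>
<02B4> <05E4>
<02B8> <05E8>
endbfchar
END_CMAP

# Hypothetical helper: build a { code => characters } table from bfchar sections.
sub parse_bfchar {
    my ($text) = @_;
    my %map;
    while ($text =~ /beginbfchar(.*?)endbfchar/gs) {
        my $body = $1;
        while ($body =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            my ($code, $dest) = ($1, $2);
            # The destination is UTF-16BE and may hold more than one code unit.
            $map{ hex $code } = decode('UTF-16BE', pack('H*', $dest));
        }
    }
    return \%map;
}

# Hypothetical helper: translate a hex string of two-byte codes
# (as used with /Identity-H) through the table.
sub cids_to_text {
    my ($map, $hex) = @_;
    my $out = '';
    for my $cid ($hex =~ /([0-9A-Fa-f]{4})/g) {
        # Unknown codes become U+FFFD REPLACEMENT CHARACTER.
        $out .= $map->{ hex $cid } // "\x{FFFD}";
    }
    return $out;
}

my $map = parse_bfchar($cmap);

binmode STDOUT, ':encoding(UTF-8)';
print cids_to_text($map, '02B402B802A902A8'), "\n";
```

This only matters if the text is drawn as page content with that Type0 font. Form field values stored as strings beginning with feff do not go through the /ToUnicode table.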

I need a Perl script (or a module) that can make sense of all that (to me it looks like Turkish. Hint: I don't speak Turkish) and convert it to UTF-8 or some other encoding that makes sense.

Any help would be appreciated.


Replies are listed 'Best First'.
Re: PDF decoding in Perl by beech (Parson) on Mar 06, 2017 at 07:25 UTC
Hi, maybe you can use CAM::PDF. It comes with https://metacpan.org/pod/distribution/CAM-PDF/bin/getpdftext.pl, which extracts and prints the text from one or more PDF pages.
Re: PDF decoding in Perl by vr (Curate) on Mar 06, 2017 at 11:24 UTC
You won't solve it without consulting the PDF Reference and some rather low-level and verbose code. If you are lucky, it's indeed a PDF form, i.e. not text as page content, and not an XFA form. Try the `getFormFieldList` method, then `getFormField` to check them all or to access a field with a known name. The `V` entry in the field dictionary is its text "value", either in PDFDocEncoding (plain ASCII, for most practical purposes) or in UTF-16BE with a prepended BOM, as in your example (which is Hebrew).
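To illustrate the BOM case, here is a minimal sketch that decodes the hex string from the question. The helper name `decode_pdf_text_string` is made up for illustration, and the Latin-1 fallback is an assumption that only approximates PDFDocEncoding (the two agree for ASCII but not for every code point).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a raw PDF text string into Perl characters.
sub decode_pdf_text_string {
    my ($bytes) = @_;
    # A leading FE FF marks UTF-16BE; 'UTF-16' makes Encode read and drop the BOM.
    return decode('UTF-16', $bytes) if $bytes =~ /\A\xFE\xFF/;
    # Assumption: treat everything else as Latin-1 as a stand-in for PDFDocEncoding.
    return decode('ISO-8859-1', $bytes);
}

# The hex string from the question, converted back to raw bytes.
my $raw  = pack 'H*', 'feff05e405e805d905d8002e002e002e';
my $text = decode_pdf_text_string($raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
# List the code points to show they fall in the Hebrew block (U+0590 to U+05FF).
printf "U+%04X ", ord($_) for split //, $text;
print "\n";
```

The decoded value is four Hebrew letters (U+05E4, U+05E8, U+05D9, U+05D8) followed by three full stops.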
Re: PDF decoding in Perl by huck (Prior) on Mar 06, 2017 at 07:35 UTC
You might find this useful: UTF-8 text files with Byte Order Mark. Also https://en.wikipedia.org/wiki/Byte_order_mark. Both FEFF and 0000FEFF seem to be BOMs.
Re: PDF decoding in Perl by karlgoethebier (Abbot) on Mar 06, 2017 at 10:36 UTC
Quick shot: Image::ExifTool? «The Crux of the Biscuit is the Apostrophe» «Furthermore I consider that Donald Trump must be impeached as soon as possible»
Re^2: PDF decoding in Perl by thanos1983 (Parson) on Mar 06, 2017 at 10:47 UTC
OMG, Perl, no matter how many years pass, never stops amazing me... incredible module... I had no clue, thanks for pointing it out. Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: PDF decoding in Perl by Arik123 (Beadle) on Mar 08, 2017 at 09:42 UTC
Thanks, monk! You've been tremendously helpful! I now do something like:

use CAM::PDF;
use Encode "from_to";
my $pdf = CAM::PDF->new('myfile.pdf');
for ($pdf->getFormFieldList) {
    # Raw bytes of the field's V entry
    my $val = $pdf->getFormField($_)->{value}{value}{V}{value};
    # Values starting with the FE FF byte order mark are UTF-16BE; convert in place
    if ($val =~ /^\x{fe}\x{ff}/) { from_to($val, "UTF-16BE", "utf8") }
    print "$_ => $val\n";
}

and it works perfectly!
Re: PDF decoding in Perl by Arik123 (Beadle) on Mar 06, 2017 at 07:28 UTC
I tried CAM::PDF. It doesn't do any decompression nor decoding.
Re^2: PDF decoding in Perl by Phenomanan (Monk) on Mar 06, 2017 at 15:59 UTC
Try this script, which is part of CAM::PDF.
Re^2: PDF decoding in Perl by beech (Parson) on Mar 06, 2017 at 08:37 UTC
"I tried CAM::PDF. It doesn't do any decompression nor decoding." Hi, what does it do?

