Hi Monks!
I have a PDF file which contains a filled form. Unfortunately the information (text-only) isn't plain ASCII. I need a Perl script to extract the information and process it, but I can't get anything except gibberish. I figured the file is compressed somehow, so I ran it through QPDF to make it more human-readable.
Now there are multiple objects whose content is something like
feff05e405e805d905d8002e002e002e, which seems to be the content of the fields, in some encoding. There are also some objects that look like:
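For what it's worth, that string starts with `feff`, the UTF-16 byte-order mark, so the field value is most likely stored as UTF-16BE text (the code points `05e4 05e8 05d9 05d8` fall in the Hebrew Unicode block, followed by three ASCII periods). A minimal sketch of decoding such a hex string with the core Encode module (the sample string is taken from the question):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The hex string from the question; FEFF is the UTF-16 byte-order mark.
my $hex   = 'feff05e405e805d905d8002e002e002e';
my $bytes = pack 'H*', $hex;           # hex digits -> raw bytes
my $text  = decode('UTF-16', $bytes);  # 'UTF-16' honours and strips the BOM

# $text now holds U+05E4 U+05E8 U+05D9 U+05D8 followed by "..."
binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";
```

Note the distinction between Encode's `UTF-16` (which reads the BOM to pick the byte order and removes it) and `UTF-16BE` (which would decode the BOM as a literal U+FEFF character).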
/BaseFont /RCZMJK+TimesNewRoman /DescendantFonts 13 0 R /Encoding /Identity-H /Subtype /Type0 /ToUnicode 93 0 R /Type /Font
while the /ToUnicode information refers to objects that look like:
93 0 obj << /Length 94 0 R >> stream /CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 4 beginbfchar <02A8> <05D8> <02A9> <05D9> <02B4> <05E4> <02B8> <05E8> endbfchar endcmap CMapName currentdict /CMap defineresource pop end end endstream endobj
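That `beginbfchar` section is the heart of the ToUnicode CMap: each pair maps a 2-byte character code (CID) used in the content stream to a Unicode code point, e.g. `<02A8>` maps to U+05D8. A rough sketch of scraping those pairs with a regex and decoding a string of CIDs (this is an assumption-laden simplification: it only handles `beginbfchar`, ignores `beginbfrange`, assumes 2-byte codes, and assumes the stream has already been extracted and uncompressed; the sample CID string and the `cids_to_text` helper are mine, not from the PDF):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A ToUnicode CMap fragment like the one in object 93, assumed already
# extracted and uncompressed from the PDF.
my $cmap = '4 beginbfchar <02A8> <05D8> <02A9> <05D9> '
         . '<02B4> <05E4> <02B8> <05E8> endbfchar';

# Collect CID -> Unicode mappings from every beginbfchar...endbfchar section.
my %cid2uni;
while ($cmap =~ /beginbfchar(.*?)endbfchar/sg) {
    my $body = $1;
    while ($body =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
        $cid2uni{hex $1} = chr hex $2;
    }
}

# Decode a run of 2-byte CIDs (e.g. the bytes inside a Tj string shown with
# an /Identity-H font) into Unicode text.
sub cids_to_text {
    my ($bytes) = @_;
    return join '', map { $cid2uni{$_} // "\x{FFFD}" } unpack 'n*', $bytes;
}

my $text = cids_to_text(pack 'H*', '02b402b802a902a8');  # hypothetical sample
binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";
```

With the mappings from the question, that sample CID string comes out as the four Hebrew letters pe, resh, yod, tet. A real extractor would also need to walk the page content streams to find the `Tj`/`TJ` strings in the first place.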
I need a Perl script (or a module) that can make sense of all that (to me it looks like Turkish. Hint: I don't speak Turkish) and convert it to UTF-8 or some other encoding that makes sense.
Any help would be appreciated.
In reply to PDF decoding in Perl by Arik123