mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am using CAM::PDF and get the error message below. Searching the internet, I found that this is a known problem that appeared after Adobe 9, which was released on 6/25/2008. The last change to CAM::PDF was made in 10/2008, which leads me to believe that the problem was never fixed.

Has anyone come across this problem, and can you guide me on what to do? I have read previous threads and they were not helpful to me.

The PDF is not corrupted. It is from a legitimate source.

    sub convert_pdf_to_text {
        use CAM::PDF;
        use CAM::PDF::PageText;
        my $pdf = CAM::PDF->new($pdf_filename) || die "$CAM::PDF::errstr\n";
        my $y = CAM::PDF::PageText->render($pdf->getPageContentTree(1));
        print "$y\n";
    }

This is the error string from $CAM::PDF::errstr:

Invalid xref stream: could not decode objstream 68
Thanks

Replies are listed 'Best First'.
Re: Converting Text from PDF using CAM::PDF
by almut (Canon) on Jun 23, 2010 at 00:09 UTC

    The problem is that there are several versions of the PDF format (from 1.0 to 1.7).  Over the years, many extensions have been introduced, and some of the newer ones are not supported by CAM::PDF.  One of them (apparently) is compressed xref tables — the xref table is a list of byte offsets pointing to where the individual objects are stored within the file, which in older versions was always uncompressed.  This new feature is being used in the sample PDF file you linked to (which is PDF-1.6).
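
    To check whether a given file uses this feature before handing it to CAM::PDF, something like the following might do. It is only a rough sketch: it assumes the version header appears within the first 1024 bytes and that cross-reference streams and object streams can be recognized by the `/Type /XRef` and `/Type /ObjStm` dictionary entries in the raw bytes. The names `inspect_pdf_bytes` and `slurp_file` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Inspect raw PDF bytes and report the header version plus whether the
# file appears to contain cross-reference streams or object streams.
sub inspect_pdf_bytes {
    my ($bytes) = @_;
    # The header normally sits at the very start; allow some leading junk.
    my ($version) = substr($bytes, 0, 1024) =~ /%PDF-(\d+\.\d+)/;
    return {
        version     => $version,    # undef if no header was found
        xref_stream => ($bytes =~ m{/Type\s*/XRef\b}   ? 1 : 0),
        obj_stream  => ($bytes =~ m{/Type\s*/ObjStm\b} ? 1 : 0),
    };
}

# Read a whole file in binary mode.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Only runs when a filename is given on the command line.
if (@ARGV) {
    my $info = inspect_pdf_bytes(slurp_file($ARGV[0]));
    printf "version=%s xref_stream=%d obj_stream=%d\n",
        defined $info->{version} ? $info->{version} : 'unknown',
        $info->{xref_stream}, $info->{obj_stream};
}
```

    If `xref_stream` or `obj_stream` comes back as 1, the file is a likely candidate for the "Invalid xref stream" error.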

    You can often work around such problems by using another tool to change the internal format of the PDF file. qpdf is a pretty good one, which provides quite a number of options to play with.  For example, you could try:

    $ qpdf --stream-data=uncompress in.pdf out.pdf

    (and optionally re-compress it with --stream-data=compress, if size matters)

    After applying this procedure to the PDF in question, the converted file(s) could successfully be read by CAM::PDF.
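
    If you want to stay inside a Perl script, the conversion step could be driven from Perl roughly as follows. This is a sketch, not a tested recipe: it assumes qpdf is on your PATH and that CAM::PDF is installed, it treats any non-zero qpdf exit status as a failure, and the helper names `qpdf_command` and `pdf_text_via_qpdf` are invented here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the qpdf argument list used above. Passing a list to system()
# bypasses the shell, so unusual filenames need no quoting.
sub qpdf_command {
    my ($in, $out) = @_;
    return ('qpdf', '--stream-data=uncompress', $in, $out);
}

# Rewrite $in to $out with qpdf, then extract the text of every page.
sub pdf_text_via_qpdf {
    my ($in, $out) = @_;
    system(qpdf_command($in, $out)) == 0
        or die "qpdf failed with status $?\n";

    # Loaded at run time so the script still compiles without CAM::PDF.
    require CAM::PDF;
    require CAM::PDF::PageText;

    my $pdf = CAM::PDF->new($out) or die "$CAM::PDF::errstr\n";
    return join "\n",
        map { CAM::PDF::PageText->render($pdf->getPageContentTree($_)) }
        1 .. $pdf->numPages();
}

# Usage: perl script.pl in.pdf out.pdf
if (@ARGV == 2) {
    print pdf_text_via_qpdf(@ARGV), "\n";
}
```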

      Note, PDF::API2 also can't handle compressed xrefs. Ghostscript can also be used to convert the format, but it's very slow.
Re: Converting Text from PDF using CAM::PDF
by Khen1950fx (Canon) on Jun 23, 2010 at 05:32 UTC
    If you can't get CAM::PDF to work, then try xpdf. It has the pdftotext utility. I tried it on your file, and it worked:
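        #!/usr/bin/perl
        use strict;
        use warnings;

        # Run pdftotext on page 1 onward; the trailing "-" sends the text to stdout.
        open(my $fh, '-|', 'pdftotext', '-f', '1', '/root/Desktop/urfile.pdf', '-')
            or die "Cannot run pdftotext: $!";
        my $text = do { local $/; <$fh> };    # read all output, not just the first line
        close $fh;
        print "$text\n";

      Yes, I am aware of pdftotext... I was really hoping for a solution in Perl. I have always believed that Perl has modules for everything, that they work, and that you never have to go outside of it.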

      Thanks so much.

Re: Converting Text from PDF using CAM::PDF
by Khen1950fx (Canon) on Jun 22, 2010 at 20:13 UTC
    You're working too hard at it. Relax a little, and it'll work. I reworked your script, and it works with no errors. I really couldn't see any reason to use errstr, so I eliminated it.
        #!/usr/bin/perl
        use strict;
        use warnings;
        use CAM::PDF;

        convert_pdf_to_text();

        sub convert_pdf_to_text {
            use CAM::PDF::PageText;
            my $pdf_filename = '/path/to/pdf';
            my $pdf = CAM::PDF->new($pdf_filename);
            my $y = $pdf->getPageContentTree(1);
            print CAM::PDF::PageText->render($y);
        }

      This is the code that I have...

      Please run this code and tell me if it works.

      It is the same code as yours. Since I can't upload my PDF file, I searched and found another PDF file that it fails on. I don't know if it has to do with tables.

      If you run your code on this, it will not work.

          #!/usr/bin/perl
          use Data::Dumper;
          use LWP::UserAgent;

          my $pdf_filename = "/tmp/file.pdf";
          my $pdf_link = "http://investor.google.com/pdf/2010Q1_earnings_google.pdf";
          $client = LWP::UserAgent->new();
          my $capture = $client->get("$pdf_link", ":content_file" => "$pdf_filename");
          convert_pdf_to_text();

          sub convert_pdf_to_text {
              use CAM::PDF;
              use CAM::PDF::PageText;
              my $pdf = CAM::PDF->new($pdf_filename) || die "$CAM::PDF::errstr\n";
              foreach (1..($pdf->numPages())) {
                  my $x = CAM::PDF::PageText->render($pdf->getPageContentTree($_));
                  print "$x\n";
              }
          }

        As requested:

        Took the trouble to actually download the cited .pdf; saved to my \pl_test dir.

        Modified your script to use the .pdf from the local dir as above.

        W2k; perl -v: v5.8.8 built (819) for MSWin32-x86-multi-thread.

        Using PPM, d/loaded and installed from Bribes: CAM-PDF-1.52 & various prereqs.

            >perl -c F:\_wo\pl_test\pdftest.pl
            F:\_wo\pl_test\pdftest.pl syntax OK
            >perl F:\_wo\pl_test\pdftest.pl
            Invalid xref stream: could not decode objstream 68

        Looks familiar. Sorry, 5.10.1/linux not avail; hot weather cooked that box rather thoroughly.

        You're right, it doesn't work :) However, as I see it, the problem isn't with CAM::PDF but rather with the Google PDF. I think that Google is great for searching the web, but when it comes to anything else, it's not so good. I've tried the code with my own non-Google PDFs and it works. Try this, and let me know if it works or not:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use CAM::PDF;
            use LWP::UserAgent;

            my $pdf_filename = '/root/Desktop/perl.pdf';
            my $pdf_link = 'http://www.greenteapress.com/perl/perl.pdf';
            my $client = LWP::UserAgent->new();
            my $capture = $client->get("$pdf_link", ":content_file" => "$pdf_filename");
            convert_pdf_to_text();

            sub convert_pdf_to_text {
                use CAM::PDF::PageText;
                my $pdf_filename = '/root/Desktop/perl.pdf';
                my $pdf = CAM::PDF->new($pdf_filename);
                my $y = $pdf->getPageContentTree(1);
                print CAM::PDF::PageText->render($y);
            }