This is the code that I have.

Could you please run it and tell me whether it works?

It is the same code as yours. Since I can't upload my own PDF file, I searched for and found another PDF that it fails on. I don't know whether the failure has anything to do with tables.

If you run your code on this file, it will not work.

#!/usr/bin/perl
use Data::Dumper;
use LWP::UserAgent;

my $pdf_filename = "/tmp/file.pdf";
my $pdf_link = "http://investor.google.com/pdf/2010Q1_earnings_google.pdf";
$client = LWP::UserAgent->new();
my $capture = $client->get("$pdf_link", ":content_file" => "$pdf_filename");
convert_pdf_to_text();

sub convert_pdf_to_text {
    use CAM::PDF;
    use CAM::PDF::PageText;
    my $pdf = CAM::PDF->new($pdf_filename) || die "$CAM::PDF::errstr\n";
    foreach (1..($pdf->numPages())) {
        my $x = CAM::PDF::PageText->render($pdf->getPageContentTree($_));
        print "$x\n";
    }
}
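
The script above never checks whether the download succeeded, so a failed or partial fetch would reach CAM::PDF as a bad file. As a minimal sketch, assuming the file has already been saved to disk, a check like the following could rule that out before parsing. The helper name looks_like_pdf is hypothetical and is not part of CAM::PDF or LWP; it uses core Perl only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF or LWP): returns 1 if the
# file exists, is non-empty, and begins with the "%PDF-" header, else 0.
sub looks_like_pdf {
    my ($path) = @_;
    return 0 unless -f $path && -s _;
    open my $fh, '<:raw', $path or return 0;
    my $got = read($fh, my $head, 5);
    close $fh;
    return (defined $got && $got == 5 && $head eq '%PDF-') ? 1 : 0;
}

# Usage: check the download before handing the file to CAM::PDF.
my $pdf_filename = '/tmp/file.pdf';
print looks_like_pdf($pdf_filename)
    ? "$pdf_filename looks like a PDF\n"
    : "$pdf_filename is missing, empty, or not a PDF\n";
```

This only confirms that the file starts like a PDF. It says nothing about whether CAM::PDF can parse the rest of it.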

Replies are listed 'Best First'.
Re^3: Converting Text from PDF using CAM::PDF
by ww (Archbishop) on Jun 22, 2010 at 22:35 UTC
    As requested:

    Took the trouble to actually download the cited .pdf; saved to my \pl_test dir.

    Modified your script to use the .pdf from the local dir as above.

    W2k; perl -v: v5.8.8 built (819) for MSWin32-x86-multi-thread.

    Using PPM, downloaded and installed CAM-PDF-1.52 and various prereqs from Bribes.

    >perl -c F:\_wo\pl_test\pdftest.pl
    F:\_wo\pl_test\pdftest.pl syntax OK

    >perl F:\_wo\pl_test\pdftest.pl
    Invalid xref stream: could not decode objstream 68

    Looks familiar. Sorry, 5.10.1/linux not avail; hot weather cooked that box rather thoroughly.
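
The error message above names an xref stream and an objstream. Cross-reference streams and object streams are both PDF 1.5 features, which suggests the Google file uses them. As a rough sketch, a script like the following could check a given file for those markers without CAM::PDF. The helper name pdf_stream_features is hypothetical, and the scan is only a heuristic on the raw bytes: the header version can be overridden elsewhere in the file, and the regexes do not parse the PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: scan raw PDF bytes for the header version and for
# dictionaries typed as cross-reference streams or object streams.
sub pdf_stream_features {
    my ($bytes) = @_;
    my ($version) = $bytes =~ /\A%PDF-(\d+\.\d+)/;
    return {
        version     => $version,
        xref_stream => ($bytes =~ m{/Type\s*/XRef\b}   ? 1 : 0),
        obj_stream  => ($bytes =~ m{/Type\s*/ObjStm\b} ? 1 : 0),
    };
}

# Usage: perl check_pdf.pl file.pdf   (the script name is an assumption)
my $path = shift @ARGV;
if (defined $path) {
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $info = pdf_stream_features($bytes);
    printf "version=%s xref_stream=%d obj_stream=%d\n",
        (defined $info->{version} ? $info->{version} : 'unknown'),
        $info->{xref_stream}, $info->{obj_stream};
}
```

If a file that CAM::PDF reads without trouble reports 0 for both flags while the failing file reports 1, that would point at the stream handling rather than at tables or page content.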

Re^3: Converting Text from PDF using CAM::PDF
by Khen1950fx (Canon) on Jun 22, 2010 at 23:23 UTC
    You're right, it doesn't work :). However, as I see it, the problem isn't with CAM::PDF but rather with the google pdf. I think that google is great for searching the web, but when it comes to anything else, it's not so good. I've tried the code with my own non-google PDFs and it works. Try this, and let me know whether it works:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use CAM::PDF;
    use LWP::UserAgent;

    my $pdf_filename = '/root/Desktop/perl.pdf';
    my $pdf_link = 'http://www.greenteapress.com/perl/perl.pdf';
    my $client = LWP::UserAgent->new();
    my $capture = $client->get("$pdf_link", ":content_file" => "$pdf_filename");
    convert_pdf_to_text();

    sub convert_pdf_to_text {
        use CAM::PDF::PageText;
        my $pdf_filename = '/root/Desktop/perl.pdf';
        my $pdf = CAM::PDF->new($pdf_filename);
        my $y = $pdf->getPageContentTree(1);
        print CAM::PDF::PageText->render($y);
    }
      Same procedure as in prior reply; it prints what appears to be the entire multi-page text.
      the problem isn't with CAM::PDF but rather with the google pdf

      In other words, the problem is with CAM::PDF, for example