in reply to Converting Text from PDF using CAM::PDF

You're working too hard at it. Relax a little, and it'll work. I reworked your script, and it works with no errors. I really couldn't see any reason to use errstr, so I eliminated it.
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

convert_pdf_to_text();

sub convert_pdf_to_text {
    use CAM::PDF::PageText;
    my $pdf_filename = '/path/to/pdf';
    my $pdf = CAM::PDF->new($pdf_filename);
    my $y = $pdf->getPageContentTree(1);
    print CAM::PDF::PageText->render($y);
}
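One possible reason to keep the check after all: as the original poster's script assumes, CAM::PDF->new() returns a false value when it cannot load the file, and without a check the next line dies with Perl's generic "Can't call method ... on an undefined value" instead of the library's own message in $CAM::PDF::errstr. A minimal sketch of that variant follows. The helper name page_text and the wording of the die messages are made up for illustration, not part of CAM::PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Hypothetical helper: return the text of one page, or die with a
# readable message. $CAM::PDF::errstr is the package variable the
# original poster's script already uses.
sub page_text {
    my ($pdf_filename, $page) = @_;

    my $pdf = CAM::PDF->new($pdf_filename)
        or die "Cannot load $pdf_filename: "
             . (defined $CAM::PDF::errstr ? $CAM::PDF::errstr : 'unknown error')
             . "\n";

    # Assumption: a false return here means the page has no usable content.
    my $tree = $pdf->getPageContentTree($page)
        or die "No content tree for page $page of $pdf_filename\n";

    return CAM::PDF::PageText->render($tree);
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    print page_text($ARGV[0], 1);
}
```

Run it as `perl script.pl /path/to/pdf`; with a file that CAM::PDF cannot load, the message should name the file and include the library's diagnostic.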

Re^2: Converting Text from PDF using CAM::PDF
by mr_p (Scribe) on Jun 22, 2010 at 20:42 UTC
    This is the code that I have.

    Please run it and tell me whether it works.

    It is the same code as yours. Since I can't upload my PDF file, I searched and found another PDF file that it fails on. I don't know whether it has to do with tables.

    If you run your code on this file, it will not work.

    #!/usr/bin/perl
    use Data::Dumper;
    use LWP::UserAgent;

    my $pdf_filename="/tmp/file.pdf";
    my $pdf_link = "http://investor.google.com/pdf/2010Q1_earnings_google.pdf";

    $client = LWP::UserAgent->new();
    my $capture = $client->get("$pdf_link", ":content_file" => "$pdf_filename");

    convert_pdf_to_text();

    sub convert_pdf_to_text {
        use CAM::PDF;
        use CAM::PDF::PageText;
        my $pdf = CAM::PDF->new($pdf_filename) || die "$CAM::PDF::errstr\n";
        foreach (1..($pdf->numPages())) {
            my $x=CAM::PDF::PageText->render($pdf->getPageContentTree($_));
            print "$x\n";
        }
    }
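The script above never looks at the response from get(), so a failed or partial download would only show up later as a PDF parsing error. A small sketch of checking the download first, assuming the same LWP::UserAgent call with ":content_file". The helper name fetch_to_file and the error messages are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: download $url into $file and die with the HTTP
# status line if the request did not succeed.
sub fetch_to_file {
    my ($url, $file) = @_;

    my $client   = LWP::UserAgent->new();
    my $response = $client->get($url, ':content_file' => $file);

    die "Download of $url failed: " . $response->status_line . "\n"
        unless $response->is_success;

    # -s is true only if the file exists and has a non-zero size.
    die "Download of $url produced an empty file\n"
        unless -s $file;

    return $file;
}

# Only runs when a URL and a target file are given on the command line.
if (@ARGV == 2) {
    fetch_to_file(@ARGV);
    print "Saved $ARGV[0] to $ARGV[1]\n";
}
```

Calling fetch_to_file($pdf_link, $pdf_filename) before convert_pdf_to_text() would separate "the download went wrong" from "CAM::PDF could not parse the file".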
      As requested:

      Took the trouble to actually download the cited .pdf; saved to my \pl_test dir.

      Modified your script to use the .pdf from the local dir as above.

      W2k; perl -v: v5.8.8 built (819) for MSWin32-x86-multi-thread.

      Using PPM, downloaded and installed from Bribes: CAM-PDF-1.52 and various prereqs.

      >perl -c F:\_wo\pl_test\pdftest.pl
      F:\_wo\pl_test\pdftest.pl syntax OK
      >perl F:\_wo\pl_test\pdftest.pl
      Invalid xref stream: could not decode objstream 68

      Looks familiar. Sorry, 5.10.1/linux not avail; hot weather cooked that box rather thoroughly.
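The message mentions an xref stream and an object stream, both of which were introduced with PDF 1.5. A quick way to see what a given file claims to be is to read the "%PDF-x.y" marker at the top. This is only a diagnostic sketch in core Perl: the helper name pdf_header_version is made up, and the header is only a hint (a file can declare a different version elsewhere), so it does not by itself explain the failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the version string from the "%PDF-x.y"
# header near the start of the file, or undef if no header is found.
sub pdf_header_version {
    my ($file) = @_;

    open my $fh, '<:raw', $file or die "Cannot open $file: $!\n";
    my $head = '';
    read $fh, $head, 1024;    # the header sits at the start of the file
    close $fh;

    return $head =~ /%PDF-(\d+\.\d+)/ ? $1 : undef;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $version = pdf_header_version($ARGV[0]);
    print defined $version
        ? "Header says PDF $version\n"
        : "No %PDF header found\n";
}
```

Comparing the header of a file that fails with one that works may show whether the failures line up with newer PDF versions.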

      You're right, it doesn't work :) However, as I see it, the problem isn't with CAM::PDF but rather with the google pdf. I think that google is great for searching the web, but when it comes to anything else, it's not so good. I've tried the code with my own non-google PDFs and it works. Try this, and let me know if it works or not:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use CAM::PDF;
      use LWP::UserAgent;

      my $pdf_filename = '/root/Desktop/perl.pdf';
      my $pdf_link = 'http://www.greenteapress.com/perl/perl.pdf';

      my $client = LWP::UserAgent->new();
      my $capture = $client->get("$pdf_link", ":content_file" => "$pdf_filename");

      convert_pdf_to_text();

      sub convert_pdf_to_text {
          use CAM::PDF::PageText;
          my $pdf_filename = '/root/Desktop/perl.pdf';
          my $pdf = CAM::PDF->new($pdf_filename);
          my $y = $pdf->getPageContentTree(1);
          print CAM::PDF::PageText->render($y);
      }
        Same procedure as in prior reply; it prints what appears to be the entire multi-page text.
        the problem isn't with CAM::PDF but rather with the google pdf

        In other words, the problem is with CAM::PDF, for example