Re: Converting Text from PDF using CAM::PDF

Re^2: Converting Text from PDF using CAM::PDF by mr_p (Scribe) on Jun 22, 2010 at 20:42 UTC
This is the code that I have. Could you please run it and tell me whether it works? It is the same code as yours. Since I can't upload my PDF file, I searched and found another PDF file that it fails on. I don't know whether it has to do with tables. If you run your code on this file, it will not work.

    #!/usr/bin/perl
    use Data::Dumper;
    use LWP::UserAgent;

    my $pdf_filename = "/tmp/file.pdf";
    my $pdf_link = "http://investor.google.com/pdf/2010Q1_earnings_google.pdf";

    $client = LWP::UserAgent->new();
    my $capture = $client->get("$pdf_link", ":content_file" => "$pdf_filename");

    convert_pdf_to_text();

    sub convert_pdf_to_text {
        use CAM::PDF;
        use CAM::PDF::PageText;

        my $pdf = CAM::PDF->new($pdf_filename) || die "$CAM::PDF::errstr\n";
        foreach (1..($pdf->numPages())) {
            my $x = CAM::PDF::PageText->render($pdf->getPageContentTree($_));
            print "$x\n";
        }
    }
Re^3: Converting Text from PDF using CAM::PDF by ww (Archbishop) on Jun 22, 2010 at 22:35 UTC
As requested: Took the trouble to actually download the cited .pdf; saved to my \pl_test dir. Modified your script to use the .pdf from the local dir as above. W2k; perl -v: v5.8.8 built (819) for MSWin32-x86-multi-thread. Using PPM, d/loaded and installed from Bribes: CAM-PDF-1.52 & various prereqs.

    >perl -c F:\_wo\pl_test\pdftest.pl
    F:\_wo\pl_test\pdftest.pl syntax OK

    >perl F:\_wo\pl_test\pdftest.pl
    Invalid xref stream: could not decode objstream 68

Looks familiar. Sorry, 5.10.1/linux not avail; hot weather cooked that box rather thoroughly.
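The error above names an xref stream and an object stream, both of which are features introduced in PDF 1.5. As a rough way to see whether a given file uses them before handing it to a parser, here is a minimal sketch in core Perl. The helper name `pdf_features` and its return keys are made up for illustration and are not part of CAM::PDF; it only scans the raw bytes for the uncompressed `/Type /ObjStm` and `/Type /XRef` dictionary entries, so treat the result as a hint, not a verdict.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): report the header version and
# whether the raw bytes mention object streams or cross-reference streams.
sub pdf_features {
    my ($path) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    $bytes = '' unless defined $bytes;

    # The "%PDF-x.y" header should appear near the start of the file.
    my ($version) = substr($bytes, 0, 1024) =~ /%PDF-(\d+\.\d+)/;

    return {
        version         => $version,
        has_objstm      => ($bytes =~ m{/Type\s*/ObjStm\b} ? 1 : 0),
        has_xref_stream => ($bytes =~ m{/Type\s*/XRef\b}   ? 1 : 0),
    };
}

# Usage: perl pdf_features.pl file.pdf [more.pdf ...]
for my $file (@ARGV) {
    my $f = pdf_features($file);
    printf "%s: PDF %s, object streams: %s, xref streams: %s\n",
        $file,
        (defined $f->{version} ? $f->{version} : 'unknown'),
        ($f->{has_objstm}      ? 'yes' : 'no'),
        ($f->{has_xref_stream} ? 'yes' : 'no');
}
```

Running this against both the failing file and one that parses cleanly would show whether the two differ in these features.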
Re^3: Converting Text from PDF using CAM::PDF by Khen1950fx (Canon) on Jun 22, 2010 at 23:23 UTC
You're right---it doesn't work:); however, as I see it, the problem isn't with CAM::PDF but rather with the google pdf. I think that google is great for searching the web, but when it comes to anything else, it's not so good. I've tried the code with my own non-google PDFs and it works. Try this, and let me know if it works or not:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CAM::PDF;
    use LWP::UserAgent;

    my $pdf_filename = '/root/Desktop/perl.pdf';
    my $pdf_link = 'http://www.greenteapress.com/perl/perl.pdf';

    my $client = LWP::UserAgent->new();
    my $capture = $client->get("$pdf_link", ":content_file" => "$pdf_filename");

    convert_pdf_to_text();

    sub convert_pdf_to_text {
        use CAM::PDF::PageText;

        my $pdf_filename = '/root/Desktop/perl.pdf';
        my $pdf = CAM::PDF->new($pdf_filename);
        my $y = $pdf->getPageContentTree(1);
        print CAM::PDF::PageText->render($y);
    }
Re^4: Converting Text from PDF using CAM::PDF by ww (Archbishop) on Jun 23, 2010 at 00:12 UTC
Same procedure as in prior reply; it prints what appears to be the entire multi-page text.
Re^4: Converting Text from PDF using CAM::PDF by Anonymous Monk on Jun 22, 2010 at 23:36 UTC
"the problem isn't with CAM::PDF but rather with the google pdf"

In other words, the problem is with CAM::PDF, for example