Converting Text from PDF using CAM::PDF

mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Converting Text from PDF using CAM::PDF by almut (Canon) on Jun 23, 2010 at 00:09 UTC
The problem is that there are several versions of the PDF format (from 1.0 to 1.7). Over the years, many extensions have been introduced, and some of the newer ones are not supported by CAM::PDF. One of them (apparently) is compressed xref tables. The xref table is a list of byte offsets pointing to where the individual objects are stored within the file; in older versions it was always uncompressed. This newer feature is used in the sample PDF file you linked to (which is PDF-1.6). You can often work around such problems by using another tool to change the internal format of the PDF file. qpdf is a pretty good one, and it provides quite a number of options to play with. For example, you could try: `$ qpdf --stream-data=uncompress in.pdf out.pdf` (and optionally re-compress it with `--stream-data=compress`, if size matters). After applying this procedure to the PDF in question, the converted file(s) could successfully be read by CAM::PDF.
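To see in advance whether a given file is likely to hit this problem, you can look at what the final `startxref` entry points to. In the PDF format a classic cross-reference table starts with the keyword `xref`, while a cross-reference stream starts with an object header such as `68 0 obj`. The following is only a minimal sketch using core Perl: `pdf_xref_info` is a made-up helper name, it reads the whole file into memory, and it ignores hybrid files that carry both kinds of section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): report the header version and
# whether the newest cross-reference section is a classic table or a stream.
sub pdf_xref_info {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    my $data = do { local $/; <$fh> };
    close $fh;

    # The header looks like "%PDF-1.6".
    my ($version) = $data =~ /\A%PDF-(\d+\.\d+)/;

    # The last "startxref" keyword holds the byte offset of the newest xref section.
    my $offset;
    $offset = $1 while $data =~ /startxref\s+(\d+)/g;

    my $kind = 'unknown';
    if (defined $offset && $offset < length($data)) {
        my $head = substr($data, $offset, 32);
        if    ($head =~ /\Axref\b/)            { $kind = 'table' }
        elsif ($head =~ /\A\d+\s+\d+\s+obj\b/) { $kind = 'stream' }
    }
    return { version => $version, xref => $kind };
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $info = pdf_xref_info($ARGV[0]);
    printf "PDF version %s, xref %s\n",
        (defined $info->{version} ? $info->{version} : 'unknown'),
        $info->{xref};
}
```

A result of `stream` suggests the file is a candidate for the qpdf conversion before it is handed to CAM::PDF; a result of `table` means the cause of any failure is probably somewhere else.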
Re^2: Converting Text from PDF using CAM::PDF by Anonymous Monk on Oct 24, 2011 at 00:27 UTC
Note that PDF::API2 can't handle compressed xrefs either. Ghostscript can also be used to convert the format, but it's very slow.
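If the conversion step has to run from a script rather than by hand, the qpdf call quoted above can be wrapped in a few lines of Perl. This is a sketch under assumptions: `qpdf_command`, `qpdf_status_ok` and `normalize_pdf` are names invented here, qpdf is assumed to be on the PATH, and the only option used is the one from this thread. The status check relies on qpdf's documented exit codes (0 for success, 3 for success with warnings).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for the conversion quoted earlier in the thread.
sub qpdf_command {
    my ($in, $out) = @_;
    return ('qpdf', '--stream-data=uncompress', $in, $out);
}

# Interpret the wait status that system() leaves in $?.
# Exit code 0 means success; 3 means qpdf wrote the output but had warnings.
sub qpdf_status_ok {
    my ($wait_status) = @_;
    return 0 if $wait_status == -1;    # program could not be started
    return 0 if $wait_status & 127;    # killed by a signal
    my $exit = $wait_status >> 8;
    return ($exit == 0 || $exit == 3) ? 1 : 0;
}

# Hypothetical wrapper: write a converted copy of $in to $out, die on failure.
sub normalize_pdf {
    my ($in, $out) = @_;
    my @cmd = qpdf_command($in, $out);
    system(@cmd);                      # list form, so no shell quoting issues
    qpdf_status_ok($?)
        or die "qpdf failed (wait status $?) for $in\n";
    return $out;
}

# Only runs when both file names are given on the command line.
if (@ARGV == 2) {
    my $converted = normalize_pdf(@ARGV);
    print "wrote $converted\n";
}
```

The converted copy can then be passed to `CAM::PDF->new` in place of the original file.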
Re: Converting Text from PDF using CAM::PDF by Khen1950fx (Canon) on Jun 23, 2010 at 05:32 UTC
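If you can't get CAM::PDF to work, then try xpdf. It includes the pdftotext utility. I tried it on your file, and it worked:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Convert from page 1 onward; the trailing "-" sends the text to STDOUT.
    open(my $fh, '-|', 'pdftotext -f 1 /root/Desktop/urfile.pdf -')
        or die "Cannot run pdftotext: $!";
    my @lines = <$fh>;    # list context reads all of the output, not just the first line
    print @lines;
    close $fh;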
Re^2: Converting Text from PDF using CAM::PDF by mr_p (Scribe) on Jun 23, 2010 at 14:50 UTC
Yes, I am aware of pdftotext... I was really hoping for something in Perl. I have always believed that Perl has modules for everything, that they work, and that you never have to go outside of it. Thanks so much.
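The two approaches do not have to be exclusive: a script can try the pure-Perl route first and only shell out when that fails. The sketch below is one possible arrangement, not something posted in this thread. `first_successful`, `via_cam_pdf` and `via_pdftotext` are invented names, the CAM::PDF calls are the same ones used elsewhere in this thread, and pdftotext is assumed to be on the PATH.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Try each [ name, coderef ] pair in turn and return the first defined result
# together with the name of the extractor that produced it.
sub first_successful {
    my ($file, @extractors) = @_;
    my @errors;
    for my $entry (@extractors) {
        my ($name, $code) = @$entry;
        my $text = eval { $code->($file) };
        return ($text, $name) if defined $text;
        my $err = $@ || 'returned nothing';
        chomp $err;
        push @errors, "$name: $err";
    }
    die "All extractors failed:\n  " . join("\n  ", @errors) . "\n";
}

# Pure-Perl route; the modules are loaded lazily so the script still starts
# on machines where CAM::PDF is not installed.
sub via_cam_pdf {
    my ($file) = @_;
    no warnings 'once';
    require CAM::PDF;
    require CAM::PDF::PageText;
    my $pdf = CAM::PDF->new($file) or die "$CAM::PDF::errstr\n";
    return join "\n",
        map { CAM::PDF::PageText->render($pdf->getPageContentTree($_)) }
        1 .. $pdf->numPages();
}

# External route; list-form pipe open avoids the shell.
sub via_pdftotext {
    my ($file) = @_;
    open my $fh, '-|', 'pdftotext', $file, '-'
        or die "Cannot run pdftotext: $!\n";
    my $text = do { local $/; <$fh> };
    close $fh or die "pdftotext exited with status $?\n";
    return $text;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my ($text, $used) = first_successful(
        $ARGV[0],
        [ 'CAM::PDF'  => \&via_cam_pdf ],
        [ 'pdftotext' => \&via_pdftotext ],
    );
    print STDERR "extracted with $used\n";
    print $text;
}
```

Because each extractor runs inside `eval`, a parse failure such as the xref error reported in this thread simply moves on to the next entry instead of ending the script.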
Re: Converting Text from PDF using CAM::PDF by Khen1950fx (Canon) on Jun 22, 2010 at 20:13 UTC
You're working too hard at it. Relax a little, and it'll work. I reworked your script, and it works with no errors. I really couldn't see any reason to use errstr, so I eliminated it.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CAM::PDF;
    use CAM::PDF::PageText;

    convert_pdf_to_text();

    # Print the text of page 1 only.
    sub convert_pdf_to_text {
        my $pdf_filename = '/path/to/pdf';
        my $pdf = CAM::PDF->new($pdf_filename);
        my $y   = $pdf->getPageContentTree(1);
        print CAM::PDF::PageText->render($y);
    }
Re^2: Converting Text from PDF using CAM::PDF by mr_p (Scribe) on Jun 22, 2010 at 20:42 UTC
This is the code that I have. Could you please run it and tell me if it works? It is the same code as yours. Since I can't upload my PDF file, I searched and found another PDF file that it fails on. I don't know if it has to do with tables. If you run your code on this file, it will not work.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    use LWP::UserAgent;
    use CAM::PDF;
    use CAM::PDF::PageText;

    my $pdf_filename = "/tmp/file.pdf";
    my $pdf_link     = "http://investor.google.com/pdf/2010Q1_earnings_google.pdf";

    # Download the PDF straight into $pdf_filename.
    my $client  = LWP::UserAgent->new();
    my $capture = $client->get($pdf_link, ":content_file" => $pdf_filename);

    convert_pdf_to_text();

    # Print the text of every page; die with CAM::PDF's error if the file cannot be parsed.
    sub convert_pdf_to_text {
        my $pdf = CAM::PDF->new($pdf_filename) || die "$CAM::PDF::errstr\n";
        foreach (1 .. $pdf->numPages()) {
            my $x = CAM::PDF::PageText->render($pdf->getPageContentTree($_));
            print "$x\n";
        }
    }
Re^3: Converting Text from PDF using CAM::PDF by ww (Archbishop) on Jun 22, 2010 at 22:35 UTC
As requested: Took the trouble to actually download the cited .pdf; saved to my \pl_test dir. Modified your script to use the .pdf from the local dir as above. W2k; perl -v: v5.8.8 built (819) for MSWin32-x86-multi-thread. Using PPM, d/loaded and installed from Bribes: CAM-PDF-1.52 & various prereqs.

    >perl -c F:\_wo\pl_test\pdftest.pl
    F:\_wo\pl_test\pdftest.pl syntax OK

    >perl F:\_wo\pl_test\pdftest.pl
    Invalid xref stream: could not decode objstream 68

Looks familiar. Sorry, 5.10.1/linux not avail; hot weather cooked that box rather thoroughly.
Re^3: Converting Text from PDF using CAM::PDF by Khen1950fx (Canon) on Jun 22, 2010 at 23:23 UTC
You're right, it doesn't work :) However, as I see it, the problem isn't with CAM::PDF but rather with the Google PDF. I think that Google is great for searching the web, but when it comes to anything else, it's not so good. I've tried the code with my own non-Google PDFs and it works. Try this, and let me know if it works or not:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CAM::PDF;
    use CAM::PDF::PageText;
    use LWP::UserAgent;

    my $pdf_filename = '/root/Desktop/perl.pdf';
    my $pdf_link     = 'http://www.greenteapress.com/perl/perl.pdf';

    # Download the PDF straight into $pdf_filename.
    my $client  = LWP::UserAgent->new();
    my $capture = $client->get($pdf_link, ':content_file' => $pdf_filename);

    convert_pdf_to_text();

    # Print the text of page 1 only.
    sub convert_pdf_to_text {
        my $pdf = CAM::PDF->new($pdf_filename);
        my $y   = $pdf->getPageContentTree(1);
        print CAM::PDF::PageText->render($y);
    }
Re^4: Converting Text from PDF using CAM::PDF by ww (Archbishop) on Jun 23, 2010 at 00:12 UTC
Re^4: Converting Text from PDF using CAM::PDF by Anonymous Monk on Jun 22, 2010 at 23:36 UTC