Extracting the data from a PDF

Mr. Muskrat has asked for the wisdom of the Perl Monks concerning the following question:

My Fellow Monks,
After using Google and Super Search and turning up nothing of value, I started looking through all of the documentation for Text::PDF and PDF::API2. Try as I might, I cannot find a way to use Perl to extract the words from a PDF document as plain text or HTML. Heck, all I see is how to make new PDF files or change existing ones...

I did however find pdf2html. It will be a breeze to run it and then extract the data from the html that it produces.
<whine>But! I don't want to...<\whine>
I know that many of you have worked with PDF files (otherwise there would not be so many hits when doing a Super Search). <begging>Please give me guidance!<\begging>
Can I indeed use one of the modules mentioned to do this?

<hounding>Can I? Can I? Huh? Huh?
Pretty please with sugar on top?
I promise I'll be good (at least until after Christmas).<\hounding>
:^)

Re: Extracting the data from a PDF by Ovid (Cardinal) on Aug 29, 2002 at 21:55 UTC
> I did however find pdf2html. It will be a breeze to run it and then extract the data from the html that it produces. <whine>But! I don't want to...<\whine>

Why not? Assuming you have the HTML in a single scalar:

    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( \$html );

    while ( my $token = $p->get_token ) {
        next unless $token->is_text;
        print $token->return_text;
    }

There, that wasn't so hard, was it? (Note that that example was pretty much cut-n-pasted directly from the POD.)

Oh, and you have the slash backwards on that final whine :)

Cheers,
Ovid
Re: (nrd) Extracting the data from a PDF by newrisedesigns (Curate) on Aug 29, 2002 at 21:54 UTC
You could have Perl make system calls to pdf2html, then use HTML::Parser to get all the data out. Just my two cents.

John J Reiser
newrisedesigns.com
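A minimal sketch of the HTML::Parser route suggested here, assuming the converter's HTML output has already been read into a string. The sub name html_to_text is made up for illustration; the api_version 3 handler interface and ignore_elements are part of HTML::Parser.

```perl
use strict;
use warnings;
use HTML::Parser;

# html_to_text is a hypothetical helper name, not part of HTML::Parser.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands each text chunk to the callback with entities decoded.
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    # Skip script and style elements, tags and content alike.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;
    return $text;
}

print html_to_text('<p>Tom &amp; Jerry</p>'), "\n";
```

Because the handler receives decoded text, no separate entity clean-up pass is needed with this approach.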
Re: Extracting the data from a PDF by Mr. Muskrat (Canon) on Aug 29, 2002 at 22:07 UTC
Thank you for a quick response. I really just want plain text, but HTML will do as I can strip the HTML markup out. I was hoping to stay away from system calls and external programs as much as possible. I am hoping that someone will be able to provide me with the key piece of information that I am missing. In the meantime, I will start coding a version that makes a system call to pdf2html.

And Ovid, you will notice that they are all backslashes... ;)
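If a system call is unavoidable, one way to keep it manageable is to use the list form of system and check the exit status. The following is only a sketch: run_converter is a hypothetical helper, and the commented-out invocation assumes a pdftohtml binary on the PATH and a file named report.pdf.

```perl
use strict;
use warnings;

# run_converter is a hypothetical helper; it runs any command given as a list.
sub run_converter {
    my @cmd = @_;

    # The list form bypasses the shell, so file names with spaces need no quoting.
    my $status = system(@cmd);
    return 1 if $status == 0;

    if ( $? == -1 ) {
        warn "could not start $cmd[0]: $!\n";
    }
    else {
        warn "$cmd[0] exited with status ", $? >> 8, "\n";
    }
    return 0;
}

# Assumed invocation (binary name and file name are placeholders):
# run_converter( 'pdftohtml', '-noframes', 'report.pdf' )
#     or die "conversion failed\n";
```

Returning a boolean lets the caller decide whether a failed conversion is fatal or whether to move on to the next file.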
Re: Extracting the data from a PDF by Mr. Muskrat (Canon) on Aug 30, 2002 at 22:14 UTC
Okay, it's not great but it is a good start. It works for some of the PDF files that I have but not all of them. If the PDF file contains columns of data, the output from pdf2html is terrible. As always, YMMV.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser::Simple;

    my $file     = $ARGV[0];          # base file name (no extension)
    my $pdf      = "$file.pdf";       # the pdf file to fix
    my $txt      = "$file.txt";       # the file to export as text
    my $pdf2html = "pdftohtml.exe";   # the executable to create the html
    my $html     = "$file.html";      # the html file to create

    system("$pdf2html -noframes $pdf");   # create the html file

    my $p = HTML::TokeParser::Simple->new($html);   # create the html parser

    open(TXT, ">", $txt) || die "Cannot open $txt: $!";   # create the text file

    while (my $token = $p->get_token) {
        # print TXT "\n" if ($token->is_start_tag('br'));   # may be needed for some files
        next if ! $token->is_text;   # skip to next token if it's not text
        my $text = $token->return_text;
        $text =~ s/&amp;/&/g;        # add any html filters here
        print TXT $text;
    }

    close(TXT);
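Rather than adding one substitution per entity at the "html filters" step, the raw text could be passed through decode_entities from HTML::Entities. This is a sketch of that alternative, not the script's original approach; clean_text is a made-up helper name, and turning non-breaking spaces into ordinary spaces is an assumption about what is wanted in the text output.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# clean_text is a hypothetical helper name.
sub clean_text {
    my ($raw) = @_;

    # Called in scalar context, decode_entities returns a decoded copy
    # and leaves its argument unchanged.
    my $text = decode_entities($raw);

    # &nbsp; decodes to U+00A0; map it to a plain space (an assumption).
    $text =~ s/\x{A0}/ /g;
    return $text;
}

print clean_text('Fish &amp; Chips'), "\n";
```

In the loop above, this would replace the single `&amp;` substitution with a call such as `print TXT clean_text($text);`.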