Mr. Muskrat has asked for the wisdom of the Perl Monks concerning the following question:

My Fellow Monks,
After using Google and Super Search and turning up nothing of value, I started looking through all of the documentation for Text::PDF and PDF::API2. Try as I might, I cannot find a way to use Perl to extract the words from a PDF document as plain text or HTML. Heck, all I see is how to make new PDF files or change existing ones...

I did however find pdf2html. It will be a breeze to run it and then extract the data from the html that it produces.
<whine>But! I don't want to...<\whine>
I know that many of you have worked with PDF files (otherwise there would not be so many hits when doing a Super Search). <begging>Please give me guidance!<\begging>
Can I indeed use one of the modules mentioned to do this?

<hounding>Can I? Can I? Huh? Huh?
Pretty please with sugar on top?
I promise I'll be good (at least until after Christmas).<\hounding>
:^)

Re: Extracting the data from a PDF
by Ovid (Cardinal) on Aug 29, 2002 at 21:55 UTC
    I did however find pdf2html. It will be a breeze to run it and then extract the data from the html that it produces.

    <whine>But! I don't want to...<\whine>

    Why not? Assuming you have the HTML in a single scalar:

    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new( \$html );
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;
        print $token->return_text;
    }

    There, that wasn't so hard, was it? (note that that example was pretty much cut-n-pasted directly from the POD)

    Oh, and you have the slash backwards on that final whine :)

    Cheers,
    Ovid
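
Ovid's snippet assumes the HTML is already in a single scalar. A minimal sketch of one way to get it there, using only core Perl, is below. The helper name slurp_file is hypothetical, and the file it reads is whatever the converter wrote.

```perl
use strict;
use warnings;

# Read an entire file into one scalar ("slurp" it).
# slurp_file is a hypothetical helper name, not part of any module.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # no record separator: read to end of file
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

With that in place, something like my $html = slurp_file("report.html"); (the file name is only an example) gives the scalar that can be passed by reference to HTML::TokeParser::Simple->new( \$html ).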

Re: (nrd) Extracting the data from a PDF
by newrisedesigns (Curate) on Aug 29, 2002 at 21:54 UTC

    You could have Perl make system calls to pdf2html, then use HTML::Parser to get all the data out.

    Just my two cents.

    John J Reiser
    newrisedesigns.com
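
For the system-call route suggested above, here is a rough sketch of running an external converter and checking whether it succeeded. It uses only core Perl. The helper name run_command is hypothetical, and the commented usage assumes a pdftohtml executable is on the PATH.

```perl
use strict;
use warnings;

# Run an external command given as a list and return its exit code.
# With more than one list element, Perl runs the program directly rather
# than through a shell, so odd characters in file names are passed as-is.
# Returns -1 if the program could not be started or was killed by a signal.
# run_command is a hypothetical helper name.
sub run_command {
    my @cmd = @_;
    my $status = system(@cmd);
    return -1 if $status == -1;    # could not start the program
    return -1 if $status & 127;    # terminated by a signal
    return $status >> 8;           # the program's own exit code
}

# Hypothetical usage, assuming pdftohtml is installed:
# my $code = run_command('pdftohtml', '-noframes', 'report.pdf');
# die "pdftohtml failed (exit code $code)\n" if $code != 0;
```

Checking the return value matters here because the HTML parsing step has nothing to read if the conversion silently failed.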

Re: Extracting the data from a PDF
by Mr. Muskrat (Canon) on Aug 29, 2002 at 22:07 UTC

    Thank you for a quick response.

    I really just want plain text, but HTML will do, as I can strip the HTML markup out. I was hoping to stay away from system calls and external programs as much as possible. I am hoping that someone will be able to provide me with the key piece of information that I am missing. In the meantime, I will start coding a version that makes a system call to pdf2html.

    And Ovid, you will notice that they are all backslashes... ;)

Re: Extracting the data from a PDF
by Mr. Muskrat (Canon) on Aug 30, 2002 at 22:14 UTC

    Okay, it's not great but it is a good start. It works for some of the PDF files that I have but not all of them. If the PDF file contains columns of data, the output from pdf2html is terrible.

    As always YMMV.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser::Simple;

    my $file = $ARGV[0];              # base file name (no extension)
    die "Usage: $0 basename\n" unless defined $file;
    my $pdf  = "$file.pdf";           # the pdf file to fix
    my $txt  = "$file.txt";           # the file to export as text
    my $pdf2html = "pdftohtml.exe";   # the executable to create the html
    my $html = "$file.html";          # the html file to create

    # create the html file (list form, so the shell never sees the file name)
    system($pdf2html, "-noframes", $pdf) == 0
        or die "Cannot run $pdf2html on $pdf (status $?)\n";

    my $p = HTML::TokeParser::Simple->new($html)    # create the html parser
        or die "Cannot open $html: $!";
    open(my $out, ">", $txt) || die "Cannot open $txt: $!";   # create the text file
    while (my $token = $p->get_token) {
        # print $out "\n" if ($token->is_start_tag('br')); # may be needed for some files
        next if ! $token->is_text;    # skip to next token if it's not text
        my $text = $token->return_text;
        $text =~ s/&amp;/&/g;         # add any html filters here
        print $out $text;
    }
    close($out);
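
The "add any html filters here" step handles only &amp;. Below is a small sketch of a slightly broader filter that decodes a handful of named entities plus numeric ones in a single pass. The names decode_basic_entities, _decode_one and %entity are hypothetical, and the entity table is deliberately tiny. The decode_entities function from HTML::Entities covers far more entities and is probably the better choice if that module is installed.

```perl
use strict;
use warnings;

# A deliberately small table of named entities (an assumption; extend as needed).
my %entity = (
    amp  => '&',
    lt   => '<',
    gt   => '>',
    quot => '"',
    nbsp => ' ',
);

# Decode one entity body such as "amp", "#65" or "#x42".
# Unknown names are put back unchanged.
sub _decode_one {
    my ($name) = @_;
    return chr($1)     if $name =~ /^#(\d+)$/;
    return chr(hex $1) if $name =~ /^#[xX]([0-9a-fA-F]+)$/;
    return exists $entity{$name} ? $entity{$name} : "&$name;";
}

# Decode entities in one left-to-right pass, so the result of decoding
# "&amp;" is never scanned again ("&amp;lt;" becomes "&lt;", not "<").
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s/&(#\d+|#[xX][0-9a-fA-F]+|\w+);/_decode_one($1)/ge;
    return $text;
}
```

In the script above, the s/&amp;/&/g line could then be replaced with $text = decode_basic_entities($text);.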