in reply to Re: Extract text from PDF
in thread Extract text from PDF

Hi, I tried the following cases:

use PDF::Extract;
$pdf=new PDF::Extract;
$pdf->servePDFExtract( PDFDoc=>"c:/Docs/my.pdf", PDFPages=>"1-3 31-36" );

use PDF::Extract;
$pdf = new PDF::Extract( PDFDoc=>'C:/my.pdf' );
$pdf->getPDFExtract( PDFPages=>$PDFPages );
print "Content-Type text/plain\n\n<xmp>", $pdf->getVars("PDFExtract");
print $pdf->getVars("PDFError");

or

# Extract and save, in the current directory, all the pages in a pdf document
use PDF::Extract;
$pdf=new PDF::Extract( PDFDoc=>"test.pdf");
$i=1;
$i++ while ( $pdf->savePDFExtract( PDFPages=>$i ) );


However, in all of the above cases I get the same output that I gave in my problem statement.
I am simply unable to retrieve the PDF contents in text form. Thank you for bearing with me and for the valuable advice. But since I am working on a plugin that needs to convert PDF to HTML using a Perl script, I hope to get some more help from a great Perl expert like you. Please let me know how I can get the output in HTML form. Thank you!

Re: (3) Extract text from PDF
by Roger (Parson) on Nov 29, 2003 at 12:02 UTC
    You can save a PDF in text mode or binary mode. Check which mode the PDF was originally saved in. A text-mode PDF is a lot easier to work with than a binary-mode one. See if you can resave the PDFs in text mode, then try again.
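One hedged reading of "text mode" is a PDF whose content streams carry no compression filter such as /FlateDecode, so the text operators are visible in the raw bytes. The sketch below assumes that interpretation. It uses only core Perl, `stream_filters` is a hypothetical helper name, and scanning the raw bytes is a rough heuristic rather than a real PDF parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the filter names that follow /Filter keys
# in the raw bytes of a PDF. Heuristic only; it does not parse objects.
sub stream_filters {
    my ($bytes) = @_;
    my %count;
    # A filter is either one name (/Filter /FlateDecode)
    # or an array of names (/Filter [/ASCII85Decode /FlateDecode]).
    while ($bytes =~ m{/Filter\s*(?:\[([^\]]*)\]|/(\w+))}g) {
        my $names = defined $1 ? $1 : "/$2";
        $count{$_}++ for $names =~ m{/(\w+)}g;
    }
    return \%count;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    my $filters = stream_filters($bytes);
    if (%$filters) {
        print "$_: $filters->{$_}\n" for sort keys %$filters;
    } else {
        print "No stream filters found; content streams may be uncompressed.\n";
    }
}
```

If the report lists FlateDecode, the page text is compressed inside the file and will not show up as readable strings in a text editor.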

Re: Re: (2) Extract text from PDF
by Corion (Patriarch) on Nov 30, 2003 at 01:10 UTC

    The code you posted is taken verbatim from the PDF::Extract synopsis. I don't know what you think the snippets should do, but I tried the following program on the ECMA ECMAScript 1.3 standard available from mozilla.org, and it did exactly what the documentation promised: it created a file E262-31..3.pdf, which I could open with Acrobat Reader. The newly created document started with page one of the ECMA standard 262, with the words ECMAScript Language Specification, and ended with page 3, after the words Steve Leach.

    #!/usr/bin/perl -w
    use strict;
    use PDF::Extract;

    # tested on http://www.mozilla.org/js/language/E262-3.pdf
    my $filename = 'E262-3.pdf';
    my $pages = '1-3';
    my $outputname = 'E262-31..3.pdf';

    # see PDF::Extract documentation
    my $pdf = PDF::Extract->new();
    print "Saving from $filename pages $pages to $outputname";
    $pdf->savePDFExtract( PDFDoc => $filename, PDFPages => $pages );
    print ",done.\n";

    my $error = $pdf->getVars('PDFError');
    warn $error if $error;

    if (-f $outputname) {
        print "There now exists a file '$outputname'\n";
    } else {
        print "No file '$outputname' was found. Maybe there was some error?\n";
    };

    I am not sure what different results you expected or what else you tried. You may have to reread the documentation, as none of your examples is about extracting ASCII text from PDF pages. I can't tell for certain, though, because you seem to be trying to mix HTML and PDF, which can't work.
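For the text-to-HTML goal raised in this thread, one possible route (not something PDF::Extract does, and not what was tested above) is to let an external tool pull out the text and have Perl wrap it in HTML. The sketch below assumes the Xpdf `pdftotext` program is installed and on the PATH. The helper names `escape_html` and `text_to_html` are made up for illustration, and list-form pipe opens may not work on older Windows Perls.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Wrap plain text in a minimal HTML page, preserving line breaks with <pre>.
sub text_to_html {
    my ($title, $text) = @_;
    return "<html><head><title>" . escape_html($title) . "</title></head>\n"
         . "<body><pre>" . escape_html($text) . "</pre></body></html>\n";
}

# Only runs when a PDF file name is given on the command line.
if (@ARGV) {
    my $file = shift @ARGV;
    # Assumption: pdftotext is available; "-" sends the text to STDOUT.
    open my $pipe, '-|', 'pdftotext', $file, '-'
        or die "Cannot run pdftotext: $!";
    my $text = do { local $/; <$pipe> };
    close $pipe or die "pdftotext failed on $file\n";
    print text_to_html($file, $text);
}
```

Saved under a hypothetical name such as pdf2html.pl, it could be run as `perl pdf2html.pl my.pdf > my.html`. The result is plain text in a `<pre>` block, so the layout and fonts of the PDF are not reproduced.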

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: (3) Extract text from PDF
by Anonymous Monk on Nov 29, 2003 at 11:31 UTC
    Thank you for bearing with me and for the valuable advice. But since I am working on a plugin that needs to convert PDF to HTML using a Perl script, I hope to get some more help from a great Perl expert like you. Please let me know how I can get the output in HTML form. Thank you!
    Money talks.