Extract text from PDF

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have written a Perl script that reads a PDF file and writes its contents to another file.
The code is given below:

#!/usr/bin/perl

$path_to_file1 = "adminhelp.pdf"; 
$path_to_newfile = "hello_world.out"; 


# start displaying the HTML page 
print "Content-type: text/html\n\n"; 
print "<html><head></head>\r\n"; 
print "<body>\r\n"; 

open (READFILE, "$path_to_file1") || &errorfunc("Couldn't open the file [$path_to_file1] to read."); 
print "The file [$path_to_file1] was opened for reading<br>"; 
($dev, $ino, $mode, $nlink, $uid, $gid, $rdev, $size, $atime, $mtime, $ctime, $blksize, $blocks) = stat $path_to_file1; 
print "The size of file [$path_to_file1] is $size<br>"; 

open (WRITEFILE, ">$path_to_newfile") || &errorfunc("Couldn't open the file [$path_to_newfile] to write."); 
print "The file [$path_to_newfile] was opened for writing<br>"; 

binmode READFILE;
binmode WRITEFILE;   # the copy is binary too, so avoid newline translation on output

while (read READFILE, $buf, 16384) { 
    # write each chunk to the output filehandle; the heredoc form printed to STDOUT instead
    print WRITEFILE $buf; 
} 

close (READFILE); 
close (WRITEFILE); 

print "<br><br>Finished copying from [$path_to_file1] to [$path_to_newfile]<br>"; 
print "<br><br></body></html>"; 
exit;

However, when I open the file to which I wrote the PDF contents, it reads as follows:

 %PDF-1.3

[binary section]

2 0 obj
<<
/BitsPerComponent 8
/ColorSpace/DeviceRGB
/Filter[/DCTDecode]
/Height 73
/Subtype/Image
/Type/XObject
/Length 2855
/Width 107
>>
stream

[binary section]

endstream
endobj
1 0 obj
<<
/Length 5281
/Filter [/FlateDecode]
>>
stream

[binary section]

<<
/MediaBox[0 0 612 792]
/Resources<>/ProcSet[/PDF/Text/ImageC]/Font<>>>
/Type/Page
/Contents 7 0 R
/Parent 4 0 R
>>
endobj
10 0 obj
<<
/MediaBox[0 0 612 792]
/Resources<>/ProcSet[/PDF/Text/ImageC]/Font<>>>
/Type/Page
/Contents 9 0 R
/Parent 4 0 R
>>
endobj
12 0 obj
<<
/MediaBox[0 0 612 792]
/Resources<>/ProcSet[/PDF/Text/ImageC]/Font<>>>
/Type/Page
/Contents 11 0 R
/Parent 4 0 R
>>
endobj
14 0 obj
<<
/MediaBox[0 0 612 792]
/Resources<>/ProcSet[/PDF/Text/ImageC]/Font<>>>
/Type/Page
/Contents 13 0 R
/Parent 4 0 R
>>
endobj
18 0 obj
<<
/Type/Pages/Count 1
/Parent 17 0 R
/Kids[16 0 R]
>>
endobj
16 0 obj
<<
/MediaBox[0 0 612 792]
/Resources<>/ProcSet[/PDF/Text/ImageC]/Font<>>>
/Type/Page
/Contents 15 0 R
/Parent 18 0 R
>>
endobj
32 0 obj
<<
/PageMode/UseNone
/Type/Catalog
/OpenAction[3 0 R/XYZ null null null]
/PageLabels 19 0 R
/Pages 17 0 R
>>
endobj
33 0 obj
<<
/Subject()
/CreationDate(D:20020516122721)
/Producer(Jaws PDF Creator, Word macro v2.11.29)
/Author(vivek)
/Keywords()
/Title(This help page consists of following modules)
/Creator(Microsoft Word 9.0)
>>
endobj
xref
0 34
0000000000 65535 f 
0000003034 00000 n 
0000000015 00000 n 
0000057178 00000 n 
0000057076 00000 n 
0000008389 00000 n 
0000057365 00000 n 
0000015319 00000 n 
0000057552 00000 n 
0000019184 00000 n 
0000057739 00000 n 
0000024202 00000 n 
0000057927 00000 n 
0000028903 00000 n 
0000058116 00000 n 
0000033317 00000 n 
0000058376 00000 n 
0000057014 00000 n 
0000058305 00000 n 
0000038708 00000 n 
0000038747 00000 n 
0000038927 00000 n 
0000042174 00000 n 
0000042380 00000 n 
0000042750 00000 n 
0000043107 00000 n 
0000048691 00000 n 
0000048897 00000 n 
0000049361 00000 n 
0000049731 00000 n 
0000056334 00000 n 
0000056540 00000 n 
0000058566 00000 n 
0000058691 00000 n 
trailer
<<
/Size 34
/Root 32 0 R
/Info 33 0 R
/ID[]
>>
startxref
58914
%%EOF

I cannot work out what the above output represents.
How can I convert it to plain text, that is, the text that the PDF displays?
Do I have to use some decoding or another method to get the contents in text form?
Please let me know.

Edit: BazB deleted binary sections of PDF (marked removed sections with [binary section] )

(janitored by ybiC: balanced <readmore> tags, retitled from less-than-descriptive "Help on Perl Script")


Replies are listed 'Best First'.
Re: Extract text from PDF by Corion (Patriarch) on Nov 29, 2003 at 09:47 UTC
By interesting coincidence, we just had a good number of discussions about how to extract the text from PDF. Please read this discussion first and then maybe come back with what you didn't understand there.

Extracting the plain text from a PDF file is relatively easy, but it is not done by simply printing the PDF file together with some HTML tags. Please first understand the differences between PDF and HTML before you try to do something like this.

perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
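To make the difference concrete: the dump in the question shows page content held in streams marked /FlateDecode, so the bytes between "stream" and "endstream" are zlib-compressed and copying them verbatim can only produce binary noise. The following is a rough sketch, not a real PDF parser. It assumes unencrypted streams with a single Flate filter and text drawn with literal strings via the Tj and TJ operators. The function names inflate_streams and text_from_content are made up for this illustration; Compress::Zlib is the only module used. Fonts with custom encodings or hex strings will still come out garbled, which is why a dedicated module is the better route for real documents.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib qw(uncompress);

# Return the inflated body of every stream that zlib can decompress.
# Naive on purpose: ignores /Length, filter chains, and encryption.
sub inflate_streams {
    my ($pdf_bytes) = @_;
    my @out;
    while ($pdf_bytes =~ /stream\r?\n(.*?)\r?\n?endstream/sg) {
        my $plain = uncompress($1);   # undef for non-zlib data such as DCTDecode images
        push @out, $plain if defined $plain;
    }
    return @out;
}

# Collect the literal strings shown by the Tj and TJ text operators.
sub text_from_content {
    my ($content) = @_;
    my @parts;
    # Either "[(str) -120 (ing)] TJ" or "(string) Tj"
    while ($content =~ /\[(.*?)\]\s*TJ|\(((?:\\.|[^\\()])*)\)\s*Tj/sg) {
        my ($array, $single) = ($1, $2);
        if (defined $array) {
            # Join the string pieces of a TJ array, dropping the kerning numbers
            push @parts, join('', $array =~ /\(((?:\\.|[^\\()])*)\)/sg);
        } else {
            push @parts, $single;
        }
    }
    s/\\([\\()])/$1/g for @parts;     # unescape \( \) and \\ only; octal escapes are left alone
    return join "\n", @parts;
}

# Usage: perl script.pl adminhelp.pdf
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    binmode $fh;
    my $bytes = do { local $/; <$fh> };
    print text_from_content($_), "\n" for inflate_streams($bytes);
}
```

The design choice here is to try inflating every stream and skip those that fail, rather than reading each object's dictionary; that keeps the sketch short at the cost of correctness on anything unusual.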
Re: (2) Extract text from PDF by Anonymous Monk on Nov 29, 2003 at 10:26 UTC
Hi, I tried the following cases:

use PDF::Extract;
$pdf=new PDF::Extract;
$pdf->servePDFExtract( PDFDoc=>"c:/Docs/my.pdf", PDFPages=>"1-3 31-36" );

use PDF::Extract;
$pdf = new PDF::Extract( PDFDoc=>'C:/my.pdf' );
$pdf->getPDFExtract( PDFPages=>$PDFPages );
print "Content-Type text/plain\n\n<xmp>", $pdf->getVars("PDFExtract");
print $pdf->getVars("PDFError");

or

# Extract and save, in the current directory, all the pages in a pdf document
use PDF::Extract;
$pdf=new PDF::Extract( PDFDoc=>"test.pdf");
$i=1;
$i++ while ( $pdf->savePDFExtract( PDFPages=>$i ) );

However, in all of the above cases I get the same output that I gave in my problem statement. I am just unable to retrieve the PDF contents in text form.

Thank you for bearing with me and giving me valuable advice. But since I am working on a plugin which needs a PDF to be converted to HTML using a Perl script, I hope to get some more help from a great Perl expert like you. Please let me know how I can get the output into HTML form. Thank you!
Re: (3) Extract text from PDF by Roger (Parson) on Nov 29, 2003 at 12:02 UTC
You can save a PDF in text mode or binary mode. Check the PDF to see which mode it was saved in originally. A text mode PDF is a lot easier to work with than a binary mode one. See if you can resave the PDFs in text mode and then try again.
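One way to check this without opening a PDF tool is to look at the /Filter entries in the file, since they name the encoding applied to each stream. The dump in the question shows /DCTDecode (a JPEG image) and /FlateDecode (zlib compression). The sketch below only pattern-matches the raw bytes, so it is an approximation rather than a parse of the object structure, and the function name stream_filters is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the filter names that follow /Filter anywhere in the raw PDF bytes.
sub stream_filters {
    my ($pdf_bytes) = @_;
    my %count;
    # Handles "/Filter/Name", "/Filter /Name" and "/Filter [/A /B]"
    while ($pdf_bytes =~ m{/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)}g) {
        my $spec = $1;
        $count{$_}++ for $spec =~ m{/([A-Za-z0-9]+)}g;
    }
    return \%count;
}

# Usage: perl script.pl adminhelp.pdf
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    binmode $fh;
    my $bytes = do { local $/; <$fh> };
    my $filters = stream_filters($bytes);
    printf "%-16s %d\n", $_, $filters->{$_} for sort keys %$filters;
}
```

If FlateDecode shows up, the page text is compressed and has to be inflated before any text operators become visible.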
Re: Re: (2) Extract text from PDF by Corion (Patriarch) on Nov 30, 2003 at 01:10 UTC
The code you posted is taken verbatim from the PDF::Extract synopsis. I don't know what you expect those snippets to do, but I tried the following program on the ECMA ECMAScript 1.3 standard available from mozilla.org and it did exactly what the documentation promised: it created a file E262-31..3.pdf, which I could open with Acrobat Reader. The newly created document started with page one of the ECMA standard 262, with the words ECMAScript Language Specification, and ended with page 3, after the words Steve Leach.

#!/usr/bin/perl -w
use strict;
use PDF::Extract;

# tested on http://www.mozilla.org/js/language/E262-3.pdf
my $filename = 'E262-3.pdf';
my $pages = '1-3';
my $outputname = 'E262-31..3.pdf';   # see PDF::Extract documentation

my $pdf = PDF::Extract->new();
print "Saving from $filename pages $pages to $outputname";
$pdf->savePDFExtract( PDFDoc => $filename, PDFPages => $pages );
print ",done.\n";
my $error = $pdf->getVars('PDFError');
warn $error if $error;

if (-f $outputname) {
  print "There now exists a file '$outputname'\n";
} else {
  print "No file '$outputname' was found. Maybe there was some error?\n";
};

I am not sure what different results you expected and what else you tried. Maybe you have to reread the documentation, as neither of your examples seems to be about extracting ASCII text from PDF pages. But I can't tell, as you seem to be trying to mix HTML and PDF, which can't work.
Re: (3) Extract text from PDF by Anonymous Monk on Nov 29, 2003 at 11:31 UTC
Thank you for bearing with me and giving me valuable advice. But since I am working on a plugin which needs a PDF to be converted to HTML using a Perl script, I hope to get some more help from a great Perl expert like you. Please let me know how I can get the output into HTML form. Thank you!

Money talks.