By default, pdftotext inserts a form feed character (\f, 0x0C) at each page break, so you can check every line for a form feed to detect where a new page begins:

    use strict;
    use warnings;

    my $i       = 0;    # running line counter across all pages
    my $pageNum = 1;

    # Read pdftotext's output through a pipe; the trailing "-" sends text to STDOUT
    open my $fh, "pdftotext -layout multipage.pdf - |"
        or die "Cannot run pdftotext: $!";

    print "---------- Begin Page $pageNum ----------\n";

    while ( my $line = <$fh> ) {

        # A form feed (\f, same as \xC) marks a page break
        if ( $line =~ /\f/ ) {
            print "\n---------- End Page $pageNum ----------\n";
            $pageNum++;
            print "---------- Begin Page $pageNum ----------\n";
        }

        $i++;
        print "\n<div class=\"line\"><div>$i</div>$line</div>";
    }

    close $fh;

Another option that may serve you is CAM::PDF:

    use strict;
    use warnings;
    use CAM::PDF;

    my $pdf = CAM::PDF->new('multipage.pdf')
        or die "Could not open multipage.pdf\n";

    for my $pageNumber ( 1 .. $pdf->numPages() ) {
        my $pageText = $pdf->getPageText($pageNumber);

        # Lines 10-15 are indices 9-14; drop undef slots if the page is shorter
        my @certainLines = grep { defined } ( split /\n/, $pageText )[ 9 .. 14 ];

        print "---------- Lines 10 - 15 on Page $pageNumber ----------\n";
        print join( "\n", @certainLines ), "\n";
        print "---------- End Page $pageNumber ----------\n";
    }

The above shows how to grab a range of text lines (here, lines 10 through 15) from each converted PDF page. You may find, however, that pdftotext does a better job of rendering the text.
Hope this helps!