knbknb has asked for the wisdom of the Perl Monks concerning the following question:

I want to put hyperlinks on some strings that are in a complex PDF page. I don't have access to the source document for the PDF file.

How can I search for text strings on a page of an existing PDF file? I also need the position of each text label so that I can place the hyperlink. Please see the code below.

Is there a method to extract all text from a single PDF page and iterate through it? Or should I use the Data::Dumper structure? I know from grepping through the Data::Dumper output that it contains what I need, but the structure is too complex. There must be a better way.

#!/usr/bin/perl -w
use CAM::PDF;
use Getopt::Long;
use Pod::Usage;
use English qw(-no_match_vars);
use PDF::API2;
use Data::Dumper;
use strict;

our $VERSION = '0.02';

my %opts = (
    infile  => undef,
    pagenum => undef,
    verbose => 0,
    help    => 0,
    version => 0,
);

Getopt::Long::Configure('bundling');
GetOptions(
    'i|infile=s'  => \$opts{infile},
    'p|pagenum=s' => \$opts{pagenum},
    'v|verbose'   => \$opts{verbose},
    'h|help'      => \$opts{help},
    'V|version'   => \$opts{version},
) or pod2usage(1);

if ($opts{help}) {
    pod2usage(-exitstatus => 0, -verbose => 2);
}
if ($opts{version}) {
    print "v$VERSION\n";
    exit 0;
}
unless ($opts{infile}) {
    print "Missing option -i (infile) \n";
    die;
}
$opts{pagenum} ||= 1;

# get contents of e.g. page #27
# parse strings:
my $infile = $opts{infile};
my $campdf = CAM::PDF->new($infile) or die "Can't open infile: $!\n";
my $numpages = $campdf->numPages();
print "Document $infile has $numpages pages.\n";

#rangeToArray ($pkg_or_doc, $min, $max, @range_parts) = @_;
my @range = CAM::PDF->rangeToArray(1, $numpages, "$opts{pagenum}");
my $range = join ", ", @range;
print "Checking page(s) " . $range . ".\n";
## done with campdf

# for each string $s matching something:.../,
# find coordinates of bounding box
# add a link at that position, consisting of: http://my.org/$s
my $pdfa2 = PDF::API2->open($infile);
foreach my $pagenum (@range) {
    my ($x, $y);
    $x = 40;
    $y = 680;
    my $pdfpage = $pdfa2->openpage($pagenum);
    my $str = $pdfpage->text();
    ## HERE I NEED TO parse all text strings on the page
    ## but I don't know how to do that
    ## continuing with hard-wired example...
    $pdfpage->gfx->textlabel(
        $x, $y,
        $pdfa2->corefont('Arial', -encode => 'latin1'),
        10,
        "Link",
    );
    my $url = qq{ http://my.org/something};
    draw_url($pdfa2, $pagenum, [$x, $y, $x + 50, $y + 10], $url);
    $pdfpage->update();
    print "page $pagenum\n";
}

# save the page to a new file.
$pdfa2->saveas("doc/new.pdf");
$pdfa2->end;

sub draw_url {
    my ($pdf, $page_num, $dims, $url) = @_;
    my $page = $pdf->openpage($page_num);
    my $an = $page->annotation;
    $an->url($url, (-rect => $dims), (-border => [1, 1, 1]));
    $page->update;
}
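For the "extract all text from a single page and iterate through it" part of the question, one possible starting point is CAM::PDF's getPageText method, which returns the text of a page as a plain string without any coordinates. The sketch below is only an illustration: the helper name find_labels and the label pattern (two or more capital letters, a hyphen, digits) are assumptions, not something taken from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: return each distinct match of $pattern in $text,
# in the order in which it first appears.
sub find_labels {
    my ($text, $pattern) = @_;
    my (%seen, @found);
    while ($text =~ /($pattern)/g) {
        push @found, $1 unless $seen{$1}++;
    }
    return @found;
}

# Only touch CAM::PDF when a file name is given on the command line.
if (@ARGV) {
    require CAM::PDF;
    my ($infile, $pagenum) = @ARGV;
    my $doc = CAM::PDF->new($infile) or die "Can't open $infile\n";

    # getPageText gives the page text as one string, with no positions.
    my $text = $doc->getPageText($pagenum || 1);
    $text = '' unless defined $text;

    # Assumed label format, e.g. "AB-12"; adjust to the real strings.
    print "$_\n" for find_labels($text, qr/\b[A-Z]{2,}-\d+\b/);
}
```

This only answers which strings occur on a page. It does not give their positions, so the coordinates for the link rectangle still have to come from somewhere else.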

Re: PDF::API2 to search for text in PDF file
by bellaire (Hermit) on Mar 25, 2009 at 14:00 UTC
    There must be a better way.
    Not as far as I know. If you need to find not only the text but the actual positioning of the label containing that text, you're going to have to parse the entire structure. I don't know that doing that using Data::Dumper rather than CAM::PDF or PDF::API2's internal methods is a good idea, but no matter how you slice it, you basically have to mimic the rendering process (parsing the page tree) to get the actual page positions of the text.

    And even then, if you are searching for a substring you'll only have the position of the text container, not the position of the substring itself. To get the position of the substring would require actually rendering the PDF, complete with its fonts.
      Data::Dumper was only my first quick and dirty solution; I noticed that its output contains lots of stuff, and in places there is often something like

      #(250.00, 650.00) This is some pdftext

      The numbers in parentheses are presumably the position of the text on the page. With some assumptions (the font size is always 10-12, and the box width can also be constrained or guessed meaningfully), I could put a hyperlink there, which would be at approximately the right position.

      Afterwards, manual editing could remove the URL or change it. This would still be much quicker than setting all the URLs manually from scratch. I would also happily switch to a different tool that dumps text and position to a text file. For instance, we have Acrobat 8 here, but I haven't tried its JavaScript API. A table of
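A minimal sketch of that guess, assuming the dumped lines really look like the example above and that the coordinates are the lower-left corner of the text. The function names, the default font size of 10 and the average character width of half the font size are all assumptions made for illustration; real glyph widths depend on the font.

```perl
use strict;
use warnings;

# Parse one dumped line such as "#(250.00, 650.00) This is some pdftext".
# Returns a hash reference, or nothing if the line has a different shape.
sub parse_dump_line {
    my ($line) = @_;
    my $num = qr/-?\d+(?:\.\d+)?/;
    return unless $line =~ /^\s*#\(\s*($num)\s*,\s*($num)\s*\)\s*(.*?)\s*$/;
    return { x => $1 + 0, y => $2 + 0, text => $3 };
}

# Guess a link rectangle [x1, y1, x2, y2] for a piece of text.
# Width is estimated as characters * font size * average width ratio.
sub guess_rect {
    my ($x, $y, $text, $font_size, $char_width_ratio) = @_;
    $font_size        ||= 10;     # assumed default
    $char_width_ratio ||= 0.5;    # assumed average glyph width
    my $width = length($text) * $font_size * $char_width_ratio;
    return [ $x, $y, $x + $width, $y + $font_size ];
}

my $rec = parse_dump_line('#(250.00, 650.00) This is some pdftext');
if ($rec) {
    my $rect = guess_rect($rec->{x}, $rec->{y}, $rec->{text});
    print join(', ', @$rect), "\n";
}
```

The resulting array has the same shape as the `[$x, $y, $x + 50, $y + 10]` rectangle passed to draw_url in the original script, so it could be handed over in its place.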

      ### page ### position x,y ### matched text ####

      would suffice for a while. I could use this as input for my script.

      I still don't know what to do with text that wraps around on the page, though. These hyperlinks would be incomplete and hence invalid.

Re: PDF::API2 to search for text in PDF file
by knbknb (Acolyte) on Mar 26, 2009 at 17:42 UTC
    This is how I proceeded:

    I have used a freeware tool called "a-pdf text extractor" from a-pdf.com. This allows text to be extracted with position.

    Then I slurp in the resulting text file, remove duplicate entries and extract the lines that I'm interested in. Using a more advanced version of the code above, I can put links on some 50% of the pages in the PDF file. To get close to 100%, I had to patch one line of PDF::API2's source code.
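The slurp, de-duplicate and filter step might look roughly like the sketch below. The actual output format of the extractor is not shown in this thread, so the tab-separated layout (page, x, y, text) is purely an assumption, as are the function name select_links and the label pattern.

```perl
use strict;
use warnings;

# Hypothetical input: one line per text fragment, tab-separated as
#   page <TAB> x <TAB> y <TAB> text
# Returns one record per distinct line whose text matches $pattern.
sub select_links {
    my ($lines, $pattern, $base_url) = @_;
    my (%seen, @links);
    for my $line (@$lines) {
        chomp(my $copy = $line);
        next if $seen{$copy}++;    # drop verbatim duplicates
        my ($page, $x, $y, $text) = split /\t/, $copy, 4;
        next unless defined $text && $text =~ /($pattern)/;
        push @links, {
            page => $page,
            x    => $x,
            y    => $y,
            url  => $base_url . $1,
        };
    }
    return @links;
}

my @lines = (
    "27\t250.00\t650.00\tSample AB-12\n",
    "27\t250.00\t650.00\tSample AB-12\n",    # duplicate entry
    "27\t40.00\t700.00\tno label here\n",
);
for my $link (select_links(\@lines, qr/[A-Z]{2}-\d+/, 'http://my.org/')) {
    print "page $link->{page} at ($link->{x}, $link->{y}): $link->{url}\n";
}
```

Each record carries the page number and position needed to call something like draw_url from the original script.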

      Can you post your mods and code? I am interested in doing the same thing.