I want to add hyperlinks to certain strings on a complex PDF page. I don't have access to the source document the PDF was generated from.

How can I search for text strings on a page of an existing PDF file? I also need the position of each text label so that I can place the hyperlink over it. Please see the code below.

Is there a method to extract all text from a single PDF page and iterate through it? Or should I work with the structure that Data::Dumper prints? I know from grepping through the Data::Dumper output that it contains what I need, but the structure is too complex. There must be a better way.

#!/usr/bin/perl -w
use CAM::PDF;
use Getopt::Long;
use Pod::Usage;
use English qw(-no_match_vars);
use PDF::API2;
use Data::Dumper;
use strict;

our $VERSION = '0.02';

my %opts = (
    infile  => undef,
    pagenum => undef,
    verbose => 0,
    help    => 0,
    version => 0,
);

Getopt::Long::Configure('bundling');
GetOptions(
    'i|infile=s'  => \$opts{infile},
    'p|pagenum=s' => \$opts{pagenum},
    'v|verbose'   => \$opts{verbose},
    'h|help'      => \$opts{help},
    'V|version'   => \$opts{version},
) or pod2usage(1);

if ($opts{help}) {
    pod2usage(-exitstatus => 0, -verbose => 2);
}
if ($opts{version}) {
    print "v$VERSION\n";
    exit 0;
}
unless ($opts{infile}) {
    print "Missing option -i (infile) \n";
    die;
}
$opts{pagenum} ||= 1;

# get contents of e.g. page #27
# parse strings:
my $infile = $opts{infile};
my $campdf = CAM::PDF->new($infile) or die "Can't open infile: $!\n";
my $numpages = $campdf->numPages();
print "Document $infile has $numpages pages.\n";

#rangeToArray ($pkg_or_doc, $min, $max, @range_parts) = @_;
my @range = CAM::PDF->rangeToArray(1, $numpages, "$opts{pagenum}");
my $range = join ", ", @range;
print "Checking page(s) " . $range . ".\n";
## done with campdf

# for each string $s matching something:.../,
# find coordinates of bounding box
# add a link at that position, consisting of: http://my.org/$s
my $pdfa2 = PDF::API2->open($infile);
foreach my $pagenum (@range) {
    my ($x, $y);
    $x = 40;
    $y = 680;
    my $pdfpage = $pdfa2->openpage($pagenum);
    my $str = $pdfpage->text();
    ## HERE I NEED TO parse all text strings on the page
    ## but I don't know how to do that
    ## continuing with hard-wired example...
    $pdfpage->gfx->textlabel(
        $x, $y,
        $pdfa2->corefont('Arial', -encode => 'latin1'),
        10,
        "Link",
    );
    my $url = qq{ http://my.org/something};
    draw_url($pdfa2, $pagenum, [$x, $y, $x + 50, $y + 10], $url);
    $pdfpage->update();
    print "page $pagenum\n";
}

# save the page to a new file.
$pdfa2->saveas("doc/new.pdf");
$pdfa2->end;

sub draw_url {
    my ($pdf, $page_num, $dims, $url) = @_;
    my $page = $pdf->openpage($page_num);
    my $an = $page->annotation;
    $an->url($url, (-rect => $dims), (-border => [1, 1, 1]));
    $page->update;
}
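
For the text-search half of the problem, here is a minimal sketch of the kind of thing I am after. It assumes that CAM::PDF's getPageText method returns the page's text as one plain string, which would be enough for matching but gives no coordinates, so it does not solve the positioning part. The helper find_labels and the label pattern ID-\d+ are hypothetical names made up for illustration, not anything from the real document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every substring of $text that matches
# $pattern, in the order the matches occur.
sub find_labels {
    my ($text, $pattern) = @_;
    return () unless defined $text;
    my @hits;
    while ($text =~ /($pattern)/g) {
        push @hits, $1;
    }
    return @hits;
}

# Only touch a PDF when a file name is given on the command line.
if (@ARGV) {
    require CAM::PDF;    # loaded lazily so find_labels() can be tried alone
    my ($infile, $pagenum) = @ARGV;
    $pagenum ||= 1;

    my $doc = CAM::PDF->new($infile) or die "Can't open $infile\n";

    # Plain text of one page; no position information comes with it.
    my $text = $doc->getPageText($pagenum);

    # Assumed label format for illustration: "ID-" followed by digits.
    for my $label (find_labels($text, qr/ID-\d+/)) {
        print "page $pagenum: $label -> http://my.org/$label\n";
    }
}
```

Run as `perl find_labels.pl infile.pdf 27` (the script name is arbitrary), this would list the matching strings on page 27 with the URL each one should point to. The bounding box for each match would still have to come from somewhere else.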

In reply to PDF::API2 to search for text and place hyperlinks in PDF file by knbknb
