If "highlighted text" is done, as it should be, with a "colored overlay" (a Highlight annotation), then you can extract the highlights' positions like this:

    use strict;
    use warnings;
    use feature 'say';
    use CAM::PDF;

    my $pdf = CAM::PDF-> new( $ARGV[ 0 ]) or die;
    my $page = $pdf-> getPage( 1 );

    # die if the page has no annotations at all
    my $anns = $pdf-> getValue( $page-> { Annots } or die );

    for ( @$anns ) {
        my $ann = $pdf-> getValue( $_ );
        next unless $pdf-> getValue( $ann-> { Subtype }) eq 'Highlight';
        say $ann;
        # QuadPoints is a flat array, 8 numbers (4 points) per quad
        say "\t$_" for map $pdf-> getValue( $_ ), @{ $pdf-> getValue( $ann-> { QuadPoints })}
    }

    __END__

    HASH(0xd79f0c)
            237.641
            651.308
            271.059
            651.308
            237.641
            641.602
            271.059
            641.602
            61.4118
            637.963
            92.1406
            637.963
            61.4118
            628.257
            92.1406
            628.257
    HASH(0xe8f43c)
            288.529
            611.271
            320.753
            611.271
            288.529
            601.566
            320.753
            601.566

Large pdf. They still haven't fixed the wrong order of points in that picture, so take care: in the output above each quad comes as upper-left, upper-right, lower-left, lower-right.

Also extract XML with the bounding box coordinates of each character. Doing it in pure Perl is possible, but involves too much low-level work. I prefer mutool (mudraw, if an older version is packaged for your OS) and its "stext" output; GS might also do. Adjust for (0,0) being the upper left page corner there. Walk over the character nodes: start extracting text when a character's bounding box is inside any quad, and keep going until you leave that quad. Continue till the page end. It's really easy.
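The walk described above could look roughly like the sketch below. It is only an illustration under several assumptions: the quads are axis-aligned (unrotated text), the characters have already been pulled out of the stext XML into a list of hashes with hypothetical keys `c` and `bbox` (this is not mutool's actual schema, which differs between versions), the page height of 792 points is made up, and the 1 point tolerance is a guess you would tune. The helper names `quads_to_rects`, `inside` and `highlighted_text` are mine. Taking min/max over the four points sidesteps the point-order question.

```perl
use strict;
use warnings;

# Flat QuadPoints list (8 numbers per quad, PDF coordinates, y grows upward)
# -> rectangles [ xmin, ymin, xmax, ymax ] with (0,0) at the upper left corner.
# Assumes the list length is a multiple of 8 and the quads are axis-aligned.
sub quads_to_rects {
    my ( $page_height, @pts ) = @_;
    my @rects;
    while ( my @q = splice @pts, 0, 8 ) {
        my @x = sort { $a <=> $b } @q[ 0, 2, 4, 6 ];
        my @y = sort { $a <=> $b } map { $page_height - $_ } @q[ 1, 3, 5, 7 ];
        push @rects, [ $x[0], $y[0], $x[-1], $y[-1] ];
    }
    return @rects;
}

# True if the character box lies within the rectangle, with some slack.
sub inside {
    my ( $box, $rect, $tol ) = @_;
    return $box->[0] >= $rect->[0] - $tol
        && $box->[1] >= $rect->[1] - $tol
        && $box->[2] <= $rect->[2] + $tol
        && $box->[3] <= $rect->[3] + $tol;
}

# Walk the characters in document order: start a run when a character
# enters a quad, extend it while we stay in that quad, end it when we leave.
sub highlighted_text {
    my ( $rects, $chars, $tol ) = @_;
    $tol //= 1;    # assumed slack in points
    my @runs;
    my $current;   # index of the quad we are currently inside, if any
    for my $ch ( @$chars ) {
        my ( $hit ) = grep { inside( $ch->{bbox}, $rects->[$_], $tol ) } 0 .. $#$rects;
        if ( defined $hit and defined $current and $hit == $current ) {
            $runs[-1] .= $ch->{c};    # still inside the same quad
        }
        elsif ( defined $hit ) {
            push @runs, $ch->{c};     # entered a quad: new run
        }
        $current = $hit;              # undef once we have left the quad
    }
    return @runs;
}

# First quad from the output above; page height and characters are made up.
my @rects = quads_to_rects( 792,
    237.641, 651.308, 271.059, 651.308, 237.641, 641.602, 271.059, 641.602 );
my @chars = (
    { c => 'a', bbox => [ 230, 141, 236, 150 ] },
    { c => 'b', bbox => [ 238, 141, 244, 150 ] },
    { c => 'c', bbox => [ 245, 141, 251, 150 ] },
    { c => 'd', bbox => [ 272, 141, 278, 150 ] },
);
print "$_\n" for highlighted_text( \@rects, \@chars );    # prints "bc"
```

Real character boxes are often taller than the highlight quad, so if the strict containment test misses characters, testing only the centre of each box against the quad may work better.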


In reply to Re: Read highlighted text from PDF by vr
in thread Read highlighted text from PDF by IB2017
