in reply to Read highlighted text from PDF
If "highlighted text" is done, as it should be, with a "colored overlay" (a Highlight annotation), then you can extract the highlights' positions like this:
use strict;
use warnings;
use feature 'say';
use CAM::PDF;

my $pdf = CAM::PDF-> new( $ARGV[ 0 ]) or die;
my $page = $pdf-> getPage( 1 );
my $anns = $pdf-> getValue( $page-> { Annots } or die );

for ( @$anns ) {
    my $ann = $pdf-> getValue( $_ );
    next unless $pdf-> getValue( $ann-> { Subtype }) eq 'Highlight';
    say $ann;
    say "\t$_" for map $pdf-> getValue( $_ ),
        @{ $pdf-> getValue( $ann-> { QuadPoints })}
}

__END__
HASH(0xd79f0c)
	237.641
	651.308
	271.059
	651.308
	237.641
	641.602
	271.059
	641.602
	61.4118
	637.963
	92.1406
	637.963
	61.4118
	628.257
	92.1406
	628.257
HASH(0xe8f43c)
	288.529
	611.271
	320.753
	611.271
	288.529
	601.566
	320.753
	601.566
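Large PDF. They still haven't fixed the wrong order of points in that picture, so take care: in the output above, each quad comes as upper-left, upper-right, lower-left, lower-right.

Also extract XML with each character's bounding box coordinates. Doing it with pure Perl is possible, but involves too much low-level work. I prefer mutool (mudraw, if older versions are packaged for your OS) and its "stext" output. Ghostscript might also do. Adjust for (0,0) being the upper-left page corner in that output, whereas the QuadPoints use PDF coordinates with the origin at the lower left.

Walk over the character nodes: start extracting text when a character's bounding box is inside any quad, and keep going until you leave that quad. Continue till the end of the page. It's really easy.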
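A minimal sketch of that walk, assuming the character boxes have already been pulled out of the XML into a list of hashes. The `c`/`bbox` layout, the sample characters, the helper names, the 0.5 pt tolerance and the 792 pt page height are all made up for illustration. Only the quad numbers come from the output above. It also assumes unrotated text, so each quad can be reduced to a plain rectangle, and it uses min/max so the order of points within a quad doesn't matter.

```perl
use strict;
use warnings;
use List::Util qw( min max );

# Split a flat QuadPoints list into rectangles [ x0, y0, x1, y1 ],
# flipped to a top-left origin (y grows downward).
sub quads_to_rects {
    my ( $page_height, @pts ) = @_;
    my @rects;
    while ( my @q = splice @pts, 0, 8 ) {
        my @x = @q[ 0, 2, 4, 6 ];
        my @y = map { $page_height - $_ } @q[ 1, 3, 5, 7 ];
        push @rects, [ min( @x ), min( @y ), max( @x ), max( @y ) ];
    }
    return @rects;
}

# Index of the first rectangle containing the character box, or -1.
sub rect_of {
    my ( $bb, @rects ) = @_;
    my $tol = 0.5;    # slack in points, an arbitrary choice
    for my $i ( 0 .. $#rects ) {
        my $r = $rects[ $i ];
        return $i
            if  $bb->[ 0 ] >= $r->[ 0 ] - $tol
            and $bb->[ 1 ] >= $r->[ 1 ] - $tol
            and $bb->[ 2 ] <= $r->[ 2 ] + $tol
            and $bb->[ 3 ] <= $r->[ 3 ] + $tol;
    }
    return -1;
}

# Walk characters in reading order: open a new run on entering a quad,
# append while inside it, stop on leaving.
sub highlighted_runs {
    my ( $rects, $chars ) = @_;
    my @runs;
    my $current = -1;
    for my $ch ( @$chars ) {
        my $i = rect_of( $ch->{ bbox }, @$rects );
        if ( $i >= 0 ) {
            push @runs, '' if $i != $current;
            $runs[ -1 ] .= $ch->{ c };
        }
        $current = $i;
    }
    return @runs;
}

# Hypothetical character boxes, top-left origin, in reading order.
my @chars = (
    { c => 'a', bbox => [ 230, 141, 236, 150 ] },
    { c => 'f', bbox => [ 238, 141, 244, 150 ] },
    { c => 'o', bbox => [ 245, 141, 251, 150 ] },
    { c => 'o', bbox => [ 252, 141, 258, 150 ] },
    { c => 'x', bbox => [ 272, 141, 278, 150 ] },
    { c => 'b', bbox => [  62, 155,  68, 163 ] },
    { c => 'a', bbox => [  69, 155,  75, 163 ] },
    { c => 'r', bbox => [  76, 155,  82, 163 ] },
);

# The two quads of the first annotation above; 792 is an assumed page height.
my @rects = quads_to_rects( 792,
    237.641, 651.308, 271.059, 651.308, 237.641, 641.602, 271.059, 641.602,
    61.4118, 637.963, 92.1406, 637.963, 61.4118, 628.257, 92.1406, 628.257,
);

print "$_\n" for highlighted_runs( \@rects, \@chars );
```

With real data you would fill the character list from the stext XML and take the page height from the page's MediaBox. The tolerance is there because glyph boxes often poke slightly outside the highlight quad.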