in reply to Trouble in text manipulation
I think this bit from the OP is the "goal", but I'm not quite sure what you mean by this:
Am trying to club the <text entries together those having same y & page attribute values from the below input.
Do you mean: if there are two (or three, or more) "text" elements within one "font" element, and they all have the same attribute values for "y" and "page", you want them to be collapsed together into a single "text" element? If that's what you mean, that's an intriguing task, which I was able to solve using XML::LibXML. (Other monks could probably do it more neatly -- "I am just an egg" when it comes to DOM manipulation...)
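To pin down that reading of the goal, here is a minimal sketch of the test alone, separate from any rewriting. The helper name `texts_share_line` is hypothetical (it is not from the OP or from XML::LibXML), and the sample markup is trimmed down from the OP's data; `findnodes` and `getAttribute` are regular XML::LibXML methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: true when a <font> element has two or more <text>
# children and all of them carry the same "y" and "page" values.
sub texts_share_line {
    my ($font) = @_;
    my %seen;
    my @texts = $font->findnodes('./text');
    for my $t (@texts) {
        # Build one key per element; a missing attribute counts as ''
        my $key = join "\0", map { defined $_ ? $_ : '' }
            $t->getAttribute('y'), $t->getAttribute('page');
        $seen{$key}++;
    }
    return @texts > 1 && keys(%seen) == 1;
}

# Sample trimmed from the OP's data (most attributes dropped for brevity)
my $doc = XML::LibXML->new->parse_string(<<'XML');
<doc>
<font><text y="200" page="vii">Part I</text><text y="200" page="vii">Introduction</text></font>
<font><text y="234" page="vii">History</text><text y="246" page="vii">Module</text></font>
</doc>
XML

my $n = 0;
for my $font ( $doc->findnodes('//font') ) {
    $n++;
    printf "font %d: %s\n", $n,
        texts_share_line($font) ? 'collapse' : 'leave alone';
}
```

Under this reading, the first "font" element qualifies and the second does not, because its two "text" elements sit at different "y" values.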
In order to get that data to work with XML::LibXML (or any XML parser), I needed to add a "root" element around the set of "font" elements. Here's the code with the parsable version of the data attached:
(update: streamlined the logic for concatenating textContent values onto $matched_content)

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# slurp the whole DATA section into one string
my $xmlstring;
{ local $/; $xmlstring = <DATA>; }

my $xml = XML::LibXML->new;
my $doc = $xml->parse_string( $xmlstring );
# you can do $xml->parse_file( "filename" ) instead

for my $font_node ( $doc->findnodes( "//font" )) {
    my %attr_val;
    my $matched_text_nodes = 1;
    my $matched_content;
    my @text_nodes = $font_node->findnodes( "./text" );
    for my $tnode ( @text_nodes ) {
        my @atts = $tnode->attributes;
        # list-context assignment captures the matching index;
        # in scalar context grep returns only the number of matches
        my ( $y_indx ) = grep { $atts[$_]->nodeName eq 'y' } 0 .. $#atts;
        my ( $p_indx ) = grep { $atts[$_]->nodeName eq 'page' } 0 .. $#atts;
        if ( ! keys %attr_val ) { # first text_node
            $attr_val{y} = $atts[$y_indx]->textContent;
            $attr_val{p} = $atts[$p_indx]->textContent;
        }
        elsif ( $attr_val{y} ne $atts[$y_indx]->textContent or
                $attr_val{p} ne $atts[$p_indx]->textContent ) {
            $matched_text_nodes = 0;
        }
        if ( $matched_text_nodes ) {
            $matched_content .= $tnode->textContent . " ";
        }
    }
    # collapse only when every text node matched (and there is at least one)
    if ( $matched_text_nodes and @text_nodes ) {
        $text_nodes[0]->firstChild->setData( $matched_content );
        $font_node->removeChild( $_ ) for ( @text_nodes[1..$#text_nodes] );
    }
}
print $doc->toString;

__DATA__
<doc>
<font size="12" face="IJCINN+AvantGarde-Bold" color="#1BADEB">
<text x="198" y="200" width="32" height="12" page="vii">Part I</text>
<text x="242" y="200" width="75" height="12" page="vii">Introduction</text>
<text x="329" y="200" width="7" height="12" page="vii">2</text>
</font>
<font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
<text x="183" y="221" width="47" height="9" page="vii">Chapter 1</text>
</font>
<font size="10" face="IJCIOP+Frutiger-Light" color="#231F20">
<text x="242" y="220" width="121" height="10" page="vii">Managers and Management</text>
<text x="373" y="220" width="6" height="10" page="vii">2</text>
</font>
<font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
<text x="198" y="234" width="32" height="9" page="vii">History</text>
<text x="195" y="246" width="35" height="9" page="vii">Module</text>
</font>
<font size="12" face="IJCINN+AvantGarde-Bold" color="#1BADEB">
<text x="194" y="292" width="36" height="12" page="vii">Part II</text>
<text x="242" y="292" width="54" height="12" page="vii">Planning</text>
<text x="308" y="292" width="15" height="12" page="vii">56</text>
</font>
</doc>
Output:

<?xml version="1.0"?>
<doc>
<font size="12" face="IJCINN+AvantGarde-Bold" color="#1BADEB">
<text x="198" y="200" width="32" height="12" page="vii">Part I Introduction 2 </text>
</font>
<font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
<text x="183" y="221" width="47" height="9" page="vii">Chapter 1 </text>
</font>
<font size="10" face="IJCIOP+Frutiger-Light" color="#231F20">
<text x="242" y="220" width="121" height="10" page="vii">Managers and Management 2 </text>
</font>
<font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
<text x="198" y="234" width="32" height="9" page="vii">History</text>
<text x="195" y="246" width="35" height="9" page="vii">Module</text>
</font>
<font size="12" face="IJCINN+AvantGarde-Bold" color="#1BADEB">
<text x="194" y="292" width="36" height="12" page="vii">Part II Planning 56 </text>
</font>
</doc>
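If the goal turns out to be looser than that, i.e. merge whichever neighbouring "text" elements share "y" and "page" even when the rest of the "font" element does not match, a variant along these lines might do. This is only a sketch under that assumption, not what the code above does: `wrap_in_root` and `merge_text_runs` are hypothetical names, and the sample fragment is a made-up arrangement of values borrowed from the data above. It also shows the root-wrapping step done in code rather than by hand; note that simple wrapping would not be well-formed if the fragment began with its own `<?xml ...?>` declaration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: wrap a rootless run of elements in one container
# element so that an XML parser will accept it.
sub wrap_in_root {
    my ( $fragment, $root ) = @_;
    $root = 'doc' unless defined $root;
    return "<$root>\n$fragment\n</$root>\n";
}

# Hypothetical helper: within each <font>, fold every run of consecutive
# <text> children sharing "y" and "page" into the first element of the run.
sub merge_text_runs {
    my ($doc) = @_;
    for my $font ( $doc->findnodes('//font') ) {
        my ( $keeper, $keeper_key );
        for my $tnode ( $font->findnodes('./text') ) {
            my $key = join "\0", map { defined $_ ? $_ : '' }
                $tnode->getAttribute('y'), $tnode->getAttribute('page');
            if ( defined $keeper and $key eq $keeper_key ) {
                # same line, same page: append the text, drop the element
                $keeper->appendText( ' ' . $tnode->textContent );
                $font->removeChild($tnode);
            }
            else {
                # a new run starts here
                ( $keeper, $keeper_key ) = ( $tnode, $key );
            }
        }
    }
    return $doc;
}

# Made-up arrangement for illustration: two elements on one line, one below
my $fragment = <<'XML';
<font size="12">
<text x="198" y="200" page="vii">Part I</text>
<text x="242" y="200" page="vii">Introduction</text>
<text x="198" y="234" page="vii">History</text>
</font>
XML

my $doc = XML::LibXML->new->parse_string( wrap_in_root($fragment) );
merge_text_runs($doc);
print $doc->toString;
```

On this sample the first two "text" elements end up as one and "History" stays separate. The whitespace between removed elements is left in place, so the printed XML will contain some blank lines.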
Re^2: Trouble in text manipulation
by thirilog (Acolyte) on Aug 06, 2010 at 11:00 UTC
by graff (Chancellor) on Aug 06, 2010 at 16:48 UTC