I am writing a program that parses an HTML file and then does some text substitution, but I am running into a problem, perhaps because I don't understand how to use the HTML::TokeParser::Simple package. Here's what I am trying to do: loop through a file looking for either <span> or <div> tags and check whether each one carries a certain attribute (editable="true"). If it does, I want to grab everything (text and any nested tags) between that tag and its matching closing tag. Some sample XHTML to illustrate:

<div class="content">
  <span editable="true" id="nvumaincontent">stuff</span>
  <span editable="true" optional="true" id="nvutest5"><span style="background: red;">more stuff</span><p>test</p></span><span editable="true" id="nvu56">even more stuff
    <!-- this is the beginning of a comment -->
  </span>
  <div editable="true" optional="true" repeatable="true" movable="true" id="nvutest43">incredible boat loads of stuff
    <!-- this is another comment -->
  </div>
  <div editable="true" id="anotherblock4">an unbelievable quantity of stuff!
    <!-- yet another comment -->
    <div id="newtest">Yo, dude!</div>
  </div>
  <!-- end main content -->
</div>

I am mostly getting the results I want, with one exception. Where the sample above has more than one </span> or </div> belonging to the same block, I only capture up to the first of those closing tags. Is there any way to tell whether a </span> or </div> tag actually goes with the relevant opening tag? My code follows:

use File::Find;
use strict;
use HTML::TokeParser::Simple;

#my $new_folder = 'new_html/';
my @html_docs = "test5.html";
our $spancontents="";
my @files;
my $ByteCount=0;
my $filelist="";
my $isflagon=0;
my $idflag;
my %spancontents;
my $templatelocation;
my $currentdoc;

foreach my $doc ( @html_docs ) {
    $currentdoc=$doc;
    my $p = HTML::TokeParser::Simple->new( file => $doc );
    while ( my $token = $p->get_token ) {
        if ($token->is_start_tag('span') or $token->is_start_tag('div')) {
            if ($token->get_attr('editable')=~/true/) {
                $isflagon=1;
                $idflag=$token->get_attr('id');
            }
        }
        if ( ($token->is_start_tag('span') and $isflagon) .. $token->is_end_tag('span') and $isflagon){
            my $text=$token->as_is;
            $spancontents.=$text.",";
            #next;
        }
        if ( ($token->is_start_tag('div') and $isflagon) .. $token->is_end_tag('div')){
            my $text=$token->as_is;
            $spancontents.=$text.",";
            #next; #not sure if needed, seems to mess things up
        }
        if (($token->is_end_tag('span') or $token->is_end_tag('div')) and $isflagon) {
            $isflagon=0;
            #$spancontents.=$token->as_is.","; #not sure if needed, seems to mess things up
            $spancontents{"$idflag"}.=$spancontents;
            $spancontents="";
        }
        if ($token->is_start_tag('html')) {
            my $attrs=$token->get_attr('templateref');
            $templatelocation=$attrs;
        }
    }
}
print "\n\n\n";
foreach my $value (keys %spancontents) {
    print "value is $value\n";
    print "\nMy $value = $spancontents{$value} \n\n-------------------------\n";
}

Here is some sample output using similar HTML as above:

value is anotherblock4

My anotherblock4 = <div editable="true" id="anotherblock4">,an, unbelievable quantity of stuff! ,<!-- yet another comment -->, ,<div id="newtest">,Yo, dude!,</div>,

-------------------------
value is nvutest43

My nvutest43 = <div editable="true" optional="true" repeatable="true" movable="true" id="nvutest43">,incredible boat loads of stuff ,<!-- this is another comment -->, ,</div>,

-------------------------
value is nvutest5

My nvutest5 = <span editable="true" optional="true" id="nvutest5">,<span style="background: red;">,more stuff,</span>,

-------------------------
value is nvumaincontent

My nvumaincontent = <span editable="true" id="nvumaincontent">,stuff,</span>,

-------------------------
value is nvu56

My nvu56 = <span editable="true" id="nvu56">,even more stuff ,<!-- this is the beginning of a comment -->, ,</span>,

-------------------------

Notice that there is only one closing div or span tag under sections nvutest5 and anotherblock4. There should be two of them (i.e. two </span> tags or two </div> tags). My bottom-line question is this: how can I tell which opening tag a closing tag (as detected with is_end_tag) belongs to? Thanks for any help anyone can give, and thanks for making this module available.

Joshua Cook
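One possible way to pair a closing tag with its opening tag is to count nesting depth rather than stop at the first end tag. The sketch below is only an illustration under stated assumptions, not the module's prescribed approach: `extract_editable` is a hypothetical helper name, the input is passed as a string instead of a file, every editable block is assumed to have an `id`, and the markup is assumed to be well formed with no self-closing `<span/>` or `<div/>` elements. It uses only the token methods already seen in the code above (`is_start_tag`, `is_end_tag`, `get_attr`, `as_is`).

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: returns a hash ref mapping the id of each editable
# <span>/<div> to its full markup, from the opening tag through the
# closing tag that actually matches it.
sub extract_editable {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html );

    my %contents;     # id => captured markup
    my $id;           # id of the block being captured (undef when idle)
    my $tag;          # 'span' or 'div': the tag name that opened the block
    my $depth = 0;    # currently open elements named $tag, block included

    while ( my $token = $p->get_token ) {
        if ( !defined $id ) {
            # Idle: look for an editable span or div that has an id.
            for my $name (qw(span div)) {
                next unless $token->is_start_tag($name);
                my $editable = $token->get_attr('editable');
                next unless defined $editable && $editable eq 'true';
                my $found = $token->get_attr('id');
                next unless defined $found;    # assumption: skip blocks without an id
                ( $id, $tag, $depth ) = ( $found, $name, 1 );
                $contents{$id} = $token->as_is;
                last;
            }
            next;
        }

        # Capturing: keep every token and track nesting of the same tag name.
        $contents{$id} .= $token->as_is;
        if    ( $token->is_start_tag($tag) ) { $depth++ }
        elsif ( $token->is_end_tag($tag) )   { $depth-- }

        # Depth 0 means the end tag just appended closes the opening tag.
        undef $id if $depth == 0;
    }
    return \%contents;
}

# Example use on a fragment shaped like the nvutest5 block above.
my $blocks = extract_editable(
    '<span editable="true" id="nvutest5"><span>more stuff</span><p>test</p></span>'
);
print "$_ = $blocks->{$_}\n" for sort keys %$blocks;
```

Only tags with the same name as the opening tag change the counter, so a `<p>` or a comment inside the block is captured without affecting the match. An editable block nested inside another editable block would be captured as part of the outer one in this sketch; handling that case separately would need a stack of open blocks instead of a single counter.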

Edit by castaway - Added readmore tags


In reply to Question regarding use of HTML::Tokeparser::Simple by chariscomp
