chariscomp has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to write a program that parses an HTML file and then does some text substitution, but I am running into a problem, perhaps because I don't understand how to use the HTML::TokeParser::Simple package. Here's what I am trying to do: loop through a file looking for either <span> or <div> tags, and check each one for a certain attribute (editable="true"). If it is present, I want to grab all the text (including any additional tags) between that tag and its matching closing tag. Some sample XHTML to illustrate:

<div class="content">
<span editable="true" id="nvumaincontent">stuff</span>
<span editable="true" optional="true" id="nvutest5"><span style="background: red;">more stuff</span><p>test</p></span><span editable="true" id="nvu56">even more stuff
<!-- this is the beginning of a comment -->
</span>
<div editable="true" optional="true" repeatable="true" movable="true" id="nvutest43">incredible boat loads of stuff
<!-- this is another comment -->
</div>
<div editable="true" id="anotherblock4">an unbelievable quantity of stuff!
<!-- yet another comment -->
<div id="newtest">Yo, dude!</div>
</div>
<!-- end main content -->
</div>

I am mostly getting the results I want, with one exception. On the lines above that have more than one </span> or </div> in a row, I am only able to get the first of those closing tags. Is there any way to tell whether a </span> or </div> tag actually goes with the relevant opening tag? My code follows:
use File::Find;
use strict;
use HTML::TokeParser::Simple;

#my $new_folder = 'new_html/';
my @html_docs = "test5.html";
our $spancontents="";
my @files;
my $ByteCount=0;
my $filelist="";
my $isflagon=0;
my $idflag;
my %spancontents;
my $templatelocation;
my $currentdoc;

foreach my $doc ( @html_docs ) {
    $currentdoc=$doc;
    my $p = HTML::TokeParser::Simple->new( file => $doc );
    while ( my $token = $p->get_token ) {
        if ($token->is_start_tag('span') or $token->is_start_tag('div')) {
            if ($token->get_attr('editable')=~/true/) {
                $isflagon=1;
                $idflag=$token->get_attr('id');
            }
        }
        if ( ($token->is_start_tag('span') and $isflagon) .. $token->is_end_tag('span') and $isflagon){
            my $text=$token->as_is;
            $spancontents.=$text.",";
            #next;
        }
        if ( ($token->is_start_tag('div') and $isflagon) .. $token->is_end_tag('div')){
            my $text=$token->as_is;
            $spancontents.=$text.",";
            #next; #not sure if needed, seems to mess things up
        }
        if (($token->is_end_tag('span') or $token->is_end_tag('div')) and $isflagon) {
            $isflagon=0;
            #$spancontents.=$token->as_is.","; #not sure if needed, seems to mess things up
            $spancontents{"$idflag"}.=$spancontents;
            $spancontents="";
        }
        if ($token->is_start_tag('html')) {
            my $attrs=$token->get_attr('templateref');
            $templatelocation=$attrs;
        }
    }
}

print "\n\n\n";
foreach my $value (keys %spancontents) {
    print "value is $value\n";
    print "\nMy $value = $spancontents{$value} \n\n-------------------------\n";
}

Here is some sample output using similar HTML as above:
value is anotherblock4

My anotherblock4 = <div editable="true" id="anotherblock4">,an, unbelievable quantity of stuff!
,<!-- yet another comment -->,
,<div id="newtest">,Yo, dude!,</div>,

-------------------------
value is nvutest43

My nvutest43 = <div editable="true" optional="true" repeatable="true" movable="true" id="nvutest43">,incredible boat loads of stuff
,<!-- this is another comment -->,
,</div>,

-------------------------
value is nvutest5

My nvutest5 = <span editable="true" optional="true" id="nvutest5">,<span style="background: red;">,more stuff,</span>,

-------------------------
value is nvumaincontent

My nvumaincontent = <span editable="true" id="nvumaincontent">,stuff,</span>,

-------------------------
value is nvu56

My nvu56 = <span editable="true" id="nvu56">,even more stuff
,<!-- this is the beginning of a comment -->,
,</span>,

-------------------------

Notice that there is only one closing div or span tag in the nvutest5 and anotherblock4 sections. There should be two of them (i.e. two </div> tags or two </span> tags). My bottom-line question is this: when I detect a closing tag with is_end_tag, how can I tell which opening tag it belongs to? Thanks for any help anyone can give, and thanks for making this module available.

Joshua Cook

Edit by castaway - Added readmore tags

Replies are listed 'Best First'.
Re: Question regarding use of HTML::TokeParser::Simple
by tphyahoo (Vicar) on May 17, 2005 at 08:32 UTC
    I'm no expert on the HTML::Parser::* modules, but to get better answers to your post, instead of using
    foreach my $doc ( @html_docs ) {
    you would probably be better off posting a script that includes the *input* that triggers the problem. I like having the input read in from <DATA>, for example as I did in Celsius to Fahrenheit using s///

    Then the monks don't have to create test input to help you troubleshoot your problem.
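
    To make that concrete, here is a minimal sketch of such a self-contained script, reading a cut-down version of the sample from <DATA>. It also shows one possible way to pair a closing tag with its opening tag: count nesting depth for the tag name that opened the editable block. The names editable_blocks, %blocks and $depth are made up for illustration, and the sketch assumes well-formed markup and a version of HTML::TokeParser::Simple that accepts the named handle argument to new. It is not tested against the original test5.html.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: returns id => markup for every editable span/div,
# including the matching closing tag. Assumes well-formed markup.
sub editable_blocks {
    my ($parser) = @_;
    my %blocks;          # id => captured markup
    my ( $id, $tag );    # id and tag name of the block being captured
    my $depth = 0;       # how many <$tag> elements are currently open

    while ( my $token = $parser->get_token ) {
        if ( !$depth ) {
            # Not capturing yet: wait for an editable <span> or <div>.
            next unless $token->is_start_tag('span')
                     or $token->is_start_tag('div');
            next unless ( $token->get_attr('editable') || '' ) eq 'true';
            ( $id, $tag ) = ( $token->get_attr('id'), $token->get_tag );
            $depth = 1;
            $blocks{$id} = $token->as_is;
            next;
        }

        # Capturing: keep every token, and track nesting of the same tag
        # name so that only the matching closing tag ends the block.
        $blocks{$id} .= $token->as_is;
        $depth++ if $token->is_start_tag($tag);
        $depth-- if $token->is_end_tag($tag);
    }
    return %blocks;
}

my %blocks = editable_blocks( HTML::TokeParser::Simple->new( handle => \*DATA ) );
print "$_ = $blocks{$_}\n\n" for sort keys %blocks;

__DATA__
<div class="content">
<span editable="true" id="nvutest5"><span style="background: red;">more stuff</span><p>test</p></span>
<div editable="true" id="anotherblock4">an unbelievable quantity of stuff!
<div id="newtest">Yo, dude!</div>
</div>
</div>
```

    Because the depth counter only goes back to zero at the closing tag that balances the editable opening tag, the inner </span> and </div> no longer end the capture early. One limitation of this sketch: an editable block nested inside another editable block is captured as part of the outer one, not as its own entry.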