PerlMonks  

Converting span tags.

by the_0ne (Pilgrim)
on Mar 27, 2006 at 16:12 UTC ( [id://539474]=perlquestion )

the_0ne has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am in need of assistance. I have a task where I need to convert spans in a chunk of HTML to bold, underline and/or italic tags. My main problem is nested span tags: a chunk of text can be bolded, with one word in the middle of that span italicized. I cannot figure out how to write something that will take care of that nesting. I have tried homegrown Perl code and also HTML::Parser, to no avail...

Here is my sample text...
this <span style="font-weight: bold;">is</span> some <span style="font-weight: bold;">test <span style="font-style: italic;">text</span> <span style="text-decoration: underline;">for</span> bolding</span>, underlining and italicizing text.<br />
The script needs to change it to this...
this <b>is</b> some <b>test <i>text</i> <u>for</u> bolding</b>, underlining and italicizing text.<br />
I have no trouble with an open span (the formatting), some text and then a close span. My main problem is nested spans: I can't figure out how to match up opening spans with their respective closing spans.

Thanks in advance Monks for your service.

Replies are listed 'Best First'.
Re: Converting span tags.
by eric256 (Parson) on Mar 27, 2006 at 16:18 UTC

    I think you could use HTML::TreeBuilder to build a tree of the HTML. Then scan your tree, converting spans with specific attributes to b, i, and u tags as needed. Then flatten the tree back into HTML.


    ___________
    Eric Hodges
Re: Converting span tags.
by davidrw (Prior) on Mar 27, 2006 at 16:43 UTC
    You could just do it iteratively instead of recursively (also assuming this goes in the one-time quick-n-dirty clean-up category), i.e. find all the inner <span ...>foo</span> instances (with the help of a negative look-ahead; see perlre) and replace them, then repeat until there are no more matches.
    my $s = do {local $/=undef; <DATA>};
    while( $s =~ s#<span style="(font-weight: bold;|font-style: italic;|text-decoration: underline;)">(?!.*?<span)(.*?)</span>#span2tag($1,$2)#sgei ){};
    print $s;

    sub span2tag {
      my ($attr, $s) = @_;
      return "<b>$s</b>" if $attr =~ /bold/;
      return "<i>$s</i>" if $attr =~ /italic/;
      return "<u>$s</u>" if $attr =~ /underline/;
      return $s;
    }
    __DATA__
    this <span style="font-weight: bold;">is</span> some <span style="font-weight: bold;">test <span style="font-style: italic;">text</span> <span style="text-decoration: underline;">for</span> bolding</span>, underlining and italicizing text.<br />
    Update: Probably a little less efficient, but clearer code:
    my $s = do {local $/=undef; <DATA>};
    while(1){
      my $matched = 0;
      $matched ||= $s =~ s#<span style="font-weight: bold;">(?!.*?<span)(.*?)</span>#<b>$1</b>#sgi;
      $matched ||= $s =~ s#<span style="font-style: italic;">(?!.*?<span)(.*?)</span>#<i>$1</i>#sgi;
      $matched ||= $s =~ s#<span style="text-decoration: underline;">(?!.*?<span)(.*?)</span>#<u>$1</u>#sgi;
      last unless $matched;
    }
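A possible variant of the loop above, offered as a sketch rather than as davidrw's code. With the /s flag, (?!.*?<span) looks ahead through the rest of the string, so each pass appears to convert only the last remaining span. Checking the look-ahead at every character of the span body instead lets a single pass convert every innermost span. The names convert_spans and wrap_in_tag are made up for illustration, the pattern assumes lowercase tags with a double-quoted style attribute, and dropping a span whose style is not recognised (while keeping its text) is an assumption.

```perl
use strict;
use warnings;

# The three style declarations from the thread and the tags they map to.
my %tag_for = (
    'font-weight: bold;'          => 'b',
    'font-style: italic;'         => 'i',
    'text-decoration: underline;' => 'u',
);

# Hypothetical helper: wrap $body in the tag for $style, or return the
# bare text when the style is not one we know (an assumption).
sub wrap_in_tag {
    my ( $style, $body ) = @_;
    my $tag = $tag_for{$style};
    return defined $tag ? "<$tag>$body</$tag>" : $body;
}

# Hypothetical helper: returns the converted string and the number of
# passes in which at least one span was replaced.
sub convert_spans {
    my ($s) = @_;
    my $passes = 0;

    # (?:(?!<span).)*? consumes one character at a time and refuses to
    # step onto another "<span", so the captured body can never contain
    # a nested opening tag. Every innermost span matches in the same pass.
    while ( $s =~ s{<span style="([^"]+)">((?:(?!<span).)*?)</span>}
                   { wrap_in_tag( $1, $2 ) }sge )
    {
        $passes++;
    }
    return ( $s, $passes );
}

my $html = 'this <span style="font-weight: bold;">is</span> some '
         . '<span style="font-weight: bold;">test '
         . '<span style="font-style: italic;">text</span> '
         . '<span style="text-decoration: underline;">for</span> bolding</span>, '
         . 'underlining and italicizing text.<br />';

my ( $converted, $passes ) = convert_spans($html);
print "$converted\n";
print "passes: $passes\n";
```

On the sample text this needs two replacing passes: one for the three innermost spans and one for the outer bold span.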
      Thanks again davidrw, this is what I changed it to...

      my $s = do {local $/=undef; <DATA>};
      while( $s =~ s#<span style="([^"]+)">(?!.*?<span)(.*?)</span>#span2tag($1,$2)#sgei ){};
      print $s;

      sub span2tag {
        my ($attr, $s) = @_;
        # Each test is independent, so one span can pick up several tags.
        if ($attr =~ /bold/) {
          $s = "<b>$s</b>";
        }
        if ($attr =~ /italic/) {
          $s = "<i>$s</i>";
        }
        if ($attr =~ /underline/) {
          $s = "<u>$s</u>";
        }
        return $s;
      }
      __DATA__
      this <span style="font-style: italic; font-weight: bold;">is</span> some
      this <span style="font-weight: bold;">is</span> some <span style="font-weight: bold; text-decoration: underline;">test <span style="font-style: italic;">text</span> <span style="text-decoration: underline;">for</span> bolding</span>, underlining and italicizing
      That will take care of nested formats.
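For spans that carry more than one format, a stricter alternative to matching /bold/ and friends against the whole attribute is to split the style into declarations and look each one up. This is only a sketch: style_to_tags and wrap_in_tags are hypothetical names, and the property/value table is an assumption covering just the three formats discussed in this thread.

```perl
use strict;
use warnings;

# Assumed mapping: CSS property => { value => HTML tag }.
my %tag_for = (
    'font-weight'     => { bold      => 'b' },
    'font-style'      => { italic    => 'i' },
    'text-decoration' => { underline => 'u' },
);

# Hypothetical helper: return the tags for a style attribute, in the
# order its declarations appear. Unknown declarations are skipped.
sub style_to_tags {
    my ($style) = @_;
    my @tags;
    for my $decl ( split /;/, $style ) {
        next unless $decl =~ /^\s*([\w-]+)\s*:\s*(.*?)\s*$/;
        my ( $prop, $value ) = ( lc $1, lc $2 );
        my $by_value = $tag_for{$prop}     or next;
        my $tag      = $by_value->{$value} or next;
        push @tags, $tag;
    }
    return @tags;
}

# Hypothetical helper: wrap $body in every tag, with the first
# declaration ending up outermost.
sub wrap_in_tags {
    my ( $style, $body ) = @_;
    for my $tag ( reverse style_to_tags($style) ) {
        $body = "<$tag>$body</$tag>";
    }
    return $body;
}

print wrap_in_tags( 'font-style: italic; font-weight: bold;', 'is' ), "\n";
```

wrap_in_tags could stand in for span2tag in the substitution above. A declaration such as a font-family value that merely contains the word "bold" is skipped here, whereas a substring match would treat it as bold.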
      Thanks a lot davidrw, that sample code seemed to work perfectly. I'll mess with it some more because my example didn't include that there could possibly be more than one format per span tag. Your code helped me out a lot though, thanks again.
Re: Converting span tags.
by wfsp (Abbot) on Mar 27, 2006 at 16:50 UTC
    I would consider using a stack (a last-in, first-out array).

    Every time you find an open span tag, push the attribute onto the stack. When you find a close span tag, pop the attribute off the stack. You'll then know which attribute you are closing.

    If the stack runs out of attributes or if you have some left over you'll also know that the span tags weren't balanced :-)

    Hope that helps.
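The push/pop idea, including the balance check, might be sketched like this using core Perl only. The regex-based split is an assumption that only suits well-formed input with lowercase span tags and double-quoted style attributes, and convert_with_stack is a hypothetical name. Spans with an unrecognised or missing style are passed through unchanged, which is also an assumption.

```perl
use strict;
use warnings;

my %tag_for = (
    'font-weight: bold;'          => 'b',
    'font-style: italic;'         => 'i',
    'text-decoration: underline;' => 'u',
);

# Hypothetical helper: returns the converted text, or dies if the span
# tags are not balanced.
sub convert_with_stack {
    my ($html) = @_;
    my @stack;
    my $out = '';

    # The capturing group makes split return the span tags themselves
    # as well as the text between them.
    for my $piece ( split m{(<span\b[^>]*>|</span>)}, $html ) {
        if ( $piece =~ m{^<span\b} ) {
            my ($style) = $piece =~ /style="([^"]*)"/;
            my $tag = defined $style ? $tag_for{$style} : undef;
            push @stack, $tag;    # undef marks a span we leave untouched
            $out .= defined $tag ? "<$tag>" : $piece;
        }
        elsif ( $piece eq '</span>' ) {
            die "closing span without a matching opening span\n"
                unless @stack;
            my $tag = pop @stack;
            $out .= defined $tag ? "</$tag>" : $piece;
        }
        else {
            $out .= $piece;       # ordinary text and other markup
        }
    }
    die scalar(@stack) . " span(s) never closed\n" if @stack;
    return $out;
}

print convert_with_stack(
    'some <span style="font-weight: bold;">test '
  . '<span style="font-style: italic;">text</span> bolding</span>'
), "\n";
```

An empty stack at a closing tag, or a non-empty stack at the end of the input, signals unbalanced spans.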

    Update:

    I would _certainly_ use an HTML parser

    Update 2:

    Code added.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::TokeParser::Simple;

    # Slurp the sample HTML from the DATA section.
    my $html;
    { local $/; $html = <DATA> }

    my @stack;
    my %lookup = (
      'font-weight: bold;'          => 'b',
      'font-style: italic;'         => 'i',
      'text-decoration: underline;' => 'u',
    );

    my $tp = HTML::TokeParser::Simple->new(\$html);
    while (my $t = $tp->get_token){
      if ($t->is_start_tag('span')){
        # Remember which tag this span became so the close can match it.
        my $attr = $t->get_attr('style');
        my $tag = $lookup{$attr};
        push @stack, $tag;
        print "<$tag>";
        next;
      }
      if ($t->is_end_tag('span')){
        # The most recently opened span is the one being closed.
        my $tag = pop @stack;
        print "</$tag>";
        next;
      }
      # Everything else is printed unchanged.
      print $t->as_is;
    }
    __DATA__
    this <span style="font-weight: bold;">is</span> some <span style="font-weight: bold;">test <span style="font-style: italic;">text</span> <span style="text-decoration: underline;">for</span> bolding</span>, underlining and italicizing text.<br />
    Output

    ---------- Capture Output ----------
    > "c:\perl\bin\perl.exe" _new.pl
    this <b>is</b> some <b>test <i>text</i> <u>for</u> bolding</b>, underlining and italicizing text.<br />
    > Terminated with exit code 0.
