pcadv has asked for the wisdom of the Perl Monks concerning the following question:

Good Evening All
OK, I'm just not getting it.
If there is a solution here, could someone point me to it? I'm trying to change the HTML entity for a double quote, &quot;, back to a literal double quote. But I only need this done inside HTML tags, that is, anything that starts with < and ends with >.
Thank You for any assistance, I seem to be adding to my grey hair.

Re: Matching HTML Tags
by TedPride (Priest) on May 24, 2005 at 01:26 UTC
    I have no idea why you're trying to do this, but the following code should work as specified:
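    use strict;
    use warnings;

    my $str = '&nbsp; <x &nbsp;<y &nbsp;>>';
    my $c = 0;
    my @arr;

    # Swap each innermost tag for a `N` placeholder, working outwards
    while ($str =~ /<[^<>]+>/) {
        $str =~ s/<([^<>]+)>/`$c`/;
        $arr[$c++] = $1;
    }

    # Fix up the saved tag contents only
    s/&nbsp;/"/ig for @arr;

    # Put the tags back, outermost first
    while ($str =~ /`\d+`/) {
        $str =~ s/`(\d+)`/<$arr[$1]>/g;
    }

    print $str;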
    The hard part is dealing with nested tags, but if you start with the inside tags and work outwards, you have no real problems.
      Nice! (but s/&nbsp;/&quot;/ in your code for the present case, of course...)
      chas
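For the simpler case where tags are not nested, the same idea can be sketched in a single substitution with the /e modifier. This is only a sketch: unquote_in_tags is a made-up helper name, and it assumes no tag contains a literal < or > inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: turn &quot; back into " inside <...> tags only.
# Assumes tags are not nested and attribute values contain no < or >.
sub unquote_in_tags {
    my ($html) = @_;
    $html =~ s{(<[^<>]+>)}{
        my $tag = $1;           # copy before the inner match resets $1
        $tag =~ s/&quot;/"/g;   # restore quotes within this tag
        $tag;                   # value of the block is the replacement
    }ge;
    return $html;
}

my $in = '<a href=&quot;x.html&quot;>say &quot;hi&quot;</a>';
print unquote_in_tags($in), "\n";
# prints: <a href="x.html">say &quot;hi&quot;</a>
```

Text outside the tags is left alone, so entities that were meant for display stay encoded.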
Re: Matching HTML Tags
by chas (Priest) on May 24, 2005 at 01:18 UTC
    You might give an example of what you are trying to do. It's not clear to me why &quot; would occur inside HTML tags. Generally that entity appears outside the tags, in the text of the page, so that actual quotes are displayed. Inside tag brackets, literal quotes should be appropriate.
    chas
    (Update: If you are trying to correct an error, a substitution would likely achieve what you want. If you've tried this, you might exhibit your code attempts.)
      Well, I was building a script that did exactly that: changing the double quotes in an HTML document to &quot;
      But as a side effect it also changes the double quotes inside the tags.
      This doesn't work for me. I then tried to just change them back to double quotes, but I couldn't get that to work either. As you might have guessed, I'm very new to this. From the little I've worked on so far I know this can be done, I'm just not sure how.
      Thanks
        Go back to your original problem, it is simpler and saner.

        Don't try to parse HTML with regular expressions. You will almost always find a partial solution, but it will break in more and more esoteric ways.

        Instead do something like this:

        use HTML::TokeParser::Simple;
        # time passes...
        my $parser = HTML::TokeParser::Simple->new(string => $original_html);
        while (my $token = $parser->get_token) {
            $token->rewrite_tag;
            if ($token->is_text) {
                my $t = $token->as_is;
                $t =~ s/"/&quot;/g; # More generically you could URI::Escape
                print $t;
            }
            else {
                print $token->as_is;
            }
        }
        and now your problem is solved. Well, unless you have JavaScript in the HTML, in which case you have another world of grief to deal with, though you can start with:
        use HTML::TokeParser::Simple;
        # time passes...
        my $parser = HTML::TokeParser::Simple->new(string => $original_html);
        while (my $token = $parser->get_token) {
            $token->rewrite_tag;
            if ($token->is_text) {
                my $t = $token->as_is;
                $t =~ s/"/&quot;/g;
                print $t;
            }
            elsif ($token->is_start_tag("script")) {
                # Embedded JavaScript, do not mess up!
                print $token->as_is;
                while ($token = $parser->get_token) {
                    print $token->as_is;
                    if ($token->is_tag("script") and not $token->is_start_tag) {
                        last; # done with JavaScript
                    }
                }
            }
            else {
                print $token->as_is;
            }
        }

        I had to do the same thing a couple of days ago. Here is my HTML::Parser solution (to complement tilly's HTML::TokeParser version above). It encodes any special characters found in the text portion of an HTML doc.

        use HTML::Parser;
        use HTML::Entities;

        my $html = '<div align="center">Your "HTML" page goes here</div>';
        my $enc = '';
        my $p = HTML::Parser->new(
            unbroken_text => 1,
            # everything that is not text is passed through untouched
            default_h => [ sub { $enc .= join('', @_) }, "text" ],
            # text is entity-encoded
            text_h    => [ sub { $enc .= HTML::Entities::encode_entities($_[0]) }, "text" ],
        );
        $p->parse($html);
        $p->eof; # flush any text still buffered by unbroken_text
        print $enc;

        Handling JavaScript is left as an exercise for the reader ;)

        - Cees

Re: Matching HTML Tags
by astroboy (Chaplain) on May 24, 2005 at 01:37 UTC
    There isn't much context in your question, so I may be off base. Are you generating an HTML page, but you want your content to be escaped? If so, have a look at URI::Escape. It escapes all unsafe characters. You'll need to escape your content before generating the HTML, of course, or your tags will be escaped too.
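The "escape the content first, then wrap it in tags" order can be sketched as follows. This is a minimal illustration using only core Perl: escape_text is a hypothetical helper and covers just four characters. Note that URI::Escape produces percent-encoding intended for URLs; for HTML text, HTML::Entities::encode_entities (used elsewhere in this thread) is the module-based way to get entities such as &quot;.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML text.
# Only &, <, > and " are handled here; this is an assumption for the sketch.
sub escape_text {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Escape the content BEFORE adding markup, so the tag's own quotes survive.
my $content = 'She said "hi" & left';
my $page    = '<p class="note">' . escape_text($content) . '</p>';
print "$page\n";
# prints: <p class="note">She said &quot;hi&quot; &amp; left</p>
```

Because the tags are added after escaping, there is never a need to go back and undo &quot; inside them.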