morbus has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to encode a normal quote character (") into its HTML equivalent (&quot;). I know I can easily do s/"/&quot;/g, but I'm looking for something that encodes only the quotes outside of angle brackets (< >), in essence leaving the HTML tags alone. So, the following:
<a href="bob.html">"Hi!"</a>
would turn into:
<a href="bob.html">&quot;Hi!&quot;</a>
Ideas?

Re: Encode quotes except in HTML?
by lhoward (Vicar) on Sep 18, 2000 at 16:49 UTC
    Start with HTML::Parser (or one of its relatives, like HTML::Filter). Writing a decent HTML parser inside a regular expression is impossible.
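    package HTML::Quoter;
    require HTML::Filter;
    @ISA = qw(HTML::Filter);

    my $data = '';

    # HTML::Filter calls output() with the original text of each token.
    sub output {
        my $self = shift;
        my $d = $_[0];
        if ($d =~ /\<\s*\/?\s*(\w+)/) {
            # Looks like a tag: pass it through untouched.
            $data .= $d;
        } else {
            # Anything else: encode the quotes.
            $d =~ s/\"/&quot;/gs;
            $data .= $d;
        }
    }

    my $p = HTML::Quoter->new();
    $p->parse_file("quotes.html");
    print $data;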
Re: Encode quotes except in HTML?
by merlyn (Sage) on Sep 18, 2000 at 23:01 UTC
    If the idea of subclassing gets a bit messy, you can also use the V3 API to HTML::Parser directly, as in:
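    use HTML::Parser;
    HTML::Parser->new(
        # Everything that is not text (tags, comments, ...) is printed as-is.
        default_h => [sub { print shift; }, "text"],
        # Text gets its quotes encoded before printing.
        text_h    => [sub { local $_ = shift; s/\"/&quot;/g; print; }, "text"],
    )->parse(join "", <DATA>);
    __END__
    <a href="bob.html">"Hi!"</a>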

    -- Randal L. Schwartz, Perl hacker

      I've noticed join "",<HANDLE> several times recently and just got curious exactly how much slower that is than the way I'd do it: local($/)= undef; <HANDLE>

      For 32KB of data, I get the join method being about 8 times slower than the "slurp" method. But they both read that 32KB in under 1/100th of a second and the join method can be written as a simple expression while doing so for the slurp method gets tricky. So I can certainly see going for the join method in some cases.

              - tye (but my friends call me "Tye")
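For what it's worth, the slurp can be squeezed into a single expression by wrapping it in a do block, which keeps the change to $/ local to the read. What follows is only a minimal sketch: the sample data, the variable names, and the in-memory filehandle standing in for HANDLE are all assumptions made for illustration, and it checks that both methods return the same string rather than timing them.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for HANDLE.
my $sample = join "", map { "line $_\n" } 1 .. 5;

# Method 1: read all lines in list context, then join them.
open my $fh, '<', \$sample or die "open: $!";
my $joined = join "", <$fh>;
close $fh;

# Method 2: slurp as a single expression. "local $/" undefines the
# input record separator only until the do block exits.
open $fh, '<', \$sample or die "open: $!";
my $slurped = do { local $/; <$fh> };
close $fh;

print $joined eq $slurped ? "same\n" : "different\n";
```

The do block returns the value of its last statement, so the whole thing can be dropped in wherever join "", <HANDLE> would go.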
        And if you want to be as efficient as possible, use:
        read DATA, $str, -s DATA;
        but that's not convenient for things like the above either. It's a laziness/efficiency tradeoff :)
        Updated due to merlyn's comment below (oops).
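A rough sketch of the read approach on an ordinary file, in case it helps. The temporary file and the variable names are made up for the demonstration. One thing to keep in mind with DATA specifically: -s DATA reports the size of the whole script file rather than just the part after __END__, so there it acts as an upper bound and read simply stops at end of file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file, created only for this demonstration.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "line $_\n" for 1 .. 5;
close $out or die "close: $!";

open my $fh, '<', $path or die "open $path: $!";

# -s on a filehandle returns the file size in bytes; read() then
# requests that many bytes in a single call.
my $size = -s $fh;
my $got  = read $fh, my $str, $size;
die "read: $!" unless defined $got;
close $fh;

print "read $got of $size bytes\n";
```

read returns the number of bytes actually read (or undef on error), so comparing it with the requested size is a cheap sanity check.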