PerlMonks  

Re: Dynamically cleaning up HTML fragments

by clinton (Priest)
on Sep 25, 2010 at 19:22 UTC


in reply to Dynamically cleaning up HTML fragments

Glad to see that you have noticed HTML::StripScripts::Parser. I'm the maintainer, but not the guy who did the great work of writing it originally.

It fulfils all of your listed requirements, and is certainly seeing active usage on our production sites.

This code should do what you need (untested):

    use HTML::StripScripts::Parser;

    my $s = HTML::StripScripts::Parser->new({
        Context => 'Flow',

        # Only allow these tags
        BanAllBut => [qw(p a img h3 div em)],

        # Allow src and href
        AllowSrc  => 1,
        AllowHref => 1,

        Rules => {
            # Remove empty p tags
            p => sub { return length $_[1]->{content} },

            # a must have a local href
            # (tag callbacks see attributes under $element->{attr})
            a => {
                href => \&strip_abs_uri,
                tag  => sub { return $_[1]->{attr}{href} ? 1 : 0 },
            },

            # img must have a local src
            img => {
                src => \&strip_abs_uri,
                tag => sub { return $_[1]->{attr}{src} ? 1 : 0 },
            },

            # Allow id and class for all tags
            '*' => {
                id    => 1,
                class => 1,
            },
        },
    });

    # Attribute callback: return the value to keep it, undef to reject it
    sub strip_abs_uri {
        my ( $filter, $tag, $attr_name, $attr_val ) = @_;
        return $attr_val unless $attr_name =~ /href|src/;
        return $attr_val =~ m{://} ? undef : $attr_val;
    }

    print $s->filter_html($html);
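
To see what the attribute callback does on its own, here is a minimal sketch that exercises it without loading the module. It assumes the HTML::StripScripts convention that an attribute callback returns the value to keep the attribute, or undef to drop it. The sample URLs are made up for illustration.

```perl
use strict;
use warnings;

# Attribute callback in the style used by HTML::StripScripts Rules.
# Assumption: returning undef drops the attribute, returning a string keeps it.
sub strip_abs_uri {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    return $attr_val unless $attr_name =~ /^(?:href|src)$/;
    return $attr_val =~ m{://} ? undef : $attr_val;
}

# Hypothetical sample values; the filter object is not needed here.
for my $url ( '/about.html', 'images/logo.png', 'http://example.com/x' ) {
    my $kept = strip_abs_uri( undef, 'a', 'href', $url );
    printf "%-22s => %s\n", $url, defined $kept ? 'kept' : 'dropped';
}
```

Note that a plain check for `://` does not catch protocol-relative URLs such as `//example.com/x`, so a stricter pattern may be wanted if those count as non-local.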

Replies are listed 'Best First'.
Re^2: Dynamically cleaning up HTML fragments
by SilasTheMonk (Chaplain) on Sep 25, 2010 at 20:57 UTC
    Thanks. This module really is working for me. In fact it is the ONLY module that meets my requirements. HTML::Restrict might work, but it uses Moose. I actually want "title" attributes on anchors, and I did not like the handling of stripped code, so I had to subclass and add a few method overrides. But altogether it is pretty easy to use. I am building up some test cases and adding benchmarking. It looks like writing an HTML::Parser subclass might be the only alternative (and a faster one), though it would require some skill. Have you thought of writing a module that takes an HTML::StripScripts spec and "compiles" it to a faster, slimmer direct subclass of HTML::Parser?
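
For the benchmarking side, a rough sketch using the core Benchmark module might look like the following. The two candidate subs are stand-ins invented for illustration, not real filters; the idea is to swap in the actual calls (for example `$s->filter_html($html)` and a hand-written HTML::Parser subclass) and compare them on the same input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample fragment, repeated to give the filters some work.
my $html = '<p class="x">Hello <a href="/local">link</a></p>' x 200;

# Stand-in candidates: replace the bodies with the real filter calls.
my %candidates = (
    strip_all_tags => sub { ( my $copy = $html ) =~ s/<[^>]*>//g; $copy },
    identity       => sub { my $copy = $html; $copy },
);

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese( -1, \%candidates );
```

The numbers only mean something if every candidate is fed the same fragments and produces equivalent output, so it is worth asserting that in the test cases before comparing speed.
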
      Glad it is working for you.

      I really do not recommend writing your own HTML::Parser subclass. If you look at the source of HTML::StripScripts you will see that there is a lot going on there, and with good reason. If you write your own subclass, and you're not willing to spend the time checking every last detail, then you are likely to miss a whole lot of corner cases that HSS already deals with. Parsing HTML is a hard job, and even harder when you're trying to make sense of bad HTML.
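
As a small illustration of the kind of corner case meant here, consider a naive filter that strips script elements with a single regex pass. This is a made-up example, not code from HTML::StripScripts, and it uses a regex rather than an HTML::Parser subclass to keep it short; the `naive_strip` helper and the input are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical naive filter: remove <script>...</script> in one pass.
sub naive_strip {
    my ($html) = @_;
    $html =~ s{<script\b[^>]*>.*?</script>}{}gis;
    return $html;
}

# Hostile input that hides an empty script element inside a broken tag name.
my $evil = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

# Removing the inner elements reassembles a working script element.
print naive_strip($evil), "\n";    # prints <script>alert(1)</script>
```

The filter does exactly what it was written to do, and the output is still unsafe. A hand-rolled parser subclass has to anticipate many such inputs.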

      (Again, I write as the fortunate maintainer, and not as the original author who did all the painstaking work.)

      clint
