Oh learned monks,

I return with another question whose answer eludes me.

I have a corpus of over 22,000,000 html fragments where each fragment is a post in a discussion forum. I desire to create a web page that aggregates all of the posts of one user into a single simple html page to be served from a static web site. The logic for doing this is complete and overall works well. What doesn't work well is that some posts have open tags that are never closed in the original fragment. For the most part, the resulting change in colors, fonts, font styles and so on is just noise. In some cases, though, the unclosed tags render the page unreadable from that point on. As such, the noise becomes a problem.

I was able to use HTML::Tidy and quickly add code that detects problems with the parse and messages methods, calls clean to resolve them, and then grabs the comments in the body for further use. Functionally, this works perfectly. The problem is that the run time increased from 7 minutes to 78 minutes. Given that updates usually don't occur more than once a day, I can live with this if I must.

My question to you is: must I? I did a small proof of concept using a regex to detect the problem with the font tag, and the impact on run time was minimal. I can extend that manual, explicit pattern, but I would prefer not to. I'm hoping that someone has already developed a wise way to address this problem in a more performant manner.
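For readers wondering what such a proof of concept might look like, here is a minimal sketch. It is not the original code; the function name font_tags_balanced is made up for illustration, and it only compares the number of opening and closing font tags.

```perl
use strict;
use warnings;

# Hypothetical helper: true when <font> opens and closes balance out.
sub font_tags_balanced {
    my ($html) = @_;

    # count opening tags: "<font" followed by whitespace or ">"
    my $open = () = $html =~ m{<font(?=[\s>])}gi;

    # count closing tags: "</font", optional whitespace, then ">"
    my $close = () = $html =~ m{</font\s*>}gi;

    return $open == $close;
}

print font_tags_balanced('<font color="red">hi</font>') ? "balanced\n" : "unbalanced\n";
print font_tags_balanced('<font color="red">hi')        ? "balanced\n" : "unbalanced\n";
```

A check like this is cheap because it never builds a parse tree, which is also why it has to be extended tag by tag.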

Thank you in advance for your consideration and advice!

lbe

Code example using HTML::Tidy

use HTML::Tidy;

Inside a Moo Object:

has tidy => (
    is      => 'rw',
    lazy    => 1,
    builder => '_build_tidy',
    isa     => InstanceOf ["HTML::Tidy"],    # InstanceOf comes from Types::Standard
);

sub _build_tidy {
    my $self = shift;
    my $tidy = HTML::Tidy->new( 
        {
            #doctype      => 'omit',
            output_xhtml => 1,
            tidy_mark    => 0,
        }
    );
    $tidy->ignore( 
        text => 'missing <!DOCTYPE> declaration',
        text => 'inserting implicit <body>',
        text => 'inserting missing \'title\' element',
        text => 'missing </font>',
        text => '<blink> is not approved by W3C',
        text => 'plain text isn\'t allowed in <head> elements',
        text => '<head> previously mentioned',
    );
    return( $tidy );
 }


sub _clean_html {
    my ( $self, $html ) = @_;

    $self->tidy->clear_messages(); 
    $self->tidy->parse( "1", $html );

    if ( $self->tidy->messages ) {
        $html = $self->tidy->clean( $html );
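        # keep only what sits between the body tags; if the pattern
        # does not match, leave the cleaned document as it is
        if ( $html =~ m{<body>\n(.*)\n</body>}msi ) {
            $html = $1;
        }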
    }

    return $html;
}

I call $object->_clean_html( $html ), where $html contains the html fragment in question, and it returns the cleansed html.

Update 1: 2018/10/28

As I looked into modifying the code as I described in my earlier response, I realized that I would need to change several modules. So I turned back to my original approach. Using HTML::Tidy on every fragment is a non-starter for the reasons mentioned in my original post. I returned to CPAN to do some more research. I remembered a past project where I had used XML::LibXML to parse html and knew it could be configured to throw an error if the html was not well formed. I quickly got this working, but it was almost as slow as HTML::Tidy. I started looking for other modules that might be faster.

I found XML::Fast and wrote a small proof of concept with it. It worked; however I couldn't find a way to keep it from emitting verbose error messages to the console. I would need to use something like Capture::Tiny to keep that from happening. The overhead of forking every call would slow things down too much.
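One alternative to a capture module, sketched here under the assumption that the messages are written to STDERR, is to point STDERR at the null device for the duration of the call. The helper name quietly is hypothetical, and this has not been tried against XML::Fast itself.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: run a code ref with STDERR sent to the null device.
sub quietly {
    my ($code) = @_;

    # duplicate the current STDERR so it can be restored afterwards
    open( my $saved, '>&', \*STDERR ) or die "cannot dup STDERR: $!";
    open( STDERR, '>', File::Spec->devnull ) or die "cannot silence STDERR: $!";

    # eval so STDERR is restored even if the code dies
    my @result = eval { $code->() };
    my $error  = $@;

    open( STDERR, '>&', $saved ) or die "cannot restore STDERR: $!";
    die $error if $error;

    return wantarray ? @result : $result[0];
}

# the warning is swallowed, the return value is passed through
my $value = quietly( sub { warn "noisy parser message\n"; return 42 } );
print "$value\n";
```

The code ref is always called in list context here, so a parser call that behaves differently in scalar context would need a small adjustment.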

I moved on to XML::Bare and thought I had a winner. It was very fast and seemed to work. It was not until I threw a full set of questionable examples at it that I realized it was tolerating some unclosed tags, like <font>.

I decided to take an all too TIMTOWTDI approach and use regexes. Please hold all flames! Yes, I know parsing html with regex is an accident waiting to happen. However, I also know that from time to time I have had to do it for one reason or another. In most of those cases, I needed to extract something from an html format that was well known and consistent, unlike my current corpus. But after a bit of effort, I have something that works for every test case that I have thrown at it and is roughly 150 times faster than calling HTML::Tidy clean on every fragment. My solution, slightly reworked for here, is:

sub _is_html_clean {
    # state variable (needs "use feature 'state'", Perl 5.10+) holding
    # a hash of the tags that are allowed to be unbalanced;
    # it persists across calls
    state $is_unbalanced = {
        area       => 1,
        base       => 1,
        basefont   => 1,
        bgsound    => 1,
        br         => 1,
        col        => 1,
        colgroup   => 1,
        embed      => 1,
        frame      => 1,
        hr         => 1,
        img        => 1,
        input      => 1,
        isindex    => 1,
        li         => 1,
        link       => 1,
        marquee    => 1,
        meta       => 1,
        p          => 1,
        '!doctype' => 1,
    };

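    # work on a copy so the caller's fragment is left untouched
    my $html = $_[0];

    # remove self closing tags (every one, not just the last per line)
    $html =~ s{<[^<>]+/>}{}g;

    # remove commented sections
    $html =~ s/<!--.*?-->//sg;

    # load tag names in array; a tag name ends at whitespace or ">"
    my (@a) = ( $html =~ m/<(\S+?)[\s>]/g );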

    # process each tag counting the open and closes and
    # then increment or decrement a counter for that tag
    my %h;
    foreach (@a) {
        if (m[^/]) {     # closing tag
            substr( $_, 0, 1 ) = "";     # remove the /
            $h{$_}--;
        }
        else {
            $h{$_}++;
        }
    }

    foreach ( keys %h ) {
        if (m/[A-Z]/) {
            # combine keys in case insensitive manner
            $h{ lc($_) } += $h{$_};
            delete $h{$_};
        }
    }

    foreach ( sort keys %h ) {
        # ignore if tag is in the is_unbalanced hash
        next if ( $is_unbalanced->{$_} );
        if ( $h{$_} != 0 ) {
            return 0;     # return as soon as a non-paired tag is found
        }
    }

    return 1;     # return if all is good
}

I call _is_html_clean for each fragment. If it fails, I call HTML::Tidy clean.
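The two-step flow can be sketched as follows. The wrapper name clean_if_needed and the two stand-in code refs are made up so the example runs on its own; in real use they would be replaced by _is_html_clean and the HTML::Tidy-based _clean_html shown above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: $is_clean and $cleaner are code refs standing in
# for the fast regex check and the slow HTML::Tidy clean.
sub clean_if_needed {
    my ( $html, $is_clean, $cleaner ) = @_;

    # check a copy, in case the checker edits its argument in place
    my $copy = $html;

    # fast path: balanced fragments are returned untouched
    return $html if $is_clean->($copy);

    # slow path: only unbalanced fragments pay for the full clean
    return $cleaner->($html);
}

# Stand-ins for illustration only; they are not the real checks.
my $is_clean = sub { $_[0] !~ /<font\b/i || $_[0] =~ m{</font\s*>}i };
my $cleaner  = sub { $_[0] . '</font>' };

print clean_if_needed( '<font color="red">oops', $is_clean, $cleaner ), "\n";
```

The speed-up comes entirely from how rarely the slow path is taken, so it depends on the share of broken fragments in the corpus.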

This approach is effective for me for my current project. If I needed this functionality in a full production solution, I would likely write a state engine in a systems language and consume it via XS or FFI. It would be great if that functionality were added to tidyp; however, given the intent of tidyp, it probably will not be.
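To give a rough idea of what a stateful check buys over counting, here is a small stack-based sketch in plain Perl rather than a systems language. The name is_properly_nested and the skip list (a subset of the tags ignored above) are assumptions for illustration. Unlike the counting approach, it also rejects tags that are closed in the wrong order.

```perl
use strict;
use warnings;

# Tags that are skipped entirely; illustrative subset, not a complete list.
my %skip = map { $_ => 1 } qw(area base br col hr img input li link meta p);

# Hypothetical checker: true when every other tag closes in nesting order.
sub is_properly_nested {
    my ($html) = @_;
    my @stack;

    # drop comments so markup inside them is not counted
    $html =~ s/<!--.*?-->//sg;

    # walk the tags in document order: optional "/", name, rest, optional "/"
    while ( $html =~ m{<(/?)([A-Za-z][A-Za-z0-9]*)\b[^<>]*?(/?)>}g ) {
        my ( $closing, $name, $self_closing ) = ( $1, lc $2, $3 );

        next if $skip{$name} || $self_closing;

        if ($closing) {
            # a close must match the most recent unclosed open
            return 0 unless @stack && $stack[-1] eq $name;
            pop @stack;
        }
        else {
            push @stack, $name;
        }
    }

    return @stack ? 0 : 1;    # leftover opens mean something was not closed
}

print is_properly_nested('<b><i>x</i></b>') ? "ok\n" : "broken\n";
print is_properly_nested('<b><i>x</b></i>') ? "ok\n" : "broken\n";
```

The second call fails here even though its open and close counts match, which is the case a pure counter cannot see.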

I hope that someone will find this update beneficial.

Cheers, lbe

In reply to Cleaning HTML Fragments with open tags by learnedbyerror
