learnedbyerror has asked for the wisdom of the Perl Monks concerning the following question:

Oh learned monks,

I return with another question whose answer eludes me.

I have a corpus of over 22,000,000 HTML fragments, where each fragment is a post in a discussion forum. I want to create a web page that aggregates all of the posts of one user into a single, simple HTML page to be served from a static web site. The logic for doing this is complete and works well overall. What doesn't work well is that some number of posts have open tags that are never closed in the original fragment. For the most part, the resulting change in colors, fonts, font style, and so on is just noise. There are some cases, though, where a fragment contains open tags that render the page unreadable from that point on. There, the noise becomes a problem.

I was able to use HTML::Tidy and quickly add code to detect problems with the parse and messages methods, then call clean to resolve the problems, and then grab the contents of the body for further use. Functionally, this works perfectly. The problem is that the run time increased from 7 minutes to 78 minutes. Given that updates usually don't occur more than once a day, I can live with this if I must.

My question to you is: must I? I did a small proof of concept using a regex to detect the problem with the font tag, and the impact on run time was minimal. I could extend that manual, explicit pattern, but I would prefer not to. I'm hoping that someone has already developed a wise way to address this problem in a more performant manner.

Thank you in advance for your consideration and advice!

lbe

Code example using HTML::Tidy

Inside a Moo object:

    use HTML::Tidy;

    has tidy => (
        is      => 'rw',
        lazy    => 1,
        builder => '_build_tidy',
        isa     => InstanceOf ["HTML::Tidy"],
    );

    sub _build_tidy {
        my $self = shift;
        my $tidy = HTML::Tidy->new(
            {
                #doctype     => 'omit',
                output_xhtml => 1,
                tidy_mark    => 0,
            }
        );
        $tidy->ignore(
            text => 'missing <!DOCTYPE> declaration',
            text => 'inserting implicit <body>',
            text => 'inserting missing \'title\' element',
            text => 'missing </font>',
            text => '<blink> is not approved by W3C',
            text => 'plain text isn\'t allowed in <head> elements',
            text => '<head> previously mentioned',
        );
        return $tidy;
    }

    sub _clean_html {
        my ( $self, $html ) = @_;
        $self->tidy->clear_messages();
        $self->tidy->parse( "1", $html );
        if ( $self->tidy->messages ) {
            $html = $self->tidy->clean($html);
            $html =~ m/<body>\n(.*)\n<\/body>/msgix;
            $html = $1;
        }
        return $html;
    }

I call $object->_clean_html( $html ), where $html contains the HTML fragment in question; it returns the cleansed HTML.

Update 1: 2018/10/28

As I looked into modifying the code as I described in my earlier response, I realized that I would need to change several modules, so I turned back to my original approach. Using HTML::Tidy on every fragment is a non-starter for the reasons mentioned in my original post. I returned to CPAN to do some more research. I remembered a past project where I had used XML::LibXML to parse HTML, and I knew it could be configured to throw an error if the HTML was not well formed. I quickly got this working, but it was almost as slow as HTML::Tidy, so I started looking for other modules that might be faster.
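The well-formedness check amounted to something like the sketch below (reconstructed, not my exact code; the <root> wrapper is needed because a bare fragment has no single root element):

```perl
use strict;
use warnings;
use XML::LibXML;

# Sketch: check a fragment for balanced tags by handing it to a strict
# XML parse; load_xml dies on malformed input when recover is off.
# Note: undefined HTML entities (e.g. &nbsp;) will also fail an XML
# parse, so this is stricter than a pure tag-balance check.
sub _is_well_formed {
    my ($html) = @_;
    my $ok = eval {
        XML::LibXML->load_xml( string => "<root>$html</root>", recover => 0 );
        1;
    };
    return $ok ? 1 : 0;
}
```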

I found XML::Fast and wrote a small proof of concept with it. It worked; however, I couldn't find a way to keep it from emitting verbose error messages to the console. I would need to use something like Capture::Tiny to keep that from happening, and the overhead of forking on every call would slow things down too much.
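For completeness, the console noise can also be silenced in-process, with no fork, by duping STDERR and redirecting it around the noisy call; a core-Perl sketch (the quietly name is mine, not from any module):

```perl
use strict;
use warnings;
use File::Spec;

# Run a block with STDERR pointed at the null device, then restore it.
# This is an in-process file-handle redirect, so no fork is involved.
sub quietly {
    my ($code) = @_;
    open my $saved, '>&', \*STDERR or die "can't dup STDERR: $!";
    open STDERR, '>', File::Spec->devnull() or die "can't redirect STDERR: $!";
    my @result = eval { $code->() };
    my $err = $@;
    open STDERR, '>&', $saved or die "can't restore STDERR: $!";
    die $err if $err;
    return wantarray ? @result : $result[0];
}

my $answer = quietly( sub { warn "you will not see this\n"; return 42 } );
```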

I moved on to XML::Bare and thought I had a winner. It was very fast and seemed to work. It was not until I threw a full set of questionable examples at it that I realized it was tolerating some unclosed tags, like <font>.

I decided to take an all too TIMTOWTDI approach and use regexes. Please hold all flames! Yes, I know parsing HTML with regexes is an accident waiting to happen. However, I also know that from time to time I have had to do it for one reason or another. In most of those cases, I needed to extract something from HTML whose format was well known and consistent, unlike my current corpus. But after a bit of effort, I have something that works for every test case I have thrown at it and is roughly 150 times faster than calling HTML::Tidy's clean on every fragment. My solution, slightly reworked for here, is:

    sub _is_html_clean {

        # state variable holding a hash of tags that are legitimately
        # unbalanced; it persists across calls
        state $is_unbalanced = {
            area    => 1, base => 1, basefont => 1, bgsound => 1,
            br      => 1, col  => 1, colgroup => 1, embed   => 1,
            frame   => 1, hr   => 1, img      => 1, input   => 1,
            isindex => 1, li   => 1, link     => 1, marquee => 1,
            meta    => 1, p    => 1, '!doctype' => 1,
        };

        # remove self-closing tags
        $_[0] =~ s/(.*)<.+?\/>/$1/g;

        # remove commented sections
        $_[0] =~ s/<!--.+?-->//msg;

        # load tag names into an array
        my (@a) = ( $_[0] =~ m/<(\S+?)[ >]/msg );

        # process each tag, incrementing or decrementing a counter
        # per tag name as opens and closes are seen
        my %h;
        foreach (@a) {
            if (m[^/]) {                    # closing tag
                substr( $_, 0, 1 ) = "";    # remove the /
                $h{$_}--;
            }
            else {
                $h{$_}++;
            }
        }

        # combine keys in a case-insensitive manner
        foreach ( keys %h ) {
            if (m/[A-Z]/) {
                $h{ lc($_) } += $h{$_};
                delete $h{$_};
            }
        }

        foreach ( sort keys %h ) {
            next if $is_unbalanced->{$_};   # ignore tags in the is_unbalanced hash
            if ( $h{$_} != 0 ) {
                return 0;                   # return as soon as an unpaired tag is found
            }
        }

        return 1;                           # all tags balanced
    }

I call _is_html_clean on each fragment; only if it fails do I call HTML::Tidy's clean.
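The driving loop is then a fast-path/slow-path pattern; in sketch form, with toy stand-ins for the checker and cleaner so the snippet runs on its own:

```perl
use strict;
use warnings;

# Fast-path/slow-path sketch: cheap check first, expensive clean only on
# failure. The checker and cleaner here are toy stand-ins, not the real
# _is_html_clean / HTML::Tidy clean.
my $is_clean = sub { $_[0] !~ /<font>(?!.*<\/font>)/s };    # toy check
my $cleaner  = sub { $_[0] . '</font>' };                   # toy fix

my @fragments = ( '<b>fine</b>', '<font>broken' );
for my $frag (@fragments) {
    $frag = $cleaner->($frag) unless $is_clean->($frag);
}
```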

This approach is effective for my current project. If I needed this functionality in a full production solution, I would likely write a state machine in a systems language and consume it via XS or FFI. It would be great if that functionality were added to tidyp; however, given the intent of tidyp, it probably won't be.

I hope that someone will find this update beneficial.

Cheers, lbe

Replies are listed 'Best First'.
Re: Cleaning HTML Fragments with open tags
by haukex (Archbishop) on Oct 23, 2018 at 18:32 UTC

    Do you rerun the code on the 22e6 fragments on every run? Why not just run the code on those that have changed and store the already cleaned ones in a cache?

      haukex, For this operation, yes, I run it on each fragment.

      There are a number of routines that run against each fragment, and in most of them I need the data in its original form. One option I have considered is to do as you propose: pre-process all of the raw fragments and store them in a different database. I am using LMDB as the database. I wrote several tests to compare processing time and found that the routines are usually IO bound. As a result, I have minimized my IO reads and try to use the raw data multiple times while it is in memory, which is how I ended up where I described.

      Your note did make me think of an alternative, though, that would still keep reads close to the minimum, only slightly more than the current approach. A little additional information first: my ingestion process reads the raw HTML files, parses them, and then writes to several databases. The first contains a compressed, serialized object (Sereal) of the whole HTML page as well as its constituents, already parsed out. The second contains a compressed, serialized version of each fragment; I calculate an MD5 sum of the compressed, serialized object and use this as the key. The third is an index database configured to allow duplicate keys, where the keys are users and the values are the MD5 sums of their fragments. The index data is very small, and the processing time to access it is minimal.

      The idea you triggered is to check each fragment and, if it has errors, use HTML::Tidy to clean it, then save a new record containing the clean version to the fragment database. Then create a fourth database, similar to the third, that allows duplicate keys, with the user as the key and the MD5 sum of the clean version of the fragment as the value. In the vast majority of cases, that will be the raw fragment; for the remainder, it is the corrected fragment.
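      In outline, with plain hashes standing in for the LMDB databases and toy stand-ins for the checker and cleaner:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Outline of the proposed flow; plain hashes stand in for the LMDB
# databases, and the checker/cleaner below are toys for demonstration.
my %fragment_db;    # md5 => fragment (raw or corrected)
my %clean_index;    # user => [ md5, ... ] (the proposed fourth database)

sub ingest_fragment {
    my ( $user, $raw, $is_clean, $cleaner ) = @_;
    my $frag = $is_clean->($raw) ? $raw : $cleaner->($raw);
    my $key  = md5_hex($frag);
    $fragment_db{$key} //= $frag;           # corrected copy stored once
    push @{ $clean_index{$user} }, $key;    # user -> clean-fragment key
    return $key;
}

# toy stand-ins for demonstration only
my $is_clean = sub { $_[0] !~ /<font>(?!.*<\/font>)/s };
my $cleaner  = sub { $_[0] . '</font>' };
ingest_fragment( 'lbe', '<font>broken', $is_clean, $cleaner );
```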

      This approach has a small impact on the database size and will consume a relatively small amount of additional RAM, about 200MB. It cuts out the repeated cleaning overhead, and the changes to the code should be pretty small!

      I'll give it a try and let you know what I find.

      Overall, I am still interested in learning whether there are more performant ways of using my first approach. While I expected to incur some overhead, what I saw was much larger than expected.

      Thanks! lbe