Comparing HTML snippets

by szabgab (Priest)
on Feb 28, 2014 at 07:14 UTC ( [id://1076475] : perlquestion )

szabgab has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to test whether some HTML files contain the expected markup. My idea is to supply an expected HTML snippet and then use something (e.g. HTML::TreeBuilder) to check whether the tree built from the expected snippet matches some subtree of the received HTML. The comparison should disregard whitespace where it is irrelevant, and it should disregard the order of attributes inside a tag. So
<li>text more <a href="..." alt="name">anchor</a></li>
would be accepted even if it was written
<li> text more <a alt="name" href="...">anchor</a> </li>
Is this a good idea? What problems will arise? Is there a module already doing something like this?
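
A minimal sketch of the idea described above, not a finished solution. It assumes HTML::TreeBuilder is installed; `canon` and `contains_snippet` are hypothetical helper names invented for this illustration. Each element is reduced to a canonical string (tag, attributes sorted by name, children in order, text with whitespace collapsed and trimmed), and the received tree is searched for an element with the same canonical string as the snippet.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: canonical string for a node. Attributes are sorted
# by name, text is whitespace-collapsed and trimmed, empty text is dropped.
sub canon {
    my ($node) = @_;
    if (!ref $node) {                          # plain text child
        (my $text = $node) =~ s/\s+/ /g;
        $text =~ s/^ | $//g;
        return length($text) ? "text($text)" : ();
    }
    my %attr  = $node->all_external_attr;      # skips internal "_..." keys
    my $attrs = join ',', map { "$_=$attr{$_}" } sort keys %attr;
    my $kids  = join ';', map { canon($_) } $node->content_list;
    return sprintf '%s[%s]{%s}', $node->tag, $attrs, $kids;
}

# Hypothetical helper: true if some subtree of $html matches $snippet.
sub contains_snippet {
    my ($html, $snippet) = @_;
    my $got  = HTML::TreeBuilder->new_from_content($html);
    my $want = HTML::TreeBuilder->new_from_content($snippet);

    # The parser wraps a fragment in implied elements (html, body, ...);
    # take the first element that the snippet itself supplied.
    my $root  = $want->look_down(sub { !$_[0]->implicit });
    my $found = 0;
    if ($root) {
        my $target = canon($root);
        $found = 1 if $got->look_down(
            _tag => $root->tag,
            sub { canon($_[0]) eq $target },
        );
    }
    $_->delete for $got, $want;                # free the trees
    return $found;
}
```

Usage would look like `contains_snippet($page, '<li>text more <a href="..." alt="name">anchor</a></li>')`. Note that trimming text this way also treats `more <a` and `more<a` as equal, which is looser than what a browser renders, and a mismatch only yields "not found" without saying where the trees diverge.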

Replies are listed 'Best First'.
Re: Comparing HTML snippets
by Tux (Canon) on Feb 28, 2014 at 07:24 UTC

    Your approach looks great. To get all tags in the entered order without content:

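    use 5.16.2;
    use warnings;
    use HTML::TreeBuilder;

    # Slurp STDIN (or the files named on the command line) and parse it.
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content (do { local $/; <> });

    # Every element in document order, reduced to its tag name.
    my @tags = map { $_->tag } $tree->look_down (_tag => qr{.});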

    Enjoy, Have FUN! H.Merijn
Re: Comparing HTML snippets
by chrestomanci (Priest) on Mar 02, 2014 at 17:08 UTC

    HTML::TreeBuilder creates a very complex doubly linked tree. Very useful if you want to parse HTML, and walk between nodes in the parse tree, but less helpful if you are trying to find differences between two bits of HTML.

    If you have two bits of HTML that are identical except for whitespace (capitalisation, etc.), then HTML::TreeBuilder should return identical trees for them. But if there is a slight difference, I think you will find it hard to locate by comparing the trees. For example, if you use is_deeply from Test::More, it will find an enormous number of differences and then show you the first one it finds, which is unlikely to be the one that matters for your problem.

    Perhaps a better solution would be to not use perl, and instead pipe your HTML samples through a tool like html-tidy and then do a text diff of the output.

Re: Comparing HTML snippets
by KevinZwack (Chaplain) on Mar 02, 2014 at 14:33 UTC

    You might want to take a look at XML::SemanticDiff

Re: Comparing HTML snippets
by Anonymous Monk on Mar 04, 2014 at 15:04 UTC
    Or just process the file to remove all whitespace (blanks, newlines and so on), then see what something like the diff command-line tool (or its Windows equivalent) will do for you.
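
A rough sketch of that whitespace-stripping approach, using core Perl only. `squash` is a hypothetical helper name, and the sample strings are made up from the snippet in the question. It shows both what the approach handles (reflowed markup) and what it does not (reordered attributes).

```perl
use strict;
use warnings;

# Hypothetical helper: remove every whitespace character so that
# formatting differences cannot affect a textual comparison.
sub squash {
    my ($html) = @_;
    (my $flat = $html) =~ s/\s+//g;
    return $flat;
}

my $expected  = '<li>text more <a href="..." alt="name">anchor</a></li>';
my $reflowed  = "<li>\n  text more\n  <a href=\"...\" alt=\"name\">anchor</a>\n</li>";
my $reordered = '<li>text more <a alt="name" href="...">anchor</a></li>';

# Only line breaks and indentation differ, so these compare equal.
print squash($expected) eq squash($reflowed)  ? "same\n" : "different\n";

# Attribute order differs, so a plain text comparison still reports a change.
print squash($expected) eq squash($reordered) ? "same\n" : "different\n";
```

This prints `same` and then `different`. Removing all whitespace also glues words together ("text more" becomes "textmore"), so real content changes that only involve spacing go unnoticed, and the squashed text sits on a single line, which makes line-based diff output hard to read.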