Greebo has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I hope the title is OK; I'm not sure of the correct terms for this problem.
So I have a piece of code that is *supposed* to find links containing the phrase "email." and (ultimately with a separate s///) remove them.
Code: m/<a.+?href=['"].+?email\..*?\/a>/si
Unfortunately, that code matches everything from the first "<a" tag through to the "</a>" after "email.".
I'm still relatively new to Perl and indeed to regexes. I've tried messing around with lookbehinds, but I can't get my head round them properly, though I suspect they're key to what I need. Any ideas?
This is the data I'm working with:
<a href="detail.jsp?key=7147&rc=d_20071128&p=2&pv=1">Next Page</a><br/> <a class="foot" href="email.jsp?key=7147">E-mail Story</a>

Replies are listed 'Best First'.
Re: Regex to match first html tag previous to text
by erroneousBollock (Curate) on Nov 29, 2007 at 02:18 UTC
    I'd use HTML::TreeBuilder::XPath to find the nodes in question, then alter the values in the "DOM" and use methods from HTML::Tree to write the document back out.

    Regular expressions are a very fragile solution to the "how do I parse HTML" problem.

    -David

      Thanks for the reply. As so often happens whenever I post asking for help with something, I stumbled upon a regex solution to this problem by accident. But given that most people say regex is not the way to go here, I will definitely look into this and the various other options people suggested.
Re: Regex to match first html tag previous to text
by Your Mother (Archbishop) on Nov 29, 2007 at 02:39 UTC

    my $test = q{
    <a href="detail.jsp?key=7147&rc=d_20071128&p=2&pv=1">Next Page</a><br/>
    <a class="foot" href="email.jsp?key=7147">E-mail Story</a>
    };
    print "BEFORE: $test\n\n";
    $test =~ s{<a[^>]+>.*?e-?mail[^<]*</a>}{}gim;
    print "AFTER: $test\n";

    That strips them outright, but you probably should not use it. It's naive and it will break (like almost all regex-based markup parsing). What you want is either what erroneousBollock said or, for XHTML, XML::LibXML, which is what I use. There are many other options in the X(HT)ML space. If you have dirtier HTML, the XML parsers won't work. You can also do it in a stream with HTML::TokeParser.
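One way that substitution can break, as a minimal sketch using only core Perl: without /s the `.` cannot cross a newline, so the pattern behaves while each link sits on its own line. If both links share a line, as in the data from the question, the match starts at the first `<a` and the lazy `.*?` grows until it reaches "email", taking the "Next Page" link with it. The variable names here are illustrative, not from the thread.

```perl
use strict;
use warnings;

# Sample markup from the question, with both links on ONE line.
my $html = q{<a href="detail.jsp?key=7147&rc=d_20071128&p=2&pv=1">Next Page</a><br/> }
         . q{<a class="foot" href="email.jsp?key=7147">E-mail Story</a>};

# The substitution from the reply above, unchanged.
(my $stripped = $html) =~ s{<a[^>]+>.*?e-?mail[^<]*</a>}{}gim;

# Lazy quantifiers do not move the START of a match: the regex engine still
# begins at the leftmost "<a", then extends .*? until "email" is reachable.
my $kept_next_page = $stripped =~ /Next Page/ ? 1 : 0;

print "Next Page link kept: ", ($kept_next_page ? "yes" : "no"), "\n";
print "Remaining text: [$stripped]\n";
```

With the links on separate lines the same substitution removes only the e-mail link, so the behaviour depends on how the HTML happens to be laid out.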

Re: Regex to match first html tag previous to text
by tachyon-II (Chaplain) on Nov 29, 2007 at 03:08 UTC

    The most robust solution is to use HTML::Parser. Regexes are not recommended for parsing HTML although they can be made to work and it is probably a good learning experience.

    The . metacharacter should generally be your last choice in a regex. To stay within a tag I would suggest something like this (untested):

    s!<a[^>]+href=['"]email[^>]+>[^<]+</a\>!DELETED!gi;

    # which becomes this to deal with whitespace issues
    s!<\s*a[^>]+href\s*=\s*['"]\s*email[^>]+>[^<]+<\s*/a\s*>!DELETED!gi;

    The key thing we are doing is using negated character classes ([^>] and [^<]) on the > and < parts of the tags to ensure we match everything but still remain reliably in the tag. The endless \s* are required to deal with the relaxed way HTML deals with whitespace.
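To see the difference on the data from the question, here is a minimal core-Perl sketch that runs the original pattern and the whitespace-tolerant pattern above against the same one-line input. The variable names are illustrative; the patterns are copied from the thread, with a capture group added around the first so the matched span can be inspected.

```perl
use strict;
use warnings;

# Sample markup from the question, both links on one line.
my $html = q{<a href="detail.jsp?key=7147&rc=d_20071128&p=2&pv=1">Next Page</a><br/> }
         . q{<a class="foot" href="email.jsp?key=7147">E-mail Story</a>};

# Pattern from the question: .+? may run past ">" into the next tag, so the
# match begins at the first "<a" and spans both links.
my ($dot_span) = $html =~ m/(<a.+?href=['"].+?email\..*?\/a>)/si;

# Negated-class pattern from the reply: [^>]+ cannot leave the opening tag
# and [^<]+ cannot leave the link text, so only the e-mail link matches.
(my $cleaned = $html)
    =~ s!<\s*a[^>]+href\s*=\s*['"]\s*email[^>]+>[^<]+<\s*/a\s*>!DELETED!gi;

print "Dot pattern matched the whole input: ",
      ($dot_span eq $html ? "yes" : "no"), "\n";
print "After negated-class substitution:\n$cleaned\n";
```

Note that [^<]+ assumes the link text contains no nested tags; a link such as `<a href="email.jsp"><b>E-mail</b></a>` would not match, which is one reason the parser-based suggestions in this thread are more robust.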

Re: Regex to match first html tag previous to text
by aquarium (Curate) on Nov 29, 2007 at 03:01 UTC
    incidentally..does anybody know if any LWP or similar implement DOM? I have a hunch that DOM parsing is cleaner than X(HT)ML parsing....that's with the latter sometimes not being well formed etc....whilst DOM will always give you access to A tags.
    the hardest line to type correctly is: stty erase ^H
      does anybody know if any LWP or similar implement DOM
      LWP::UserAgent does not provide DOM-level access.

      WWW::Mechanize doesn't either, but does parse the HTML for you in order to provide methods like links(), which incidentally, does what you want.

      I have a hunch that DOM parsing is cleaner than X(HT)ML parsing
      "DOM" is not a manner of parsing, but a manner of access. For methods from the DOM to be able to access data from a tree of nodes, some "parser" code still has to build that tree.

      It's certainly cleaner to access data using DOM (or DOM-like) methods, or selector interfaces like XPath or XQuery.

      HTML::TreeBuilder::XPath builds an HTML::Tree internally and then provides XPath-like access to that tree.

      (HTML parsing) sometimes not being well formed etc
      If you're talking about the robustness of parsing HTML, there are many libraries that parse HTML properly even when given invalid input. It's quite orthogonal to how you access the data once you've parsed the document.

      -David

        parsing vs access methods does get blurry...anyway, in the end we're interested in getting to point B and not that interested in the trip itself...whether parsing or using an access method.
        to this point...i think (possibly) best for this problem would be find_link() method of WWW::Mechanize. not as tedious as XPath or HTML::Tree etc.
        the hardest line to type correctly is: stty erase ^H
Re: Regex to match first html tag previous to text
by wfsp (Abbot) on Nov 29, 2007 at 13:49 UTC
    A regular task of mine is rewriting HTML and my choice of poison is HTML::TokeParser::Simple. While it may look verbose it is a good trade-off, imo, for having an intuitive interface. I find it easy to write and easy to read a week later. :-) ymmv.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new(*DATA);

    my ($html, $in_email_link);
    while (my $t = $p->get_token){
      $in_email_link++, next
        if  $t->is_start_tag(q{a})
        and $t->get_attr(q{href})
        and $t->get_attr(q{href}) =~ m|email\.|;

      $in_email_link--, next
        if $in_email_link and $t->is_end_tag(q{a});

      next if $in_email_link;
      $html .= $t->as_is;
    }
    print qq{$html};

    __DATA__
    <p>one</p>
    <a href="detail.jsp?key=7147&rc=d_20071128&p=2&pv=1">Next Page</a><br/>
    <p>two</p>
    <a class="foot" href="email.jsp?key=7147">E-mail Story</a>
    <p>three</p>
    <a href="detail.jsp?key=7147&rc=d_20071128&p=2&pv=1">Next Page</a><br/>
    output:
    <p>one</p>
    <a href="detail.jsp?key=7147&rc=d_20071128&p=2&pv=1">Next Page</a><br/>
    <p>two</p>

    <p>three</p>
    <a href="detail.jsp?key=7147&rc=d_20071128&p=2&pv=1">Next Page</a><br/>