Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Can someone help me get my regex to match?

I need to match the following

<li><a class="style5" href="http://www.site.com/page.html"> some words here</a> - <a class="style3" href="http://www.site.com/page2.html"> "some words here"</a> </li>
What I have so far is
push (@results, "$1::$2::$3"), $result_content =~ m#style="5" href=".+">\s+(.+)?</a>\s+-\s+\<a class="style3" href="(.+)?">(.+)?</a>#gis;
What I need to match: $1 = the first link's text, $2 = the 2nd link's URL, $3 = the 2nd link's text.

The links can be any links, so I'm not literally matching for this one, of course. Sometimes there are quotes in the 2nd link's text, sometimes it's a ' instead of ", so I just want to match whatever is in that text part. There are infinitely many of these.

If someone can show me how to fix this I'd be very much appreciative. Also if someone can show me how to do this with one of those modules people use for HTML regexes so it's more stable, I'd be very interested to see how it's done.

Re: 3 capture multi line regex
by Fletch (Bishop) on Jun 30, 2006 at 17:55 UTC

    People that know what they're doing (and that can't guarantee a very specific structure to their HTML) don't use regexen to parse HTML. Use HTML::TreeBuilder or HTML::TokeParser::Simple.

    In the former module you'd use look_down to find <li> elements with <a> elements of the desired styles and then pull out what you want.
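
    A rough sketch of that look_down approach, assuming HTML::TreeBuilder is installed from CPAN. The sub name extract_records is made up for illustration, and the class names style5 and style3 are taken from the sample in the question:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Hypothetical helper: returns one "text1::url2::text2" string per <li>
    # that contains both an <a class="style5"> and an <a class="style3">.
    sub extract_records {
        my ($html) = @_;
        my $tree = HTML::TreeBuilder->new_from_content($html);
        my @results;
        for my $li ($tree->look_down(_tag => 'li')) {
            # In scalar context look_down returns the first match or undef
            my $first  = $li->look_down(_tag => 'a', class => 'style5') or next;
            my $second = $li->look_down(_tag => 'a', class => 'style3') or next;
            push @results, join '::',
                $first->as_trimmed_text,
                $second->attr('href'),
                $second->as_trimmed_text;
        }
        $tree->delete;    # release the parse tree
        return @results;
    }

    my $sample = q{<li><a class="style5" href="http://www.site.com/page.html"> some words here</a> - <a class="style3" href="http://www.site.com/page2.html"> "some words here"</a> </li>};
    print "$_\n" for extract_records($sample);
    ```

    Note that class => 'style5' compares the whole attribute value, so an element with several classes would need a regex or a code ref as the criterion instead.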

    With the latter you'd be interested in <li>. Once you see one, you look for the next two anchors and again extract what you're interested in.

Re: 3 capture multi line regex
by Ieronim (Friar) on Jun 30, 2006 at 18:12 UTC
    This code will help you do the job you want:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $html = <<'HTML';
    <li><a class="style5" href="http://www.site.com/page.html"> some words here</a> - <a class="style3" href="http://www.site.com/page2.html"> "some words here"</a> </li>
    HTML

    my $regex = qr{
        <a \s+ class\s*=\s*"style5" \s+
        href\s*=\s*[\"\'] [^\"\']+ [\"\']\s*>     # first href (not captured)
        \s*([^<>]+?)\s*                           # text inside first <a></a> (captured)
        </a>\s*
        -\s*
        <a \s+ class\s*=\s*"style3" \s+
        href\s*=\s*[\"\'] ([^\"\']+) [\"\']\s*>   # second href (captured)
        \s*([^<>]+?)\s*                           # text inside second <a></a> (captured)
        </a>
    }xi;

    # Print the captures only if the match succeeded
    if ($html =~ $regex) {
        print join "\n", $1, $2, $3, "";
    }
    Just modify the style names in the qr'ed regex and use it.

    But if you want something more than a one-time solution for a very specific case, it would be better to study the HTML parsing modules mentioned in the comment above.

      Can you explain what these two lines are doing? Like, what is the 2nd part in the first line doing? And how about the <> in the 2nd line?
      href=[\"\'] ([^\"\']+) [\"\']>   #second href (captured)
      \s*([^<>]+?)\s*                  #text inside second <a></a>
        I modified the regex a bit (I found a bug there), so the two lines you mentioned become
        href\s*=\s*[\"\'] [^\"\']+ [\"\']\s*>   # first href (not captured)
        \s*([^<>]+?)\s*                         # text inside first <a></a> (captured)
        The first line matches href= followed by an opening quote (single or double: ["']), then a string containing no quote characters, then a closing quote. I used a negated character class: [^"'] means any character that is NOT " or '.
        The next line simply matches text with no tags inside it. If you think there may be other tags inside your link, it would be better to use
        \s*(.+?)\s*   # non-greedy capturing of everything till the next </a>
        instead.
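
        To see the difference between the two, here is a small sketch that uses only core Perl. The helper link_text and both sample strings are invented for illustration; they are not part of the regex above:

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: returns the link text captured by the given
        # body pattern, or undef if the link does not match at all.
        sub link_text {
            my ($html, $body) = @_;
            return $html =~ m{<a\s[^>]*>\s*($body)\s*</a>} ? $1 : undef;
        }

        my $plain  = q{<a href="x.html"> some words </a>};
        my $nested = q{<a href="x.html"> some <b>bold</b> words </a>};

        my $no_tags = qr{[^<>]+?};   # negated class: cannot cross a < or >
        my $lazy    = qr{.+?}s;      # non-greedy: grows until </a> can follow

        for my $html ($plain, $nested) {
            for my $body ($no_tags, $lazy) {
                my $text = link_text($html, $body);
                print defined $text ? "matched: $text\n" : "no match\n";
            }
        }
        ```

        The negated class refuses the link that contains <b>, while the non-greedy version accepts it and captures the inner tags along with the text. Inside a longer pattern the non-greedy form can also stretch across several links if the first </a> is not followed by what the rest of the regex expects, so it is worth testing against real pages.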
Re: 3 capture multi line regex
by wfsp (Abbot) on Jul 01, 2006 at 06:15 UTC
    I agree with Fletch above. This uses HTML::TokeParser::Simple. It pushes all the anchor hrefs and anchor text onto an array. You can then easily choose which you need.

    imo this is much easier and more reliable than trying to build, debug and maintain a complex regex on something as loose as HTML.

    It's also easy to adapt if the spec changes (and doesn't it always) or to other parsing tasks as they arise.

    I believe using a parser (and there are many to suit all tastes) is a big win on all counts and I highly recommend it.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    use HTML::TokeParser::Simple;

    # Slurp the sample HTML from the __DATA__ section
    my $html;
    {
        local $/;
        $html = <DATA>;
    }

    my $p = HTML::TokeParser::Simple->new(\$html);
    $p->unbroken_text(1);

    my ($in_li, @record, @db);

    while (my $t = $p->get_token) {
        $in_li++, next if $t->is_start_tag('li');
        next unless $in_li;
        if ($t->is_end_tag('li')) {
            push @db, [@record];
            @record = ();    # start a fresh record for the next <li>
            $in_li = 0;
            next;
        }
        if ($t->is_start_tag('a')) {
            # each anchor adds two fields: its href, then its text
            push @record, $t->get_attr('href');
            my $text = $p->get_trimmed_text('/a');
            push @record, $text;
        }
    }

    #die Dumper \@db;

    # the first link's text, the 2nd link's URL, the 2nd link's text
    for my $record (@db) {
        my @field = @{$record};
        print $field[1], "::", $field[2], "::", $field[3], "\n";
    }
    __DATA__
    <li>
    <a class="style5" href="http://www.site.com/page.html">
    some words here
    </a> -
    <a class="style3" href="http://www.site.com/page2.html">
    "some words here"
    </a>
    </li>
    output:
    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    some words here::http://www.site.com/page2.html::"some words here"
    > Terminated with exit code 0.