in reply to 3 capture multi line regex

This code will help you do the job you want:
#!/usr/bin/perl
use strict;
use warnings;

my $html = <<'HTML';
<li><a class="style5" href="http://www.site.com/page.html"> some words here</a> -
<a class="style3" href="http://www.site.com/page2.html"> "some words here"</a>
</li>
HTML

my $regex = qr{
    <a \s+ class\s*=\s*"style5" \s+
    href\s*=\s*[\"\'] [^\"\']+ [\"\']\s*>      #first href (not captured)
    \s*([^<>]+?)\s*                            #text inside first <a></a> (captured)
    </a>\s* -\s*
    <a \s+ class\s*=\s*"style3" \s+
    href\s*=\s*[\"\'] ([^\"\']+) [\"\']\s*>    #second href (captured)
    \s*([^<>]+?)\s*                            #text inside second <a></a> (captured)
    </a>
}xi;

# Print the three captures only if the pattern actually matched.
print join "\n", $1, $2, $3, "" if $html =~ $regex;
Just modify the style names in the qr'ed regex and use it.

But if you want more than a one-time solution for one very specific case, you would be better off studying the HTML parsing modules mentioned in the comment above.

Re^2: 3 capture multi line regex
by Anonymous Monk on Jun 30, 2006 at 18:23 UTC
    Can you explain what these two lines are doing? Like what is the 2nd part in the first line doing? and how about the <> in the 2nd line?
    href=[\"\'] ([^\"\']+) [\"\']>   #second href (not captured)
    \s*([^<>]+?)\s*                  #text inside second <a></a>
      I modified the regex a bit (I found a bug there), so the two lines you mentioned become
      href\s*=\s*[\"\'] [^\"\']+ [\"\']\s*>   #first href (not captured)
      \s*([^<>]+?)\s*                         #text inside first <a></a> (captured)
      In the first line, I look for an href= string followed by a quoted value. The quotes may be single or double (["']), and the value between them must contain no quote characters. For that I used a negated character class: [^"'] means any character that is NOT one of ["'].
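To see the negated character class on its own, here is a minimal sketch. The sample tags and the names $href_re, @samples and @urls are made up for illustration, and the URL is captured here only so that the result can be printed.

```perl
use strict;
use warnings;

# Same idea as the href part of the regex in this thread: an opening quote,
# one or more characters that are NOT quotes, then a closing quote.
my $href_re = qr{ href \s*=\s* ["'] ([^"']+) ["'] }xi;

# Hypothetical sample tags, one per quoting style.
my @samples = (
    q{<a href="http://www.site.com/page.html">},
    q{<a href='http://www.site.com/page2.html'>},
);

my @urls;
for my $tag (@samples) {
    # [^"']+ cannot run past the closing quote, so $1 is exactly the URL.
    push @urls, $1 if $tag =~ $href_re;
}

print "$_\n" for @urls;
```

Both quoting styles give the bare URL, because the class stops at the first quote character of either kind.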
      In the next line I simply match text that has no tags in it. If you think there may be other tags inside your link, it would be better to use
      \s*(.+?)\s* # non-greedy capture of everything up to the next </a>
      instead.
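A small sketch of the difference between the two alternatives, using a made-up link ($link is hypothetical) whose text contains a nested tag and a line break. One thing to keep in mind: "." does not match a newline unless the /s modifier is added, so the sketch assumes you would want /s when link text can span lines.

```perl
use strict;
use warnings;

# Hypothetical link whose text contains a nested <b> tag and a line break.
my $link = qq{<a href="http://www.site.com/page.html"> some <b>bold</b>\nwords </a>};

# [^<>]+? cannot cross the "<" of the nested tag, so this match fails
# and $plain stays undefined.
my ($plain) = $link =~ m{ <a [^>]* > \s* ([^<>]+?) \s* </a> }x;

# .+? takes everything up to the next </a>; /s lets "." match the newline.
my ($nested) = $link =~ m{ <a [^>]* > \s* (.+?) \s* </a> }xs;

print defined $plain  ? "plain:  $plain\n"  : "plain:  no match\n";
print defined $nested ? "nested: $nested\n" : "nested: no match\n";
```

The price of (.+?) is that the capture now contains the inner markup as well, which is one more reason to consider a real HTML parser once the input gets less regular.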
        Hi.

        Your regex matched fine the first time, but I need to put all occurrences into an array. I can't get the array to hold anything now.

        push (@results, "$1::$2::$3"), $result_content =~ m/$regex/;
        I tried adding /g to the end but it doesn't contain anything at all. I tried adding /g to the regex itself but it errors out.

        What am I doing wrong?
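For reference, a minimal sketch of one way to collect every match. Two things are worth knowing about the statement quoted above. First, /g is a flag of the match operator and is not accepted by qr//, which is why adding it to the compiled regex is an error. Second, because of the comma, the push runs before the match, so $1, $2 and $3 are not yet set when the string is built. The input below is a made-up two-entry sample in the same shape as the HTML in this thread.

```perl
use strict;
use warnings;

# Hypothetical input: two <li> entries shaped like the thread's HTML.
my $result_content = <<'HTML';
<li><a class="style5" href="http://www.site.com/a.html"> first</a> -
<a class="style3" href="http://www.site.com/a2.html"> "first quote"</a></li>
<li><a class="style5" href="http://www.site.com/b.html"> second</a> -
<a class="style3" href="http://www.site.com/b2.html"> "second quote"</a></li>
HTML

# The regex from this thread, compiled without /g.
my $regex = qr{
    <a \s+ class\s*=\s*"style5" \s+ href\s*=\s*["'] [^"']+ ["']\s*>
    \s*([^<>]+?)\s* </a>\s* -\s*
    <a \s+ class\s*=\s*"style3" \s+ href\s*=\s*["'] ([^"']+) ["']\s*>
    \s*([^<>]+?)\s* </a>
}xi;

my @results;
# /g goes on the match itself; in scalar context each pass through the
# loop finds the next occurrence and sets $1, $2 and $3 before the push.
while ($result_content =~ /$regex/g) {
    push @results, join '::', $1, $2, $3;
}

print "$_\n" for @results;
```

The loop stops by itself when no further match is found, and @results then holds one joined string per matched pair of links.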