Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a page of text that I would like to grab lines from. The way I'm trying to do it is:

$page =~ /<b>(.*)<\/b>/sg;
Now the problem is that this pattern matches more than once.
My question is: each time the pattern matches, is the match stored in $1, $2, $3 and so on, or is every match stored in $1?

Replies are listed 'Best First'.
Re: Help Pattern Matching
by thelenm (Vicar) on Oct 25, 2002 at 20:29 UTC
    To answer your direct question, the text captured by the parentheses is placed into $1 each time. So you can do this:
    while ($page =~ /<b>(.*?)<\/b>/g) {
        # Now $1 contains the matched text
    }
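
    A minimal sketch of that loop in action, assuming a made-up $page string (the sample HTML is hypothetical, not the poster's actual page). It pushes each $1 onto an array to show that $1 is overwritten on every iteration rather than spilling over into $2 and $3:

```perl
use strict;
use warnings;

# Hypothetical sample input; the real page would come from elsewhere.
my $page = "<b>first</b> text <b>second</b> more <b>third</b>";

my @found;
while ($page =~ /<b>(.*?)<\/b>/g) {
    # In scalar context, /g resumes where the previous match ended,
    # and $1 holds only the capture from the current match.
    push @found, $1;
}

print "$_\n" for @found;
```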

    However, there are a few problems with the regex itself that you should be aware of. First, you're using .*, which matches as much as it can. So if your text is "<b>foo</b> <b>bar</b>", the parentheses will capture "foo</b> <b>bar"... not what you expect. Using .*? (non-greedy) will correct that problem.
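
    To see the difference side by side, here is a small sketch using the example string from above (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $text = "<b>foo</b> <b>bar</b>";

# Greedy: .* runs to the end of the string, then backtracks to the last </b>.
my ($greedy) = $text =~ /<b>(.*)<\/b>/;

# Non-greedy: .*? stops at the first </b> it can reach.
my ($lazy) = $text =~ /<b>(.*?)<\/b>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```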

    Second, you say you're matching "lines", but you're also using the /s modifier on your regex, which means that the dot will match newlines. If you don't want the dot to match newlines, then don't use /s.
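
    A quick sketch of what /s changes, assuming a hypothetical input where the bold text spans two lines:

```perl
use strict;
use warnings;

# Hypothetical input where the bold text contains a newline.
my $text = "<b>line one\nline two</b>";

# Without /s the dot cannot cross the newline, so the match fails.
my $without_s = $text =~ /<b>(.*?)<\/b>/  ? 1 : 0;

# With /s the dot also matches "\n", so the match succeeds.
my $with_s    = $text =~ /<b>(.*?)<\/b>/s ? 1 : 0;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```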

    Third, if you're extracting data from HTML, and especially if you anticipate doing this for more than your single page of text, you'll probably want to use an HTML parser module. HTML::Parser or HTML::TokeParser are examples. Good luck!

    -- Mike

    --
    just,my${.02}

      Thank you for your reply.
      I'm grabbing the web site as a single string; that's why I'm using the /s modifier. I was just trying to find an easier way of getting the parts I needed. Thank you for your help.
Re: Help Pattern Matching
by tadman (Prior) on Oct 25, 2002 at 20:29 UTC
    A few things:
    • The /s flag makes the dot match newlines as well.
    • The /g flag, in list context, makes the match return a list of all matches (or of all captured groups, if the pattern has parentheses).
    So, in effect, you will only have a single $1 at the end, but you wouldn't use it. Instead:
    my @grabbed = $page =~ m#<b>(.*?)</b>#sig;
    A few changes. First, dot-star will grab as much as it can, being greedy. This isn't good since you'll get everything from the first <B> tag to the last close, or in other words, one big match instead of smaller ones. The question mark causes the regular expression to find the shortest match instead.

    The /i modifier also catches tags that are capitalized. Another trick is to use the hash mark instead of slash, so that you don't have to escape your slashes.
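
    A rough sketch of that one-liner at work, assuming a made-up $page with mixed-case tags and a bold span that runs over two lines (the input is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical page with mixed-case tags and a multi-line bold span.
my $page = "<B>One</B> and <b>two\nlines</b> and <b>three</b>";

# List context plus /g returns every capture at once;
# /s lets the dot cross the newline, /i accepts <B> as well as <b>.
my @grabbed = $page =~ m#<b>(.*?)</b>#sig;

print scalar(@grabbed), " matches\n";
```

    Each element of @grabbed is the text between one pair of tags, so no loop over $1 is needed.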

    If you're feeling more adventuresome, you might want to check out HTML::Parser.
Re: Help Pattern Matching
by Enlil (Parson) on Oct 25, 2002 at 20:33 UTC
    The pattern you have listed would only match once as it is. The reason is the .*, which matches everything past the <b> and then backtracks until it finds a </b>, which is probably not what you want. So even with a line like:

    <b>this is something</b><b>this is something else</b>
    $1 would contain this is something</b><b>this is something else

    What you probably want is .*? instead of .*, which would lazily match up to the first </b>. I know this isn't quite the answer you were looking for, but I thought it deserved mentioning.

    -enlil

Re: Help Pattern Matching
by robartes (Priest) on Oct 25, 2002 at 20:32 UTC
    The match is only stored in $1. However, if you evaluate the match in list context, you get all matches:
    use strict;
    my $string = "camel_camel_camel";
    print join "\n", ($string =~ /camel/g);
    __END__
    camel
    camel
    camel
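
    On the $1/$2/$3 part of the question: those variables belong to the first, second and third pair of parentheses in the pattern, not to successive matches. A small sketch, assuming a hypothetical string with two capture groups per match; with /g in list context the captures of all matches come back as one flat list, which here is assigned to a hash:

```perl
use strict;
use warnings;

# Hypothetical input with two capture groups per match.
my $text = "<b>name</b>=<i>Larry</i> <b>lang</b>=<i>Perl</i>";

# Group 1 captures the bold text, group 2 the italic text.
# The flat list (name, Larry, lang, Perl) becomes key/value pairs.
my %pairs = $text =~ m#<b>(.*?)</b>=<i>(.*?)</i>#g;

print "$_ => $pairs{$_}\n" for sort keys %pairs;
```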

    CU
    Robartes-