Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

push (@items, $1) while $saved_page =~ m#<\/a><br>(.+)?<br>\&nbsp\; +<b>#g;
This isn't working quite right. If I change the (.+)? to (.*?), it matches 1 item, but it doesn't match a lot of junk code like my current code does. It needs to match ALL items it finds.

The code I am trying to look through looks like

</a><br>Haunted Woods 1.13 WC Piece<br>&nbsp;<b>
I'm trying to match anything that appears between the two BR tags, whether it's one word, multiple words, hyphens, numbers, etc.

Replies are listed 'Best First'.
Re: html regex problem
by marto (Cardinal) on Apr 19, 2006 at 15:27 UTC
Re: html regex problem
by ikegami (Patriarch) on Apr 19, 2006 at 16:36 UTC

    (.+)? (optionally match one or more characters)
    should be
    (.+?) (match one or more characters non-greedily)

    For example,

    $saved_page = <<'__EOI__';
    </a><br>Haunted Woods 1.13 WC Piece<br>&nbsp;<b>
    </a><br>Moo Moo 1.23 WD Piece<br>&nbsp;<b>
    </a><br>Foo Bar 1.14 WC Piece<br>&nbsp;<b>
    __EOI__

    push (@items, $1) while $saved_page =~ m#</a><br>(.+?)<br>&nbsp;<b>#g;

    print("$_\n") foreach @items;

    outputs

    Haunted Woods 1.13 WC Piece
    Moo Moo 1.23 WD Piece
    Foo Bar 1.14 WC Piece
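    To see why the placement of the ? matters, here is a minimal sketch that runs both patterns over the same data. The single-line input is an assumption made for illustration (when every item sits on its own line, the greedy form cannot run past the newline, which hides the problem), and the names @greedy and @lazy are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: two items on ONE line, so a greedy .+ has room to overshoot.
my $page = '</a><br>Haunted Woods 1.13 WC Piece<br>&nbsp;<b>'
         . '</a><br>Moo Moo 1.23 WD Piece<br>&nbsp;<b>';

my (@greedy, @lazy);

# (.+)? is an optional group around a GREEDY .+, so it grabs as much as
# it can and only backs off to the last "<br>&nbsp;<b>" on the line.
push @greedy, $1 while $page =~ m#</a><br>(.+)?<br>&nbsp;<b>#g;

# (.+?) is NON-GREEDY, so it stops at the first "<br>&nbsp;<b>" it reaches.
push @lazy, $1 while $page =~ m#</a><br>(.+?)<br>&nbsp;<b>#g;

print scalar(@greedy), " greedy match(es)\n";
print scalar(@lazy),   " non-greedy match(es)\n";
print "$_\n" for @lazy;
```

    With this input the greedy pattern should produce a single capture that swallows both items and the markup between them, while the non-greedy pattern produces one capture per item.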

    Update: If there can be newlines, don't forget to use /.../gs instead of /.../g.
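    The effect of the /s modifier can be sketched as follows. The input, with an item name wrapped across two lines, is a made-up example, and @without_s and @with_s are hypothetical names. In list context a /g match with one capture group returns every captured value, which is used here in place of the while loop.

```perl
use strict;
use warnings;

# Hypothetical input: the item name contains a newline.
my $page = "</a><br>Haunted Woods\n1.13 WC Piece<br>&nbsp;<b>";

# Without /s, "." refuses to match "\n", so the pattern cannot get
# from the first <br> to the second one.
my @without_s = $page =~ m#</a><br>(.+?)<br>&nbsp;<b>#g;

# With /s, "." matches any character, newlines included.
my @with_s = $page =~ m#</a><br>(.+?)<br>&nbsp;<b>#gs;

print scalar(@without_s), " match(es) without /s\n";
print scalar(@with_s),    " match(es) with /s\n";
```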

Re: html regex problem
by prasadbabu (Prior) on Apr 19, 2006 at 15:33 UTC

    As marto suggested, it is best to use the modules he mentioned. If you want to try it with a regex anyway, here is one way to do it. Take a look at perlre.

    I'm trying to match anything that appears between the two BR tags

    $saved_page = '</a><br>Haunted Woods 1.13 WC Piece<br>&nbsp;<b>';

    while ($saved_page =~ m#<br>((?(?!<br>).)+)<br>#g) {
        push (@items, $1)
    }

    print "@items";
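    The same "match anything that is not the start of a <br>" idea is more commonly written with a non-capturing group, (?:(?!<br>).)+, rather than a conditional. The sketch below is that variant, not the code from this reply, and the two-item input is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with two items.
my $page = '</a><br>Haunted Woods 1.13 WC Piece<br>&nbsp;<b> '
         . '</a><br>Moo Moo 1.23 WD Piece<br>&nbsp;<b>';

# (?:(?!<br>).)+ consumes one character at a time, but only while the
# text at the current position is not "<br>", so the capture can never
# run across a <br> tag.
my @items = $page =~ m#<br>((?:(?!<br>).)+)<br>#g;

print "$_\n" for @items;
```

    Because the lookahead forbids crossing a <br>, this pattern does not rely on non-greedy matching to stop at the right place.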

    Prasad