Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm very new to regexes. I know there are modules that can do this, specifically some HTTP modules, but I'd rather do things the hard way to learn regexes better, so I'd like help doing it this way instead of with a module.

I know how to apply one regex to a variable and either s/// one or many things, but I only know how to match ONE thing from it.

What I want to do is take a page from LWP and get all the links that look like page.php?id=379744077, or page.php?id=#########.

Each page will have TONS of matches, but I'm not sure how to handle this dynamically.

If it's possible, it'd be nice storing all the matches inside an array.

use LWP::Simple;
my $start = "link.here";
my $content = get($start);
$content =~ m//; # ?

Replies are listed 'Best First'.
Re: newbie to regex: all matches
by graff (Chancellor) on Mar 10, 2006 at 02:01 UTC
    Others have alluded to the issues and better ways for handling html data, but then again, your particular task looks pretty basic, and anyway the real question is "how to use a regex match in such a way that we handle every instance of a pattern in a given string?"

    The basic answer to that is the "g" modifier, placed at the end of the regex. Some examples:

    $_ = <<ETX;
    Long text string foo_1...
    with multiple foo_2 lines...
    and who cares foo_3 what else...
    ETX
    while ( /foo_\d+(.*)/g ) {   # loop over every match
        print "found '$1' between foo and end-of-line\n";
    }
    s/foo_/bar=/g;   # replace all occurrences of foo with bar
    my @bars = ( /(bar=\d+)/g );   # capture all occurrences to an array
    print "@bars\n";
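Applied to the links in the original question, the list-context form might look like the sketch below. The sample `$content` string is a hypothetical stand-in for whatever `get($start)` returns, and the pattern assumes the links appear literally as `page.php?id=` followed by digits.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the page content that get($start) would return.
my $content = join "\n",
    '<a href="page.php?id=379744077">first</a>',
    '<a href="page.php?id=12345">second</a>',
    '<a href="other.php?id=999">not wanted</a>';

# List context plus /g: every match of the parenthesised group lands in the array.
# The dot and the question mark are escaped so that they match literally.
my @links = $content =~ /(page\.php\?id=\d+)/g;

print "$_\n" for @links;
```

Moving the parentheses so that they enclose only `\d+` would collect just the id numbers instead of the whole link.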
Re: newbie to regex: all matches
by GrandFather (Saint) on Mar 10, 2006 at 00:32 UTC

    Don't use regexen for parsing HTML - it leads to an unhappy life! Instead use modules like HTML::TreeBuilder. Look at the look_down method of HTML::Element.


    DWIM is Perl's answer to Gödel
      I know all about using that module, but I want to do it myself since I'll get experience using regexes this way. It's not a script that will ever do anything much; it's just for playing and testing. So if it doesn't always match if the page changes, it doesn't affect anything. I just want to learn how to match every occurrence into an array.

      Thanks.

        use warnings;
        use strict;

        my $str = do { local $/; <DATA> };
        my @matches = $str =~ /(\w+):(.*)/g;
        print "@matches\n";

        __DATA__
        1: one
        2: two
        3: three

        Prints:

        1  one 2  two 3  three

Re: newbie to regex: all matches
by Anonymous Monk on Mar 10, 2006 at 00:32 UTC
    Pass a closure to HTML::LinkExtractor that fills an array with every link matching /page[.]php[?]id=(\d+)/
Re: newbie to regex: all matches
by Anonymous Monk on Mar 10, 2006 at 00:13 UTC
    I tried the following, but it makes the script hang indefinitely.
    my $content = get($start);
    my @matches;
    while ($content) {
        push(@matches, $1) if ($content =~ m/(page.php?id=\d+)/)
    }
    print @matches;
      Be sure to put a \ before your ? (that is, write \?), as ? is a special character in a regex.

      UPDATE adding code

      my @matches = ($content =~ m/(image.php\?id=\d+)/g);

      UPDATE 2 Fixed typo of /? to \? thanks to GrandFather



      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

      sulfericacid
      This will run forever because the while loop will run as long as $content is true. Nothing in the loop changes $content, so assuming it's true to begin with, the loop never ends.
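One way to keep the loop but make it terminate is to put the match itself in the loop condition, with /g. This is only a sketch: the sample `$content` string is hypothetical and stands in for the page that `get($start)` would return.

```perl
use strict;
use warnings;

# Hypothetical sample text; in the thread this would come from get($start).
my $content = 'page.php?id=1 junk page.php?id=22 junk page.php?id=333';

my @matches;

# In scalar context, a /g match resumes where the previous match ended and
# returns false once no further match is found, so the loop terminates.
while ( $content =~ /(page\.php\?id=\d+)/g ) {
    push @matches, $1;
}

print "@matches\n";
```

Here the loop condition changes on every pass because the regex engine remembers its position in `$content`, which is what the original `while ($content)` test was missing.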