Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

My request seems incredibly simple, but I'm at a loss. I'd like to know how I can store all of the strings that a regexp matches into an array. For example, my specific task was to extract e-mail addresses from a given page (don't worry, it's not for spam). Although I was given a suggestion to split everything up and print if it had an @ character, that screamed inefficiency... I did, however, find in the perldocs something about @-, but that seemed way too complicated for something so basic.

Replies are listed 'Best First'.
Re: Capturing RegExp Matches
by Abigail-II (Bishop) on Jul 03, 2002 at 14:06 UTC
    Parsing email addresses out of text is far from trivial. *Any* ASCII character can be part of an email address, including NUL characters, white space and control characters. Here are some examples of valid email addresses:
    *@example.net "\""@foo.bar fred&barny@example.com ---@example.com foo-bar@example.net "127.0.0.1"@[127.0.0.1] Muhammed.(I am the greatest) Ali @(the)Vegas.WBA ':; $@[] *()@[]

    As for the general question, "how do I store all strings that a regex matches", just use the regex in list context, with a /g modifier. If you have capturing parens in your regex, you'll have to put a set of parens around the whole regex, and filter out the submatches (but it's probably easier to turn the capturing parens into non-capturing).
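
A minimal sketch of the list-context approach described above. The sample string and the patterns are hypothetical and are not taken from the original post; they only illustrate how capturing and non-capturing parentheses change what `/g` returns.

```perl
use strict;
use warnings;

# Hypothetical sample text, used only for illustration.
my $text = "id=10, id=25, id=7";

# No capturing parens: list-context /g returns every whole match.
my @whole = $text =~ /id=\d+/g;

# Non-capturing (?:...) groups leave that behaviour unchanged.
my @grouped = $text =~ /(?:id|key)=\d+/g;

# With capturing parens, list-context /g returns the captures instead.
my @numbers = $text =~ /id=(\d+)/g;

# Inner capturing parens plus an outer set around the whole pattern:
# each match contributes two items, so keep every first item of a pair.
my @pairs    = $text =~ /((id|key)=\d+)/g;
my @filtered = @pairs[ grep { $_ % 2 == 0 } 0 .. $#pairs ];

print "@whole\n@grouped\n@numbers\n@filtered\n";
```

The last two statements show why switching to non-capturing parentheses is usually less work than filtering out the submatches afterwards.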

    Abigail

Re: Capturing RegExp Matches
by Chady (Priest) on Jul 03, 2002 at 13:29 UTC

    Your quest will get harder and harder unless you invest your power in one of the Email::* modules, which you can easily find on CPAN.

    They are all written and tested by people who know what they are doing, and they will save you a lot of debugging.


    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady | http://chady.net/
Re: Capturing RegExp Matches
by arturo (Vicar) on Jul 03, 2002 at 13:38 UTC

    The general answer to this sort of problem is to use capturing parentheses in your regex in combination with the /g modifier:

    # assume $regex has been built with qr// and matches what you want
    my $regex = qr/foobar/;

    # assuming $data contains all the data you want to scan from
    my @things_im_interested_in = ( $data =~ /($regex)/g );

    # or, if it isn't and you are doing this from multiple data
    # strings
    push @things_im_interested_in, $data_chunk =~ /($regex)/g;

    In list context, a match with the /g modifier returns all the matches in the string, so those match operators return a list of the matching parts of the string.
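
For contrast, a small sketch of how `/g` behaves in list versus scalar context. The `$data` string is a hypothetical example, and `$regex` simply mirrors the `qr//` pattern from the post above. The `@-` array is the one the original question mentions; here only `$-[0]`, the start offset of the last successful match, is used.

```perl
use strict;
use warnings;

# Hypothetical data; $regex mirrors the qr// pattern from the post above.
my $data  = "foobar baz foobar";
my $regex = qr/foobar/;

# List context: every match is returned at once.
my @all = $data =~ /($regex)/g;

# Scalar context: each evaluation finds the next match, so a while loop
# visits the matches one at a time. $-[0] is the start offset of the match.
my @starts;
while ( $data =~ /$regex/g ) {
    push @starts, $-[0];
}

print scalar(@all), " matches starting at offsets @starts\n";
```

With this input the script reports 2 matches, at offsets 0 and 11. The list-context form is enough when only the matched strings are needed; the loop form is useful when positions matter too.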

    However, as a read through the Perl FAQ (or even a search on this site for, say, "Email Address") will reveal, using regexes to match email addresses is a dicey issue in the first place.

    I mistrust all systematizers and avoid them. The will to a system shows a lack of integrity -- F. Nietzsche

Re: Capturing RegExp Matches
by dda (Friar) on Jul 03, 2002 at 13:27 UTC
    Are you talking about something like this?
    #!/usr/bin/perl -w
    use strict;

    my $page = <<__EOT;
    jksjdsjk some\@one.com nnbcx jdsjl;'pejbkscd (aaa\@bbb.com) sdkmlsd
    __EOT

    my @emails = ($page =~ /\b\S+?\@\S+?\b/gs);
    foreach (@emails) {
        print "$_\n";
    }

    --dda

      Your code is choking on its own data.

      You are matching the word boundary before the .com, so the .com is getting stripped out.

      You cannot filter an email address with one simple regex.
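
A short sketch that reproduces the effect described here, using the same sample text as the script above. The greedy variant is an assumption added only for comparison; it is not a real email address parser and still fails on most of the unusual addresses listed earlier in the thread.

```perl
use strict;
use warnings;

# Same sample text as the script above, in a non-interpolating q{} string.
my $page = q{jksjdsjk some@one.com nnbcx jdsjl;'pejbkscd (aaa@bbb.com) sdkmlsd};

# Original pattern: the lazy \S+? after the @ stops at the first word
# boundary it reaches, which is the one between "one" and ".com".
my @lazy = $page =~ /\b\S+?\@\S+?\b/g;

# Greedy variant (for comparison only): \S+ runs to the last word
# boundary in the non-space run, so ".com" is kept.
my @greedy = $page =~ /\b\S+\@\S+\b/g;

print "lazy:   @lazy\n";
print "greedy: @greedy\n";
```

On this input the lazy pattern yields `some@one` and `aaa@bbb`, while the greedy one yields `some@one.com` and `aaa@bbb.com`.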


      He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

      Chady | http://chady.net/
        Ohh, stupid me :) Thanks.

        --dda