Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, monks,

I am trying to extract an email address from a log file that has other information in it. When I run the following code:
#!perl -w
open(TRACKDATA, "> C:\\myData") or die "can't open data";
open(TRACKLOG, "C:\\myLog,txt") or die "can't open log";
while (<TRACKLOG>) {
    chomp;
    ($email) = /\s(.*@.*)\s/;
    print TRACKDATA "$email\n";
}

It gets the email address, but it also returns everything after it. It's been a while since I've worked with regular expressions, but I thought you could capture everything within "()" and stop it with characters before and after. What am I missing? Thanks in advance for any prayers or meditations.

Replies are listed 'Best First'.
Re: RegEx for email help
by hardburn (Abbot) on Jan 13, 2004 at 17:29 UTC

    The reason your current regex won't work is that the asterisk is greedy, i.e. it grabs everything it can. Since .* is allowed to match anything, it grabs everything up to the last bit of whitespace (which is required for the regex to match at all).
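
    A minimal sketch of that greedy behaviour, assuming a made-up log line (the line's contents and the variable names are hypothetical, not taken from the original log):

```perl
use strict;
use warnings;

# Hypothetical log line, invented for illustration only.
my $line = 'user: alice@example.com logged in from 10.0.0.1';

# Greedy: the second .* runs as far right as it can, so the capture
# only stops just before the LAST whitespace on the line.
my ($greedy) = $line =~ /\s(.*\@.*)\s/;

# Bounded: \S+ cannot cross whitespace, so the capture stops at the
# end of the address itself.
my ($bounded) = $line =~ /\s(\S+\@\S+)\s/;

print "greedy:  $greedy\n";
print "bounded: $bounded\n";
```

    With this line the greedy capture should include the address plus the trailing words up to the last space, while the bounded capture should hold only the address.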

    A broader issue is that matching an e-mail address is quite a bit harder than most people think. See Email::Valid, which contains the generally accepted regex for matching e-mail addresses (it's several thousand characters long, and it doesn't even match embedded comments, as allowed by RFC 822).

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

      I concur with all of the comments made by hardburn above: matching email addresses is a much more complex task than most people realise. Generally, however, I lean towards the use of Email::Valid::Loose in place of Email::Valid, as this allows for better matching as per RFC 2822, which supersedes RFC 822, and permits the . (period) character in the local-part portion of the email address.

      Additionally, depending upon your matching requirements, it may be worth modifying URI::Find so that it matches with the regular expression from Email::Valid::Loose ($Email::Valid::Loose::Addr_spec_re) in place of its own ($URI::scheme_re).

       

      perl -le "print unpack'N', pack'B32', '00000000000000000000001010101011'"

Re: RegEx for email help
by Roy Johnson (Monsignor) on Jan 13, 2004 at 17:33 UTC
    It's a greedy-match problem. If you don't want to match whitespace, say so: ($email) = /(\S+\@\S+)/. There are more rigorous ways to check email addresses, but assuming you just need to find one set off by spaces, this should do it.
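
    As a rough sketch, the original loop with that pattern dropped in might look like the following. The log text, the in-memory filehandle, and the @emails array are assumptions made so the example runs without any files; the original script reads and writes real files instead.

```perl
use strict;
use warnings;

# Hypothetical log contents standing in for the poster's log file.
my $log_text = <<'END_LOG';
user: alice@example.com logged in from 10.0.0.1
no address on this line
bob@example.org signed out
END_LOG

# Read from an in-memory filehandle so the sketch needs no files on disk.
open(my $log, '<', \$log_text) or die "can't open log: $!";

my @emails;
while (my $line = <$log>) {
    chomp $line;
    # Skip lines with no address rather than printing an undefined value.
    next unless $line =~ /(\S+\@\S+)/;
    push @emails, $1;
}
close $log;

print "$_\n" for @emails;
```

    Checking that the match succeeded before using the capture also avoids the "uninitialized value" warning that -w would raise on lines without an address.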

    The PerlMonk tr/// Advocate