JJB has asked for the wisdom of the Perl Monks concerning the following question:

I am using this statement

if (/(abuse\@.*?)\s/)

This works fine as long as the email address is surrounded by white space. But sometimes the email address is enclosed with special characters, like this <abuse@tel2.con.it> or [abuse@bog-cnn.jp] or \abuse@fars.stores.info\

(/(abuse\@.*?)\s/) captures the trailing non-alphanumeric character, like this: abuse@tel2.con.it> which I don’t want.

I tried this to tighten it up (/(abuse\@[\w\.-_]+)\s/) but it finds nothing.

What is the correct syntax?

20040620 Edit by Corion: Added formatting

Re: Capture Email address
by davido (Cardinal) on Jun 20, 2004 at 17:15 UTC

    Since the regexp to match an email address can become grotesquely complex (and still not work 100% of the time), you might as well utilize the work that's already been done on the subject... Email::Find.

    The description of that module is: "Find RFC822 email addresses in plain text."


    Dave

      I second the Email::Find recommendation. I've used it in the past to do exactly the same kind of thing the OP is describing.

      -- vek --
Re: Capture Email address
by muba (Priest) on Jun 20, 2004 at 17:12 UTC
    Find out what characters are allowed in domain names. IIRC, they are a-z, 0-9, "." and "-" (underscore is not valid in a hostname). Make a character class of them:
    if ( /(abuse\@[a-zA-Z0-9\.\-]+)/ )


    Note: untested
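
    A minimal sketch of how that character class behaves on the three sample strings from the question. The surrounding words ("contact ... for details") and the variable names are made up for illustration; only the regex comes from the reply above.

```perl
use strict;
use warnings;

# Sample addresses from the original question, wrapped in
# hypothetical filler text so each one sits mid-line.
my @samples = (
    'contact <abuse@tel2.con.it> for details',
    'contact [abuse@bog-cnn.jp] for details',
    'contact \abuse@fars.stores.info\ for details',
);

# Collect whatever the character-class regex captures from each line.
my @found;
for my $line (@samples) {
    # Letters, digits, dot and dash only; matching stops at the
    # first character outside the class, so no trailing \s is needed.
    if ( $line =~ /(abuse\@[a-zA-Z0-9.\-]+)/ ) {
        push @found, $1;
    }
}

print "$_\n" for @found;
```

    Because the class stops at the first character it does not list, the closing ">", "]" and "\" are left out of the capture. One caveat: an address at the end of a sentence would still pick up the trailing period, since "." is in the class.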
Re: Capture Email address
by graff (Chancellor) on Jun 21, 2004 at 02:18 UTC
    Your "tightened up" version (/(abuse\@[\w\.-_]+)\s/) has a couple of problems, and you should read enough of the "perlre" man page to understand them.

    The character class (the part between square brackets) contains \.-_ which actually defines a range covering every character in the ASCII set from the period (0x2e) to the underscore (0x5f). That range still includes backslash, square brackets and angle brackets, but not the dash (0x2d). I think you intended to do something like this:

    /(abuse\@[-.\w]+)/
    Note that \w covers alphanumerics and underscore, the dash needs to go first (or last) so that it doesn't create a range, the period does not need to be escaped in a character class definition, and you don't need/want to force a match of a following white-space character at the end of the regex.
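
    To make the range problem concrete, here is a small sketch that probes both character classes one character at a time and then runs the corrected regex. The variable names and the sample sentence are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;

my $range_class = qr/[\w\.-_]/;  # original class: "\.-_" is the range 0x2e..0x5f
my $fixed_class = qr/[-.\w]/;    # corrected class: dash first, so it is literal

# Delimiters that should not end up in an address, plus the dash.
my @probe = ('<', '>', '[', ']', '\\', '-');

my (%in_range, %in_fixed);
for my $ch (@probe) {
    $in_range{$ch} = $ch =~ $range_class ? 1 : 0;
    $in_fixed{$ch} = $ch =~ $fixed_class ? 1 : 0;
    printf "%-2s original=%d corrected=%d\n", $ch, $in_range{$ch}, $in_fixed{$ch};
}

# Hypothetical input line using one of the addresses from the question.
my $text = 'report to <abuse@bog-cnn.jp> today';
my ($addr) = $text =~ /(abuse\@[-.\w]+)/;
print "$addr\n";
```

    The original class accepts all five delimiters and rejects the dash, so a host name such as bog-cnn.jp can never be followed directly by the required \s, which may be part of why the pattern appeared to find nothing. The corrected class does the opposite: it accepts the dash and rejects the delimiters.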

    In any case, using a module as mentioned in previous replies is likely to be the better way to go.