webadept has asked for the wisdom of the Perl Monks concerning the following question:

I created a script that extracts want ads from various job websites (Monster, San Diego Union, etc.).

It gets the ad from the site's search engine, stores it in MySQL, and grabs the email address from the ad, so that if I read it and find myself interested, I can easily send my resume.

To do the latter, I use this regex:

@a_emails = ($html =~ m/([\w.%-]+\@[\w.-]+)/sig);


which works great, and it's pretty fast. I was just wondering if anyone had a better way to do the same task. I find that my regex skills aren't as good as some.

thanks

webadept.net

Replies are listed 'Best First'.
Re: Regex Tuning
by grep (Monsignor) on Mar 02, 2002 at 06:12 UTC
    That will work well for extracting most (non-UUCP-style) email-like strings, but I would run the results through Email::Valid to make sure each one is really valid. I would also make the regex a little more liberal, like /\S+@\S+/.

    If you want, you can read RFC 822, Standard for the Format of ARPA Internet Text Messages, which explains why I say to be very liberal (you can throw just about any silly character into an email address :) ).
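
    As a rough illustration of the "match liberally, then validate" idea, here is a minimal sketch using only core Perl. The helper name extract_candidates, the optional $is_valid callback, and the sample HTML are all made up for this example; the delimiter set in the pattern is just one plausible choice.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pull email-like strings out of
# a chunk of HTML with a deliberately liberal pattern, then tidy them up.
# $is_valid is an optional code ref used as a stricter second pass.
sub extract_candidates {
    my ($html, $is_valid) = @_;
    my (%seen, @found);

    # Liberal match: runs of characters around an "@", stopping only at
    # whitespace and the delimiters that usually surround addresses in HTML.
    while ($html =~ /([^\s<>"'()]+\@[^\s<>"'()]+)/g) {
        my $candidate = $1;
        $candidate =~ s/^mailto://i;      # drop a leading href scheme
        $candidate =~ s/[.,;:!?]+$//;     # drop trailing sentence punctuation
        next if $seen{lc $candidate}++;   # skip duplicates, ignoring case
        next if $is_valid && !$is_valid->($candidate);
        push @found, $candidate;
    }
    return @found;
}

# Made-up sample input for demonstration only.
my $sample = '<p>Send resumes to <a href="mailto:jobs@example.com">'
           . 'jobs@example.com</a>, or hr.dept@example.org.</p>';
print "$_\n" for extract_candidates($sample);
```

    If Email::Valid is installed, the second pass could be something like sub { Email::Valid->address($_[0]) }, since that method returns the address when it is valid and undef otherwise.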

    grep
    grep> grep clue /home/users/*
Re: Regex Tuning
by brianarn (Chaplain) on Mar 02, 2002 at 16:18 UTC
    Because I'm in such a great mood today, I'll share a code sample from my wonderful "CGI Programming With Perl, 2nd edition" book from O'Reilly. This book is great if you intend to do a lot of CGI stuff; it really helped me get started and still helps me on a regular basis. :)

    sub validate_email_address {
        my $addr_to_check = shift;

        # Strip whitespace that falls outside quoted strings
        $addr_to_check =~ s/("(?:[^"\\]|\\.)*"|[^\t "]*)[ \t]*/$1/g;

        # Character classes used to build the pattern
        my $esc         = '\\\\';
        my $space       = '\040';
        my $ctrl        = '\000-\037';
        my $dot         = '\.';
        my $nonASCII    = '\x80-\xff';
        my $CRlist      = '\012\015';
        my $letter      = 'a-zA-Z';
        my $digit       = '\d';

        my $atom_char   = qq{ [^$space<>\@,;:".\\[\\]$esc$ctrl$nonASCII] };
        my $atom        = qq{ $atom_char+ };
        my $byte        = qq{ (?: 1?$digit?$digit |
                                  2[0-4]$digit    |
                                  25[0-5]         ) };

        my $qtext       = qq{ [^$esc$nonASCII$CRlist"] };
        my $quoted_pair = qq{ $esc [^$nonASCII] };
        my $quoted_str  = qq{ " (?: $qtext | $quoted_pair )* " };

        # Local part and domain, assembled from the pieces above
        my $word        = qq{ (?: $atom | $quoted_str ) };
        my $ip_address  = qq{ \\[ $byte (?: $dot $byte ){3} \\] };
        my $sub_domain  = qq{ [$letter$digit]
                              [$letter$digit-]{0,61} [$letter$digit] };
        my $top_level   = qq{ (?: $atom_char ){2,4} };
        my $domain_name = qq{ (?: $sub_domain $dot )+ $top_level };
        my $domain      = qq{ (?: $domain_name | $ip_address ) };
        my $local_part  = qq{ $word (?: $dot $word )* };
        my $address     = qq{ $local_part \@ $domain };

        # Return the cleaned address if it matches, otherwise an empty string
        return $addr_to_check =~ /^$address$/ox ? $addr_to_check : "";
    }

    I know it's not the easiest thing to read when you're new, but hey, it'll validate just about any legal email address you can throw at it, and it even supports addresses like someone@somewhere.info, where the top-level domain is longer than the usual 2 or 3 characters of .uk or .com. It currently allows 2 to 4 characters, but that can easily be expanded by adjusting the {2,4} range in $top_level.

    HTH,
    ~Brian
      can easily be expanded by adjusting the {2,4} range in $top_level.

      Important to do that since .museum is now a valid TLD.
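
      To make the effect of that range concrete, here is a small standalone sketch that checks a bare top-level label against two different length ranges. The helper tld_ok is hypothetical, and the upper bound of 63 is my own assumption (it is the maximum length of a single DNS label), not something taken from the book.

```perl
use strict;
use warnings;

# Same character set as $atom_char in the routine above, written as a
# compiled pattern: anything except space, specials, controls, and non-ASCII.
my $atom_char = qr/[^\040<>\@,;:".\[\]\\\000-\037\x80-\xff]/;

# Hypothetical helper: does a bare top-level label fit the given length range?
sub tld_ok {
    my ($tld, $min, $max) = @_;
    return $tld =~ /\A(?:$atom_char){$min,$max}\z/ ? 1 : 0;
}

for my $tld (qw(uk com info museum)) {
    printf "%-7s {2,4}: %d   {2,63}: %d\n",
        $tld, tld_ok($tld, 2, 4), tld_ok($tld, 2, 63);
}
```

      With the original {2,4} range, "museum" is rejected because it has six characters; widening the range in $top_level the same way lets it through.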