Regex Tuning

webadept has asked for the wisdom of the Perl Monks concerning the following question:

I created a script which extracts want ads from various job websites (Monster, San Diego Union, etc.).

It gets the ad from the site's search engine, stores it in MySQL, and grabs the email address from the ad, so that after I read it and find myself interested, I can easily send my resume.

To do this I use this regex:

  @a_emails = ($html =~ m/([\w.%-]+\@[\w.-]+)/g);

which works great, and it's pretty fast. I was just wondering if anyone had anything better to do the same task. I find that my regex skills aren't as good as some.
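For readers who want to try the extraction step, here is a minimal sketch of how it might be used. The sample `$html` text, the trailing-dot cleanup, and the de-duplication are assumptions added for illustration and are not part of the original script. The hyphen is placed last in each character class so it is always taken literally.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for a fetched job ad.
my $html = <<'END';
<p>Send resumes to <a href="mailto:jobs@example.com">jobs@example.com</a>
or to Hiring.Manager@Example.org.</p>
END

# In list context with /g, the match returns every captured address.
my @found = $html =~ m/([\w.%-]+\@[\w.-]+)/g;

# Clean up and de-duplicate (case-insensitively), keeping first-seen order.
my %seen;
my @emails;
for my $addr (@found) {
    $addr =~ s/\.+$//;    # drop a sentence-ending dot picked up by [\w.-]+
    push @emails, $addr unless $seen{ lc $addr }++;
}

print "$_\n" for @emails;
```

The pattern only finds candidates. It makes no attempt to check that what it found is a deliverable address.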

thanks

webadept.net


Replies are listed 'Best First'.

Re: Regex Tuning
by grep (Monsignor) on Mar 02, 2002 at 06:12 UTC

Email::Valid

/\S@\S/

RFC822: Standard for the Format of ARPA Internet Text Messages
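As a rough sketch of how the two checks named above might be combined: a cheap `/\S@\S/` sanity test first, then the stricter check from the CPAN module Email::Valid when it is installed. The helper name `looks_valid` and the fallback behaviour are assumptions made for this example, not something from the reply.

```perl
use strict;
use warnings;

# Email::Valid comes from CPAN; if it is missing, only the loose check runs.
my $have_email_valid = eval { require Email::Valid; 1 };

# looks_valid() is a made-up helper name for this sketch.
sub looks_valid {
    my ($addr) = @_;

    # Cheap sanity check: something non-blank on both sides of an @.
    return 0 unless defined $addr && $addr =~ /\S\@\S/;
    return 1 unless $have_email_valid;

    # Email::Valid->address returns the address if valid, undef otherwise.
    return defined( Email::Valid->address($addr) ) ? 1 : 0;
}

for my $candidate ( 'jobs@example.com', 'no at sign here' ) {
    printf "%-20s %s\n", $candidate, looks_valid($candidate) ? 'ok' : 'rejected';
}
```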

grep

grep> grep clue /home/users/*


Re: Regex Tuning
by brianarn (Chaplain) on Mar 02, 2002 at 16:18 UTC


sub validate_email_address {
  my $addr_to_check = shift;
  # Strip whitespace that is not inside a quoted string
  $addr_to_check =~ s/("(?:[^"\\]|\\.)*"|[^\t "]*)[ \t]*/$1/g;

  my $esc         = '\\\\';
  my $space       = '\040';
  my $ctrl        = '\000-\037';
  my $dot         = '\.';
  my $nonASCII    = '\x80-\xff';
  my $CRlist      = '\012\015';
  my $letter      = 'a-zA-Z';
  my $digit       = '\d';
  my $atom_char   = qq{ [^$space<>\@,;:".\\[\\]$esc$ctrl$nonASCII]};
  my $atom        = qq{ $atom_char+ };
  my $byte        = qq{ (?: 1?$digit?$digit |
                            2[0-4]$digit    |
                            25[0-5]         ) };
  my $qtext       = qq{ [^$esc$nonASCII$CRlist"] };
  my $quoted_pair = qq{ $esc [^$nonASCII] };
  my $quoted_str  = qq{ " (?: $qtext | $quoted_pair )* " };
  my $word        = qq{ (?: $atom | $quoted_str ) };
  my $ip_address  = qq{ \\[ $byte (?: $dot $byte ){3} \\] };
  my $sub_domain  = qq{ [$letter$digit]
                        [$letter$digit-]{0,61} [$letter$digit]};
  my $top_level   = qq{ (?: $atom_char ){2,4} };
  my $domain_name = qq{ (?: $sub_domain $dot )+ $top_level };
  my $domain      = qq{ (?: $domain_name | $ip_address ) };
  my $local_part  = qq{ $word (?: $dot $word )* };
  my $address     = qq{ $local_part \@ $domain };

  return $addr_to_check =~ /^$address$/ox ? $addr_to_check : "";
}

The allowed top-level domain length can easily be expanded by adjusting the {2,4} range in $top_level.

~Brian


Re: Re: Regex Tuning
by Matts (Deacon) on Mar 02, 2002 at 17:15 UTC

"... can easily be expanded by adjusting the {2,4} range in $top_level."

Important to do that since .museum is now a valid TLD.
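To illustrate the point, here is a small sketch comparing the {2,4} limit with a widened one. The patterns are simplified, letters-only stand-ins for the `$top_level` piece of the sub (which builds on `$atom_char`), and the upper bound of 6 is just an assumption large enough for "museum". In the sub itself, the equivalent change would be to the `{2,4}` quantifier on the `$top_level` line.

```perl
use strict;
use warnings;

# Simplified stand-ins for $top_level, letters only (an assumption of this sketch).
my $tld_short = qr/^[a-z]{2,4}$/i;    # mirrors the {2,4} limit
my $tld_long  = qr/^[a-z]{2,6}$/i;    # widened upper bound

my %result;
for my $tld (qw(com info museum)) {
    $result{$tld} = {
        short => ( $tld =~ $tld_short ? 1 : 0 ),
        long  => ( $tld =~ $tld_long  ? 1 : 0 ),
    };
    printf "%-7s {2,4}: %-3s  {2,6}: %s\n", $tld,
        ( $result{$tld}{short} ? 'yes' : 'no' ),
        ( $result{$tld}{long}  ? 'yes' : 'no' );
}
```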
