natol44 has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

I want to extract email addresses from fetched webpages. The pages are in PHP. I wrote the script below, but it does not return ALL the addresses (exactly 16, though there are more than one hundred on the page). Thank you in advance for your help and... Merry Xmas to those concerned :)


#!/usr/bin/perl

my $email_count;
my $readfile="path/hasbeenfetched.php";

open (READFILE, "<$readfile"); @all=<READFILE>; close (READFILE);
foreach my $line(@all) {
    foreach my $email (split /\s+/, $line) {
        if ( $email =~ /^[-\w.]+@([a-z0-9][a-z-0-9]+\.)+[a-z]{2,4}$/i ) {
            print $email . "\n";
            $email_count++;
        }
    }
}

print "Emails Extracted: $email_count\n";

1;

Re: Extracting email addresses
by moritz (Cardinal) on Dec 25, 2011 at 09:03 UTC

    There are several problems with your code, for example that you search for only one email address per line, and that your definition of an email address is quite questionable.

    You can use CPAN modules like Email::Find, which do most of the work for you, and probably don't suffer from these problems.
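
    A minimal sketch of what using Email::Find might look like, assuming the module is installed from CPAN. The sample text is hypothetical, since the OP's page was not posted. The callback receives a Mail::Address object and the original matched text, and whatever it returns replaces the original text, so returning it unchanged leaves the input as it was.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Email::Find;    # CPAN module, not part of core Perl

# Hypothetical sample text; stands in for the fetched page.
my $text = 'Contact <a href="mailto:alice@example.com">Alice</a> or bob@example.org.';

my @found;
my $finder = Email::Find->new(sub {
    my ($address, $original) = @_;    # Mail::Address object, original text
    push @found, $address->format;    # collect the formatted address
    return $original;                 # leave the text unchanged
});

# find() takes a reference to the text and returns the number of addresses found.
my $count = $finder->find(\$text);

print "$_\n" for @found;
print "Emails Extracted: $count\n";
```

    For a whole file, the same finder could be run once over the slurped content instead of line by line.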

      The OP actually splits each line on whitespace, and tests each chunk.

      Unfortunately, email addresses can contain whitespace. For instance <foo @ example.com> is a legal way of formatting an address - and so are <"bar baz"@example.com> and <quux(Yes, that is me!)@example.com>. Of course, the email addresses in the OP's data may be whitespace free.
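
      As a rough illustration of the split-and-test approach described above, the sketch below compares it with scanning each line for every match. The sample line and the deliberately simple pattern are assumptions (the OP's data was not posted, and the pattern is not an RFC-compliant matcher). In HTML, an address is often glued to markup or punctuation, so a chunk that must match from start to end is rejected even though it contains an address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line; stands in for one line of the fetched page.
my $line = '<a href="mailto:alice@example.com">alice@example.com</a>, bob@example.org.';

# Deliberately simple pattern (an assumption, far from RFC-complete).
my $pattern = qr/[\w.-]+\@(?:[a-z0-9-]+\.)+[a-z]{2,}/i;

# Approach 1: split on whitespace, then require the WHOLE chunk to match.
my @anchored = grep { /\A$pattern\z/ } split /\s+/, $line;

# Approach 2: scan the line and collect every match, wherever it occurs.
my @scanned = $line =~ /($pattern)/g;

# Drop duplicates while keeping first-seen order.
my %seen;
my @unique = grep { !$seen{$_}++ } @scanned;

print "anchored chunks matched: ", scalar(@anchored), "\n";
print "unique addresses found by scanning: ", scalar(@unique), "\n";
print "$_\n" for @unique;
```

      With this sample line, no whitespace-separated chunk is a bare address, so the first approach finds nothing, while the scan picks up both addresses.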

      Anyway, the Perl tarball actually comes with a regular expression to match email addresses. It's found in the test suite (t/re/reg_email.t), and requires 5.10:

      my $email = qr {
          (?(DEFINE)
            (?<address>         (?&mailbox) | (?&group))
            (?<mailbox>         (?&name_addr) | (?&addr_spec))
            (?<name_addr>       (?&display_name)? (?&angle_addr))
            (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
            (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
                                (?&CFWS)?)
            (?<display_name>    (?&phrase))
            (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

            (?<addr_spec>       (?&local_part) \@ (?&domain))
            (?<local_part>      (?&dot_atom) | (?&quoted_string))
            (?<domain>          (?&dot_atom) | (?&domain_literal))
            (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                \] (?&CFWS)?)
            (?<dcontent>        (?&dtext) | (?&quoted_pair))
            (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

            (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
            (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
            (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
            (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

            (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
            (?<quoted_pair>     \\ (?&text))

            (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
            (?<qcontent>        (?&qtext) | (?&quoted_pair))
            (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                                (?&FWS)? (?&DQUOTE) (?&CFWS)?)

            (?<word>            (?&atom) | (?&quoted_string))
            (?<phrase>          (?&word)+)

            # Folding white space
            (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
            (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
            (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
            (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
            (?<CFWS>            (?: (?&FWS)? (?&comment))*
                                (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

            # No whitespace control
            (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

            (?<ALPHA>           [A-Za-z])
            (?<DIGIT>           [0-9])
            (?<CRLF>            \x0d \x0a)
            (?<DQUOTE>          ")
            (?<WSP>             [\x20\x09])
          )

          (?&address)
      }x;
Re: Extracting email addresses
by TJPride (Pilgrim) on Dec 25, 2011 at 20:02 UTC
    Just thought I'd mention that people who farm emails are generally up to no good and we probably shouldn't be helping with this particular problem. The world already has enough spam.
Re: Extracting email addresses
by CountZero (Bishop) on Dec 26, 2011 at 03:08 UTC
    You will get better and faster help if you show us the data you are working on.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Indeed, please post a URL or static HTML as code.