agoth has asked for the wisdom of the Perl Monks concerning the following question:

Got myself in a tangle earlier trying to extract some email addresses from some rather mangled HTML.

I ended up using two regexes where I'm sure one would be possible, but I can't find the combination required.

    $data = 'any number of td/tds><td>stuff</td><td>email@email.com</td><td>more stfff</td><td>next@next.co.uk</td><td>r.h@a.com</td>';
    my %addresses;

    # first pass: grab everything between a <td> and the </td> after an @
    @emails = ($data =~ /<td>(.*?\@.*?)<\/td>/g);

    for (@emails) {
        # second pass: throw away anything up to the last <td> in the capture
        $_ =~ s/.*<td>(.*)?/$1/g;
        $addresses{$_} = 1;
    }

    for (keys %addresses) {
        print "$_\n";
    }

Any thoughts??
Cheers

Re: regex question (new)
by kilinrax (Deacon) on Nov 10, 2000 at 21:21 UTC
    Go and read Death to Dot Star! >;->
    The problem is that your regex, while not greedy, still matches as early as possible, so the first capture starts at the first <td> and you get things like 'stuff</td><td>email@email.com'. If you replace the dots with negated character classes, so that they cannot match the angle brackets of the <td> tags, it should work perfectly:
        #!/usr/bin/perl -w
        use strict;

        my $data = 'any number of td/tds><td>stuff</td><td>email@email.com</td><td>more stfff</td><td>next@next.co.uk</td><td>r.h@a.com</td>';

        # neither side of the @ may contain an angle bracket or a second @
        my @emails = ($data =~ /<td>([^>\@]+?\@[^<\@]+)<\/td>/g);

        print join "\n", @emails;
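    To make the "matches as early as possible" point concrete, here is a small side-by-side sketch. It is only an illustration: the shortened $data string and the names @lazy and @strict are made up for this example, and the second pattern is a slightly stricter variant of the one above (it excludes both angle brackets on both sides of the @).

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $data = '<td>stuff</td><td>email@email.com</td><td>more</td><td>next@next.co.uk</td>';

    # Non-greedy dot-star: the match still begins at the first <td> it can,
    # so the capture runs across the cell boundary to reach an @.
    my @lazy = $data =~ /<td>(.*?\@.*?)<\/td>/g;

    # Negated character classes: the capture can never contain an angle
    # bracket, so a cell without an @ simply fails to match.
    my @strict = $data =~ /<td>([^<>\@]+\@[^<>\@]+)<\/td>/g;

    print "lazy:   $_\n" for @lazy;
    print "strict: $_\n" for @strict;
    ```

    The lazy version returns captures that still have the previous cell glued on the front, which is why the original code needed a second substitution to clean them up.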
    However, this is definitely a job for Email::Find:
        #!/usr/bin/perl -w
        use strict;
        use Email::Find;

        my $data = 'any number of td/tds><td>stuff</td><td>email@email.com</td><td>more stfff</td><td>next@next.co.uk</td><td>r.h@a.com</td>';

        find_emails($data, sub {
            my($email, $orig_email) = @_;
            print $email->format."\n";
            return $orig_email;
        });
(Ovid) Re: regex question (new)
by Ovid (Cardinal) on Nov 10, 2000 at 21:23 UTC
    Verifying e-mail addresses with a regex is not possible. It is possible to make some guesses, but a regex is not the best solution as one has little to no control over the structure of the e-mail address.

    Try Email::Find. It will extract the e-mails for you. Documentation can be found here.

    For the record, the following are all structurally valid e-mail addresses (though they don't go anywhere).

        Alfred Neuman <Neuman@BBN-TENEXA>
        ":sysmail"@ Some-Group. Some-Org
        Muhammed.(I am the greatest) Ali @(the)Vegas.WBA

    The above examples were published in CGI Programming with Perl, Second Edition. If you'd like more information, check out RFC 822.

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

      Thanks, but I'm not even attempting to validate them at this point. I am simply trying to identify which TD pairs have emails in them, so I can extract the addresses from the file. It's the only identifier I can even vaguely rely on......

      Update from the pod:

      item This module requires 5.005_63 or higher!

      This module runs so slow as to be unusable with 5.005 stable. I'm not sure, but it might be because I build up my search regex using lots of compiled regexes. Either way, it runs orders of magnitude faster under 5.005_63.

      that blows me out of the water for starters.....

RE: regex question (new)
by KM (Priest) on Nov 10, 2000 at 22:35 UTC
    Just to add a little to the other suggestions, you can use Email::Valid to check the MX of the email address's domain, to further help validate it (of course, you can't know if the email address itself is valid without sending email to it).

    Cheers,
    KM

Re: regex question (new)
by mirod (Canon) on Nov 10, 2000 at 21:29 UTC

    If you know your email addresses don't include < (which is _very_ dangerous to assume if you allow things like Mr Bean <bean@dumb.uk>) then you can just get the addresses with:

    @emails = ($data =~ /<td>([^<@]*\@[^<]*)<\/td>/sg);

    which just gets stuff with an at sign in a td
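
    As a rough sketch of how this gets the original job down to a single regex, the pattern can feed the de-duplicating hash directly. The sample data here (with one repeated address added) is only for illustration, and the @ inside the character class is escaped as a precaution; it is otherwise the same pattern.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $data = 'any number of td/tds><td>stuff</td><td>email@email.com</td>'
             . '<td>more stfff</td><td>next@next.co.uk</td><td>r.h@a.com</td>'
             . '<td>email@email.com</td>';

    my %addresses;

    # In scalar context, //g walks through the string one match at a time,
    # so each captured address can be stored as a hash key as it is found.
    $addresses{$1} = 1 while $data =~ /<td>([^<\@]*\@[^<]*)<\/td>/sg;

    print "$_\n" for sort keys %addresses;
    ```

    The hash keys take care of duplicates, so no second pass over the captures is needed.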

    Otherwise grab your copy of Mastering Regular Expressions (buy it if you don't have it, it's your Friend) and look for the explanations on "unrolling the loop"
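
    For anyone without the book to hand, the general shape of an unrolled loop is normal* (special normal*)*. The following is one possible sketch for this problem, not the book's own example: "normal" is any character other than <, and "special" is a < that does not start the closing </td>. The names $cell and @emails and the sample data are assumptions made for illustration.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $data = '<td>stuff</td><td>Mr Bean <bean@dumb.uk></td><td>next@next.co.uk</td>';

    # normal:  [^<]*        a run of characters other than '<'
    # special: <(?!/td>)    a '<' that does not begin the closing tag
    # Together: normal* (special normal*)*
    my $cell = qr{<td>([^<]*(?:<(?!/td>)[^<]*)*)</td>};

    # Capture the contents of every cell, then keep those containing an @.
    my @emails = grep { /\@/ } $data =~ /$cell/g;

    print "$_\n" for @emails;
    ```

    This keeps a cell such as 'Mr Bean <bean@dumb.uk>' in one piece, at the cost of a more complicated pattern.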

    Update: hey, I'm dumb or wot? If it's HTML shouldn't the < be escaped into &lt; anyway? In this case the regexp above would work in any case!