regex question (new)

agoth has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regex question (new) by kilinrax (Deacon) on Nov 10, 2000 at 21:21 UTC
Go and read Death to Dot Star! >;->
The problem is that your regex, while not greedy, still matches as early as possible, causing it to match things like '`stuff</td><td>email@email.com`'. If you replace the dots with negated character classes, preventing them from matching the angle brackets of the `<td>` tags, then it should work perfectly:
`#!/usr/bin/perl -w
use strict;
my $data = 'any number of td/tds><td>stuff</td><td>email@email.com</td><td>more stfff</td><td>next@next.co.uk</td><td\>r.h@a.com</td>';
my @emails = ($data =~ /<td>([^>\@]+?\@[^<\@]+)<\/td>/g);
print join "\n", @emails;`
However, this is definitely a job for Email::Find:
`#!/usr/bin/perl -w
use strict;
use Email::Find;
my $data = 'any number of td/tds><td>stuff</td><td>email@email.com</td><td>more stfff</td><td>next@next.co.uk</td><td\>r.h@a.com</td>';
find_emails($data, sub {
    my($email, $orig_email) = @_;
    print $email->format."\n";
    return $orig_email;
});`
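To see why a lazy quantifier alone does not help, here is a minimal sketch comparing a dot-based pattern with a negated-class one. The dot-based pattern is only an assumption about the general shape of the regex in the question (the original code is not shown in this thread), and the sample row and sub names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample row: one non-email cell followed by one email cell.
my $row = '<td>stuff</td><td>email@email.com</td>';

# Assumed shape of the problematic pattern: lazy dots on both sides of the @.
# The match still starts at the first <td>, so the dot walks across the tags.
sub lazy_dot_matches {
    my ($html) = @_;
    return $html =~ /<td>(.+?\@.+?)<\/td>/g;
}

# Negated classes: neither side of the @ may cross a tag boundary.
sub negated_class_matches {
    my ($html) = @_;
    return $html =~ /<td>([^<>\@]+\@[^<>\@]+)<\/td>/g;
}

print "lazy dot:      $_\n" for lazy_dot_matches($row);
print "negated class: $_\n" for negated_class_matches($row);
```

On this sample the lazy-dot version captures `stuff</td><td>email@email.com`, while the negated-class version captures only `email@email.com`.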
(Ovid) Re: regex question (new) by Ovid (Cardinal) on Nov 10, 2000 at 21:23 UTC
Verifying e-mail addresses with a regex is not possible. It is possible to make some guesses, but a regex is not the best solution, as one has little to no control over the structure of the e-mail address. Try Email::Find. It will extract the e-mails for you. Documentation can be found here.
For the record, the following are all structurally valid e-mail addresses (though they don't go anywhere):
`Alfred Neuman <Neuman@BBN-TENEXA>`
`":sysmail"@ Some-Group. Some-Org`
`Muhammed.(I am the greatest) Ali @(the)Vegas.WBA`
The above examples were published in CGI Programming with Perl, Second Edition. If you'd like more information, check out RFC 822.
Cheers, Ovid
Join the Perlmonks Setiathome Group or just go to the link and check out our stats.
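As a rough illustration of the point about guessing, the sketch below runs the three addresses quoted above through a deliberately naive pattern. The pattern and the `looks_like_email` helper are hypothetical and not taken from any module; they stand in for the kind of regex people commonly write by hand.

```perl
use strict;
use warnings;

# The three structurally valid examples quoted above, one per element.
my @valid = (
    'Alfred Neuman <Neuman@BBN-TENEXA>',
    '":sysmail"@ Some-Group. Some-Org',
    'Muhammed.(I am the greatest) Ali @(the)Vegas.WBA',
);

# A deliberately naive, hand-rolled pattern (an assumption for illustration).
my $naive = qr/^[\w.+-]+\@[\w-]+(?:\.[\w-]+)+$/;

# Returns 1 if the naive pattern accepts the candidate, 0 otherwise.
sub looks_like_email {
    my ($candidate) = @_;
    return $candidate =~ $naive ? 1 : 0;
}

foreach my $address (@valid) {
    my $verdict = looks_like_email($address) ? 'accepted' : 'rejected';
    printf "%-50s %s\n", $address, $verdict;
}
```

All three are rejected, even though each is structurally valid, which is why a dedicated module is a safer choice than a home-grown pattern.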
RE: (Ovid) Re: regex question (new) by agoth (Chaplain) on Nov 10, 2000 at 21:31 UTC
Thanks, but I'm not even attempting to validate them at this point. I am simply trying to identify which TD pairs have emails in them, so I can extract the addresses from the file. It's the only identifier I can even vaguely rely on...
Update, from the pod:
"This module requires 5.005_63 or higher! This module runs so slow as to be unusable with 5.005 stable. I'm not sure, but it might be because I build up my search regex using lots of compiled regexes. Either way, it runs orders of magnitude faster under 5.005_63."
That blows me out of the water for starters...
RE: regex question (new) by KM (Priest) on Nov 10, 2000 at 22:35 UTC
Just to add a little to the other suggestions, you can use Email::Valid to check the MX record of the email address's domain, to further help validate it (of course, you can't know if the email address itself is valid without sending email to it). Cheers, KM
Re: regex question (new) by mirod (Canon) on Nov 10, 2000 at 21:29 UTC
If you know your email addresses don't include `<` (which is _very_ dangerous if you allow things like `Mr Bean <bean@dumb.uk>`) then you can just get the addresses with: `@emails = ($data =~ /<td>([^<@]*\@[^<]*)<\/td>/sg);` which just gets stuff with an at sign in a td.
Otherwise grab your copy of Mastering Regular Expressions (buy it if you don't have it, it's your Friend) and look for the explanations on "unrolling the loop".
Update: hey, I'm dumb or wot? If it's HTML, shouldn't the `<` be escaped into `&lt;` anyway? In that case the regexp above would work in any case!
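The "at sign inside a td" idea can also be done in two steps: collect every cell first, then keep the ones containing an `@`. The following is a minimal sketch under the assumption that cells contain no nested tags and that any literal angle brackets in the text are HTML-escaped; the sample markup and the `cells_with_at_sign` name are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical table row; the third cell uses HTML-escaped angle brackets.
my $html = '<td>stuff</td><td>email@email.com</td>'
         . '<td>Mr Bean &lt;bean@dumb.uk&gt;</td><td>more stuff</td>';

# Return the text of every <td> cell that contains an at sign.
sub cells_with_at_sign {
    my ($markup) = @_;
    # [^<]* stops at the first literal <, so a capture never spans two cells.
    my @cells = $markup =~ /<td>([^<]*)<\/td>/g;
    return grep { /\@/ } @cells;
}

print "$_\n" for cells_with_at_sign($html);
```

Splitting the work this way keeps each regex trivial, and the surviving cells can then be handed to a dedicated address parser rather than trusted as-is.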