Getting all emails

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to fetch all emails in my web pages. I need to get a count of ALL emails in all pages, so I need to find every "@company.com", "@aa.company.com" and "@bb.company.com" address and then get the total count. Here is how the emails look, with "anyname" standing in for the actual names:

anyname@company.com
anynamehere@aa.company.com
anynameagain@bb.company.com

Here is the latest I have, but I would like assistance because it doesn't seem to give me accurate counts:

use File::Find;
sub wanted
{
   if( $_ =~ /\.html?$/)
   {
       my $name = $File::Find::name;
       open ( F, $name ) or die "$!: $name\n";
           while($line = <F>)
             {
             if($line =~ /\.(?:company?|aa\.company|bb\.company)$/)
             {
                 $ct++;
                 print "Email = $1\n";
             }
             }
   close F;
   }
}
find( \&wanted, "/web/dir/" );

print "Total amount of emails found = $ct\n";


Re: Getting all emails by VSarkiss (Monsignor) on Jun 19, 2003 at 16:21 UTC
I don't know if this is just a copy-and-paste thing, but your examples are all `.company.com` and your regular expression wants the target to end in `.company`. If that's really the case, then just make the regex: `/\.(?:company?|aa\.company|bb\.company)\.com$/` Your question isn't very clear, by the way. Can you explain what the inaccuracy in the count is? Are you getting some of the addresses, or none at all, or ...? HTH
Re: Re: Getting all emails by Anonymous Monk on Jun 19, 2003 at 17:55 UTC
I rewrote it as suggested, and this regular expression doesn't give me any results.

use File::Find;
sub wanted
{
   local *F;
   if( $_ =~ /\.html?$/)
   {
       my $name = $File::Find::name;
       open ( F, $name ) or die "$!: $name\n";
       while($line = <F>)
       {
           if($line =~ /\.(?:company?|aa\.company|bb\.company)\.com$/i)
           {
               print "FILE = $_ email = $1\n";
           }
       }
       close F;
   }
}
find( \&wanted, "/dirpath/here" );

If I put in a regular expression like this: `if($line =~ /\@company\.com/)` it works, but I really need to search for hits on any of the three listed above.
Re (3): Getting all emails by VSarkiss (Monsignor) on Jun 19, 2003 at 18:53 UTC
The regex I showed before, because of the `$` at the end, will only match if the string occurs just before a newline. So if you want to match things "in the middle of a line", just take the dollar sign out. Also, in your original question the regex started with a period, not an @ sign, and it looks like that's what you're after, so you should make that correction as well. `/\@(?:company?|aa\.company|bb\.company)\.com/` should do the trick.

I'm still not sure why you have "`company?`". Is that just a typo? It will match `compan` or `company`, and it doesn't sound like you want the former. That's why I said you need to state your question more clearly.

However, if everything I've mentioned so far is correct (you want to find strings in the middle of a line, starting with an `@` sign, possibly followed by `aa.` or `bb.`, then followed by `company.com`), then here's a simpler way to just state that: `/\@(?:aa\.|bb\.)*company\.com/`
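As a rough sketch (not part of the original thread), the behaviour of the three patterns discussed above can be compared on a few made-up lines. The sample strings and the helper names `count_lines` and `count_matches` are assumptions chosen for illustration only.

```perl
use strict;
use warnings;

# Made-up sample lines; they are NOT taken from the original poster's pages.
my @lines = (
    'Contact anyname@company.com for details',
    '<a href="mailto:anynamehere@aa.company.com">mail</a>',
    'anynameagain@bb.company.com',
    'typo@compan.com',
    'first@company.com, second@bb.company.com',
);

# The three patterns from the thread, in the order they were suggested.
my $anchored = qr/\.(?:company?|aa\.company|bb\.company)\.com$/;  # leading '.', trailing '$'
my $loose    = qr/\@(?:company?|aa\.company|bb\.company)\.com/;   # leading '@', no anchor
my $simple   = qr/\@(?:aa\.|bb\.)*company\.com/;                  # simplified form

# Hypothetical helper: number of lines containing at least one match.
sub count_lines {
    my ($re, @text) = @_;
    return scalar grep { $_ =~ $re } @text;
}

# Hypothetical helper: total number of matches, counting several per line.
sub count_matches {
    my ($re, @text) = @_;
    my $total = 0;
    for my $line (@text) {
        # A /g match in list context returns every match; "= () =" counts them.
        my $n = () = $line =~ /$re/g;
        $total += $n;
    }
    return $total;
}

printf "anchored lines: %d\n", count_lines($anchored, @lines);
printf "loose lines:    %d\n", count_lines($loose,    @lines);
printf "simple lines:   %d\n", count_lines($simple,   @lines);
printf "simple matches: %d\n", count_matches($simple, @lines);
```

On these samples the anchored pattern only matches lines that end in a subdomain address, the second pattern also accepts `typo@compan.com` because of `company?`, and the simplified pattern finds 5 addresses spread over 4 lines. The last figure shows why counting lines rather than matches can undercount.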
Re: Re (3): Getting all emails by Anonymous Monk on Jun 19, 2003 at 19:49 UTC
Re: Getting all emails by Zed_Lopez (Chaplain) on Jun 19, 2003 at 18:38 UTC
my $result;
for my $file (glob "*.html") {
    open (FH,"<$file") or die;
    while (<FH>) {
        $result += (my @matches = /@(\S+\.)?company.com/g);
    }
    close FH;
}
print "$result\n";

The regexp given is fairly naive, but you can improve on it.
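One possible way to improve on it, sketched here under assumptions rather than taken from the thread: in the pattern above the unescaped `.` matches any character, and the greedy `\S+` can run across two addresses that are not separated by whitespace, so they get counted as one. The version below escapes the dots, restricts subdomain labels to `[\w-]+`, and combines the per-match counting with the `File::Find` traversal from the original question. The helper name `count_emails`, the temporary directory and the sample file contents are all hypothetical.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: count every matching address in all
# .htm/.html files below $dir (recursively).
sub count_emails {
    my ($dir) = @_;
    my $total = 0;
    find(sub {
        # File::Find chdir()s into each directory, so $_ is the bare file name.
        return unless -f $_ && /\.html?$/;
        open my $fh, '<', $_ or die "$!: $File::Find::name\n";
        while (my $line = <$fh>) {
            # Any number of subdomain labels, then company.com; dots are escaped.
            my $n = () = $line =~ /\@(?:[\w-]+\.)*company\.com\b/g;
            $total += $n;
        }
        close $fh;
    }, $dir);
    return $total;
}

# Demo data in a throwaway directory (made up for this sketch).
my $dir = tempdir(CLEANUP => 1);
my %pages = (
    'index.html' => "Write to anyname\@company.com or anynamehere\@aa.company.com\n",
    'about.htm'  => "<p>anynameagain\@bb.company.com</p>\n<p>nobody\@elsewhere.org</p>\n",
    'notes.txt'  => "ignored\@company.com\n",   # skipped: not an HTML file
);
for my $name (keys %pages) {
    open my $out, '>', "$dir/$name" or die "$!: $dir/$name\n";
    print {$out} $pages{$name};
    close $out;
}

my $count = count_emails($dir);
print "Total amount of emails found = $count\n";
```

Note that this pattern accepts any subdomain of `company.com`, as the reply above does. If only `aa.` and `bb.` should count, the `(?:aa\.|bb\.)*` form from the earlier reply could be swapped in.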