Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to find all email addresses in my web pages and get a total count across all pages. That means matching every "@company.com", "@aa.company.com", and "@bb.company.com" address. Here is how the addresses look, with "anyname" standing in for the real names:
anyname@company.com anynamehere@aa.company.com anynameagain@bb.company.com
Here is the latest I have, but I would like assistance because it doesn't seem to give me accurate counts:
use File::Find;

sub wanted {
    if( $_ =~ /\.html?$/) {
        my $name = $File::Find::name;
        open ( F, $name ) or die "$!: $name\n";
        while($line = <F>) {
            if($line =~ /\.(?:company?|aa\.company|bb\.company)$/) {
                $ct++;
                print "Email = $1\n";
            }
        }
        close F;
    }
}

find( \&wanted, "/web/dir/" );
print "Total amount of emails found = $ct\n";

Re: Getting all emails
by VSarkiss (Monsignor) on Jun 19, 2003 at 16:21 UTC

    I don't know if this is just a copy-and-paste thing, but your examples are all .company.com and your regular expression wants the target to end in .company. If that's really the case, then just make the regex: /\.(?:company?|aa\.company|bb\.company)\.com$/

    Your question isn't very clear, by the way. Can you explain what the inaccuracy in the count is? Are you getting some of the addresses, or none at all, or ...?

    HTH

      I rewrote it as suggested, and this regular expression doesn't give me any results.
      use File::Find;

      sub wanted {
          local *F;
          if( $_ =~ /\.html?$/) {
              my $name = $File::Find::name;
              open ( F, $name ) or die "$!: $name\n";
              while($line = <F>) {
                  if($line =~ /\.(?:company?|aa\.company|bb\.company)\.com$/i) {
                      print "FILE = $_ email = $1\n";
                  }
              }
              close F;
          }
      }

      find( \&wanted, "/dirpath/here" );
      If I put in a regular expression like this: if($line =~ /\@company\.com/) it works, but I really need to search for hits on any of the three listed above.

        The regex I showed before, because of the $ at the end, will only match if the string occurs at the very end of a line, just before the newline. So if you want to match things "in the middle of a line", just take the dollar sign out. Also, in your original question the regex started with a period, not an @ sign, and it looks like an @ sign is what you're after, so you should make that correction as well. /\@(?:company?|aa\.company|bb\.company)\.com/ should do the trick. I'm still not sure why you have "company?". Is that just a typo? It will match compan or company, and it doesn't sound like you want the former. That's why I said you need to state your question more clearly.
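To make the effect of the trailing $ concrete, here is a minimal sketch. The two sample lines are made up for illustration, and "company?" is written as "company" on the assumption that the question mark was a typo.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the addresses are made up for illustration.
my @lines = (
    "Contact: anyname\@company.com\n",              # address at end of line
    "Mail anyname\@aa.company.com for details.\n",  # address mid-line
);

# Same alternation with and without the end-of-line anchor.
my $anchored   = qr/\@(?:company|aa\.company|bb\.company)\.com$/;
my $unanchored = qr/\@(?:company|aa\.company|bb\.company)\.com/;

# grep in scalar context returns the number of lines that match.
my $anchored_hits   = grep { $_ =~ $anchored } @lines;
my $unanchored_hits = grep { $_ =~ $unanchored } @lines;

print "anchored: $anchored_hits, unanchored: $unanchored_hits\n";
# prints "anchored: 1, unanchored: 2"
```

Only the line that ends with the address survives the anchored pattern, which would explain getting few or no hits on real HTML.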

        However, if everything I've mentioned so far is correct (you want to find strings in the middle of a line, starting with an @ sign, possibly followed by aa. or bb., then followed by company.com), then here's a simpler way to state that: /\@(?:aa\.|bb\.)?company\.com/
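One possible source of an inaccurate count is that an if test adds at most one hit per line, even when a line holds several addresses. The sketch below counts every match on the sample line from the question. It makes the subdomain group optional with ? so that at most one aa. or bb. prefix is accepted; the variable names are my own.

```perl
use strict;
use warnings;

# Optional "aa." or "bb." prefix, then company.com.
my $email_re = qr/\@(?:aa\.|bb\.)?company\.com/;

# The sample line from the question (single quotes, so @ is not interpolated).
my $line = 'anyname@company.com anynamehere@aa.company.com anynameagain@bb.company.com';

# A plain "if ($line =~ ...)" would add only 1 for this line.
# The "() =" idiom forces list context so /g returns every match,
# and the outer scalar assignment yields the number of matches.
my $count = () = $line =~ /$email_re/g;

print "Found $count addresses\n";
# prints "Found 3 addresses"
```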

Re: Getting all emails
by Zed_Lopez (Chaplain) on Jun 19, 2003 at 18:38 UTC
    my $result;
    for my $file (glob "*.html") {
        open (FH,"<$file") or die;
        while (<FH>) {
            $result += (my @matches = /@(\S+\.)?company.com/g);
        }
        close FH;
    }
    print "$result\n";
    The regexp given is fairly naive, but you can improve on it.
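As one way to improve on it, here is a sketch that combines the File::Find walk from the question with global matching. It is not a definitive solution: the pattern assumes only the aa. and bb. subdomains matter, count_emails is a name I made up, and the demo runs against a throwaway temporary directory rather than the real web root.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Tighter pattern than /@(\S+\.)?company.com/: dots are escaped and only
# the "aa." and "bb." subdomains from the question are accepted.
my $email_re = qr/\@(?:aa\.|bb\.)?company\.com\b/;

# Count every matching address in all .htm/.html files below $dir.
sub count_emails {
    my ($dir) = @_;
    my $total = 0;
    find(
        sub {
            # File::Find chdirs into each directory, so $_ is the basename.
            return unless -f $_ && /\.html?$/;
            open my $fh, '<', $_ or die "$!: $File::Find::name\n";
            while ( my $line = <$fh> ) {
                # List context makes /g return all matches on the line.
                my $hits = () = $line =~ /$email_re/g;
                $total += $hits;
            }
            close $fh;
        },
        $dir
    );
    return $total;
}

# Demo on a throwaway directory; the file name and contents are made up.
my $dir = tempdir( CLEANUP => 1 );
open my $out, '>', "$dir/index.html" or die "$!: $dir/index.html\n";
print {$out} "a\@company.com b\@aa.company.com\n<p>c\@bb.company.com</p>\n";
close $out;

my $total = count_emails($dir);
print "Total amount of emails found = $total\n";
# prints "Total amount of emails found = 3"
```

For the real site, call count_emails on the web directory instead of the temporary one. Note that this counts occurrences, not distinct addresses, so an address repeated in a mailto: link and its link text is counted twice.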