Bugorr has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks. I'll start my question as many others do: my problem is very simple, but I do not know how to solve it.
Here it is:

I have a file with a bunch of email addresses separated by some characters:
(excuse long line)
^Pbishop@yahho.com^H17769025^D3352^Vblueangel@accessmo.com^H17769714^D3352^Oboe@stooges.com^H17773126^D3352^Mbirk@joke.com^H17773968^D3352^Rbobfitz@mcione.com^H17768877^D3352^Nbob@yohaoo.com^H17769806^D3352^R


I need to count the email addresses and find the top 5 domains in this file.

Please help with this little exercise.
Thank you in advance.
Bugorr

20051205 Janitored by Corion: Added code tags around sample data

Replies are listed 'Best First'.
Re: Cout & parsing
by ikegami (Patriarch) on Dec 05, 2005 at 21:30 UTC
    use strict;
    use warnings;

    use Email::Address ();

    my @fields;
    {
        open(my $fh_in, '<', ...)
            or die("Unable to open input file: $!\n");

        # The input is a binary file.
        binmode($fh_in);

        for (;;) {
            my $buf;

            # Obtain the length of the field.
            read($fh_in, $buf='', 1)
                or last;
            my $len = ord($buf);

            # Obtain the field.
            read($fh_in, $buf='', $len)
                or die("Bad input file\n");

            push(@fields, $buf);
        }
    }

    @fields % 3 == 0
        or die("Bad input file\n");

    # Get every third field, starting with the first:
    my @email_addrs = do { my $n = 3; my $t = $n-1; grep !($t=($t+1)%$n), @fields };

    # Convert to Email::Address objects.
    # Group by host.
    my %grouped;
    foreach my $email_addr (@email_addrs) {
        foreach my $o (Email::Address->parse($email_addr)) {
            my $host = lc($o->host());
            push(@{$grouped{$host}}, $o);
        }
    }

    # Determine the most common hosts.
    my @highest = sort { @{$grouped{$b}} <=> @{$grouped{$a}} } keys %grouped;

    # Filter out unwanted. Keep ties.
    if (@highest >= 5) {
        my $min = @{$grouped{$highest[4]}};
        @highest = grep { @{$grouped{$_}} >= $min } @highest;
    }

    # Print results.
    foreach my $host (@highest) {
        my $count = @{$grouped{$host}};
        print("$host ($count)\n");
    }

    Untested.

Re: Cout & parsing
by GrandFather (Saint) on Dec 05, 2005 at 22:02 UTC

    Here's a start:

    use warnings;
    use strict;

    my $s='^Pbishop@yahho.com^H17769025^D3352^Vblueangel@accessmo.com^H17769714^'.
          'D3352^Oboe@stooges.com^H17773126^D3352^Mbirk@joke.com^H17773968^D3352^'.
          'Rbobfitz@mcione.com^H17768877^D3352^Nbob@yohaoo.com^H17769806^D3352^R';
    $.=0;
    map{my($f)=/@(.*?)\./;$.++;$.{$f}++}grep{/@/}split/\^/,$s;
    print"Addresses: $.\n",join"\n",map{"$_: $.{$_}"}keys(%.);

    Prints:

    Addresses: 6
    stooges: 1
    joke: 1
    yahho: 1
    yohaoo: 1
    accessmo: 1
    mcione: 1

    DWIM is Perl's answer to Gödel

      ^P (for example) is not the character ^ followed by the character P. It's the single character Ctrl-P (chr 16). The OP just posted the output of a program that represents chr 16 as ^P. When the strings were stored, they were prefixed by their length. 16 is the length of bishop@yahho.com.
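
      To make the length-prefix reading concrete, here is a minimal sketch that checks the caret notation against the sample. The helper name caret_to_byte is made up for illustration, and the sketch assumes the usual convention that ^X denotes ord('X') XOR 64.

```perl
use strict;
use warnings;

# Hypothetical helper: map the letter in caret notation (^P, ^H, ...)
# to the byte value it stands for, assuming ^X means ord('X') ^ 64.
sub caret_to_byte {
    my ($letter) = @_;
    return ord($letter) ^ 64;
}

my $addr = 'bishop@yahho.com';
printf "^P is byte %d; '%s' is %d characters long\n",
    caret_to_byte('P'), $addr, length $addr;

# One field as the file appears to store it: a length byte, then the data.
my $field = chr(length $addr) . $addr;
printf "stored field is %d bytes\n", length $field;
```

      The same check works for the other fields in the sample: ^H is 8, the length of 17769025, and ^D is 4, the length of 3352.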

      The length might not always be a byte. In some libraries (such as C++ STL, I think), the size of the field varies to accommodate strings longer than 255 characters.
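
      As a sketch of what a wider prefix would mean in practice: if each length were stored as a 16-bit big-endian integer (purely an assumption for illustration, nothing in the sample suggests it), only the length code in the unpack template would change.

```perl
use strict;
use warnings;

# Assumption for illustration: each field is prefixed by a 16-bit
# big-endian length ("n") rather than a single byte ("C").
my @values = ('bob@yohaoo.com', '17769806', '3352');
my $data   = join '', map { pack('n', length $_) . $_ } @values;

# "n/a" reads a 16-bit length, then that many bytes of string data;
# the group with "*" repeats until the input is exhausted.
my @fields = unpack '(n/a)*', $data;
print "$_\n" for @fields;
```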

      Also, email addresses are not as simple as you assume. They may contain multiple "@", for starters.

        It looks like a job for unpack. Inspect the format in detail - if all the length encodings are a single byte, and there are only email addresses and numbers (as there are in your example) then unpacking the binary format makes more sense than casting to a string and regexing your way through.

        my $str = << 'EOT';
        ^Pbishop@yahho.com^H17769025^D3352^Vblueangel@acc
        essmo.com^H17769714^D3352^Oboe@stooges.com^H17773
        126^D3352^Mbirk@joke.com^H17773968^D3352^Rbobfitz
        @mcione.com^H17768877^D3352^Nbob@yohaoo.com^H1776
        9806^D3352^R
        EOT

        # stitch and recast control characters
        # use the original binary format in reality
        $str =~ s/\s+//g;
        $str =~ s/\^([A-Z])/chr( ord($1) & 0xBF )/ge;

        my @emails = grep m/@/, unpack "(c/a)*", $str;
        print "$_\n" for @emails;

        If the data is indeed formatted the way it appears to be, this avoids all sorts of unpleasantness when email addresses are 32 characters or longer, and the lengths cease to be in the range of control characters.
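
        To finish the original task from there, a minimal sketch of the counting step follows. It assumes single-byte length prefixes, uses the unsigned "C" code so lengths up to 255 work, and takes the domain to be everything after the last "@". The helper name top_domains and the rebuilt sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to $limit [domain, count] pairs,
# most frequent first, ties broken alphabetically.
sub top_domains {
    my ($limit, @addresses) = @_;
    my %count;
    for my $addr (@addresses) {
        # Treat everything after the last "@" as the domain.
        my $at = rindex($addr, '@');
        next if $at < 0;
        $count{ lc(substr($addr, $at + 1)) }++;
    }
    my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    splice(@sorted, $limit) if @sorted > $limit;
    return map { [ $_, $count{$_} ] } @sorted;
}

# Rebuild binary data shaped like the sample: every field is
# preceded by one byte holding its length.
my @sample = (
    'bishop@yahho.com',       '17769025', '3352',
    'blueangel@accessmo.com', '17769714', '3352',
    'boe@stooges.com',        '17773126', '3352',
    'birk@joke.com',          '17773968', '3352',
    'bobfitz@mcione.com',     '17768877', '3352',
    'bob@yohaoo.com',         '17769806', '3352',
);
my $data = join '', map { chr(length $_) . $_ } @sample;

# Unpack all fields, then keep the ones that look like addresses.
my @emails = grep { /@/ } unpack '(C/a)*', $data;

printf "Addresses: %d\n", scalar @emails;
printf "%s: %d\n", @$_ for top_domains(5, @emails);
```

        Unlike the earlier reply that keeps ties, this sketch cuts the list off at exactly five entries; which behaviour is wanted depends on the report.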

Re: Cout & parsing
by dorward (Curate) on Dec 05, 2005 at 21:11 UTC

    What have you got so far? Where have you got stuck? (i.e. Please Show Some Effort).