Cout & parsing

Bugorr has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Cout & parsing by ikegami (Patriarch) on Dec 05, 2005 at 21:30 UTC
    use strict;
    use warnings;

    use Email::Address ();

    my @fields;
    {
        open(my $fh_in, '<', ...)
            or die("Unable to open input file: $!\n");

        # The input is a binary file.
        binmode($fh_in);

        for (;;) {
            my $buf;

            # Obtain the length of the field.
            read($fh_in, $buf='', 1)
                or last;
            my $len = ord($buf);

            # Obtain the field.
            read($fh_in, $buf='', $len)
                or die("Bad input file\n");

            push(@fields, $buf);
        }
    }

    @fields % 3 == 0
        or die("Bad input file\n");

    # Get every third field, starting with the first:
    my @email_addrs = do {
        my $n = 3;
        my $t = $n-1;
        grep !($t=($t+1)%$n), @fields
    };

    # Convert to Email::Address objects.
    # Group by host.
    my %grouped;
    foreach my $email_addr (@email_addrs) {
        foreach my $o (Email::Address->parse($email_addr)) {
            my $host = lc($o->host());
            push(@{$grouped{$host}}, $o);
        }
    }

    # Determine the most common hosts.
    my @highest = sort { @{$grouped{$b}} <=> @{$grouped{$a}} } keys %grouped;

    # Filter out unwanted. Keep ties.
    if (@highest >= 5) {
        my $min = @{$grouped{$highest[4]}};
        @highest = grep { @{$grouped{$_}} >= $min } @highest;
    }

    # Print results.
    foreach my $host (@highest) {
        my $count = @{$grouped{$host}};
        print("$host ($count)\n")
    }

Untested.
Re: Cout & parsing by GrandFather (Saint) on Dec 05, 2005 at 22:02 UTC
Here's a start:

    use warnings;
    use strict;

    my $s='^Pbishop@yahho.com^H17769025^D3352^Vblueangel@accessmo.com^H17769714^'.
    'D3352^Oboe@stooges.com^H17773126^D3352^Mbirk@joke.com^H17773968^D3352^'.
    'Rbobfitz@mcione.com^H17768877^D3352^Nbob@yohaoo.com^H17769806^D3352^R';

    $.=0;
    map{my($f)=/@(.*?)\./;$.++;$.{$f}++}grep{/@/}split/\^/,$s;
    print"Addresses: $.\n",join"\n",map{"$_: $.{$_}"}keys(%.);

Prints:

    Addresses: 6
    stooges: 1
    joke: 1
    yahho: 1
    yohaoo: 1
    accessmo: 1
    mcione: 1

DWIM is Perl's answer to Gödel
Re^2: Cout & parsing by ikegami (Patriarch) on Dec 05, 2005 at 22:28 UTC
`^P` (for example) is not the character `^` followed by the character `P`. It's the single character Ctrl-P (`chr 16`). The OP just posted the output of a program that represents `chr 16` as `^P`.

When the strings were stored, they were prefixed by their length. 16 is the length of `bishop@yahho.com`. The length might not always be a single byte. In some libraries (such as C++ STL, I think), the size of the length field varies to accommodate strings longer than 255 characters.

Also, email addresses are not as simple as you assume. They may contain multiple "@", for starters.
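To make the length-prefix point concrete, here is a minimal sketch. It assumes the length is stored in a single unsigned byte, which the thread only suspects, and the helper names `caret_to_length` and `length_prefixed` are made up for illustration.

```perl
use strict;
use warnings;

# Turn caret notation such as "^P" into the number it stands for.
# Caret notation shows chr(N) as "^" followed by chr(N + 64),
# so "^P" (ord("P") == 80) stands for chr(16).
sub caret_to_length {
    my ($caret) = @_;
    my ($letter) = $caret =~ /\A\^([A-Z])\z/
        or die "Not caret notation: $caret\n";
    return ord($letter) - 64;
}

# Build one field the way the input file appears to store it.
# Assumption: the length prefix is one unsigned byte ("C").
sub length_prefixed {
    my ($string) = @_;
    return pack('C/a*', $string);
}

my $field = length_prefixed('bishop@yahho.com');
printf "length byte: %d, caret value: %d\n",
    ord(substr($field, 0, 1)), caret_to_length('^P');
```

Both numbers come out as 16, which is why the dump shows `^P` in front of that address.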
Re^3: Cout & parsing by fishbot_v2 (Chaplain) on Dec 06, 2005 at 00:25 UTC
It looks like a job for `unpack`. Inspect the format in detail: if all the length encodings are a single byte, and there are only email addresses and numbers (as there are in your example), then unpacking the binary format makes more sense than casting to a string and regexing your way through.

    my $str = << 'EOT';
    ^Pbishop@yahho.com^H17769025^D3352^Vblueangel@acc
    essmo.com^H17769714^D3352^Oboe@stooges.com^H17773
    126^D3352^Mbirk@joke.com^H17773968^D3352^Rbobfitz
    @mcione.com^H17768877^D3352^Nbob@yohaoo.com^H1776
    9806^D3352^R
    EOT

    # stitch and recast control characters
    # use the original binary format in reality
    $str =~ s/\s+//g;
    $str =~ s/\^([A-Z])/chr( ord($1) & 0xBF )/ge;

    my @emails = grep m/@/, unpack "(c/a)*", $str;
    print "$_\n" for @emails;

If the data is indeed formatted the way it appears to be, this avoids all sorts of unpleasantness when emails are 32 characters or longer, and the lengths cease to be in the range of control characters.
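A variation on the `unpack` approach, as a sketch only: `c` is a signed char in `pack`/`unpack` templates, so a length byte above 127 would be read as a negative number, whereas `C` reads it as unsigned (0..255). The example below assumes single-byte lengths and the (address, number, number) record layout seen in the posted sample; `split_fields` is a hypothetical helper and the two records are copied from that sample.

```perl
use strict;
use warnings;

# Split a buffer of length-prefixed fields into a list of strings.
# Assumption: each length is one unsigned byte, hence "C" rather than "c".
sub split_fields {
    my ($data) = @_;
    return unpack('(C/a)*', $data);
}

# Rebuild two records from the posted sample in their binary form.
my $data = join '', map { pack('C/a*', $_) } (
    'bishop@yahho.com', '17769025', '3352',
    'boe@stooges.com',  '17773126', '3352',
);

my @fields = split_fields($data);

# Assumption: every third field, starting with the first, is an address.
my @emails = @fields[ grep { $_ % 3 == 0 } 0 .. $#fields ];
print "$_\n" for @emails;
```

Selecting addresses by position rather than with `grep m/@/` does not depend on what characters the other fields happen to contain, but it only holds if every record really has three fields.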
Re^4: Cout & parsing by ikegami (Patriarch) on Dec 06, 2005 at 00:58 UTC
Re: Cout & parsing by dorward (Curate) on Dec 05, 2005 at 21:11 UTC
What have you got so far? Where have you got stuck? (i.e. Please Show Some Effort).