sivaraman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have 5 million lines of data in the format below.

abcd,US
abee,UK
abcd,US
adee,US

Here I need to print the output like:

2 abcd US
1 abee UK
1 adee US
Kindly advise me how to achieve this using Perl. Thanks in advance.

Replies are listed 'Best First'.
Re: Dear Monks
by bart (Canon) on Mar 16, 2011 at 12:51 UTC
    Like this.
    my %count;
    while (<DATA>) {
        chomp;
        tr/,/\t/;        # replace the comma with a tab for the output
        $count{$_}++;    # tally identical records
    }
    foreach (sort keys %count) {
        print "$count{$_}\t$_\n";
    }
    __DATA__
    abcd,US
    abee,UK
    abcd,US
    adee,US
    Replace <DATA> with "<>" to run it for real, and either pipe in the data file or pass the filename as an argument to the script.
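    For example, assuming the script is saved as count.pl and the data is in data.txt (both names are placeholders):

        $ perl count.pl data.txt
        2       abcd    US
        1       abee    UK
        1       adee    US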
Re: Sorting and counting 5 Million lines
by MidLifeXis (Monsignor) on Mar 16, 2011 at 13:05 UTC

    If you have the option, it may be worth comparing a unix shell solution:

    sed -e 's/,/ /' < data | sort | uniq -c
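    On the sample data above, that prints something like (uniq -c pads the counts with leading spaces):

          2 abcd US
          1 abee UK
          1 adee US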

    The simplest Perl solution (a hash) may take up an obnoxious amount of memory. The OS sort utility under unix uses temporary files to solve the "sort this large chunk of data" problem, and may scale better. A lot depends on the distribution of your keys.

    Update: Given the updated requirements, the shell oneliner is no longer appropriate. If there is a memory constraint issue, look for something like DBM::Deep.
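    A minimal sketch of the DBM::Deep approach (the filename counts.db is arbitrary); the hash is tied to a file, so the counts live on disk instead of in RAM:

        use strict;
        use warnings;
        use DBM::Deep;

        # Disk-backed hash: entries are written to counts.db, not held in memory.
        my $db = DBM::Deep->new("counts.db");

        while (<>) {
            chomp;
            tr/,/\t/;
            $db->{$_}++;
        }

        # Caveat: sort still pulls all of the keys into memory at once.
        print "$db->{$_}\t$_\n" for sort keys %$db;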

    --MidLifeXis

      It's a long "one-liner" (broken up for readability), but I think this meets the requirements:
      sort <<EOT | tr ',' ' ' | uniq -c | awk '{
          if ( tag != $2 ) {
              if ( tot > 0 ) { print tot, tag, countries }
              tot = $1; tag = $2; countries = $3; next;
          }
          tot += $1; countries = countries "," $3;
      }
      END { print tot, tag, countries }'
      abcd,US
      abee,UK
      abcd,US
      adee,US
      adee,UK
      EOT
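      On that input it should print the following (the extra adee,UK line means adee accumulates two countries):

          2 abcd US
          1 abee UK
          2 adee UK,US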
Re: Dear Monks
by sundialsvc4 (Abbot) on Mar 16, 2011 at 15:44 UTC

    “Holy COBOL, Batman!   I do believe that there is Yet Another Way To Do It!”

    Consider sorting the input file.   (And please don’t imagine that “five million lines” is really “a particularly large file” for a modern digital computer.   The job will be done in a small handful of seconds.)

    When you do that, now all occurrences of all records which have the same key will be adjacent, so that you can design the entire algorithm to work by comparing “the record that we have just read from the input file” to “a saved copy of the previous record, if any, that we have read.”   Now, suddenly, the algorithm does not require obnoxious amounts of memory ... in fact, it requires hardly any memory at all.   You only need enough memory to hold two records:   this one, and the preceding one.
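    A minimal sketch of that read-and-compare loop, assuming the input has already been sorted (e.g. sort data.txt | perl count_sorted.pl; the script name is a placeholder):

        use strict;
        use warnings;

        my ($prev, $count);
        while (my $line = <>) {
            chomp $line;
            if (defined $prev && $line eq $prev) {
                $count++;                      # same record as before: keep counting
            }
            else {
                print_group($prev, $count) if defined $prev;
                ($prev, $count) = ($line, 1);  # a new record starts a new group
            }
        }
        print_group($prev, $count) if defined $prev;   # flush the final group

        # Split "tag,country" back apart for the requested output format.
        sub print_group {
            my ($record, $n) = @_;
            my ($tag, $country) = split /,/, $record;
            print "$n $tag $country\n";
        }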

    For more information about this technique ... which, by the way, is considerably older than digital computers ... google the term, “unit record processing.”   This is what IBM was doing with all those punched-cards, and this is also what was being done in all those sci-fi movies with the rapidly spinning tapes.   Of course, the original onus for doing things this way was because you didn’t have enough memory (or computing horsepower) available to do it any other way.   Nevertheless, “old” though the idea might be, it is still one of the most powerful concepts in data processing, and it is extremely relevant (although usually overlooked) today.   Dr. Knuth called one of his volumes, Sorting and Searching, for a very solid reason.