Re: Sorting and counting 5 Million lines

If you have the option, it may be worth comparing a unix shell solution:

sed -e 's/,/ /' < data | sort | uniq -c

The simplest Perl solution (a hash keyed on each line) may take up an obnoxious amount of memory. The system sort under Unix uses temporary files to solve the "sort this large chunk of data" problem, so it may scale better. A lot depends on the distribution of your keys: the hash needs one entry per distinct key, so many repeated keys keep it small, while mostly unique keys make it grow with the input.
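For comparison, here is a rough sketch of both approaches side by side, with awk's associative array standing in for the hash. The sample records and the variable names are made up for illustration, and `-T` (which picks the directory for sort's temporary files) is an option of GNU and BSD sort rather than something every sort is guaranteed to have.

```shell
# Hypothetical sample input: one "tag,country" record per line.
data=$(mktemp)
printf '%s\n' abcd,US abee,UK abcd,US adee,US adee,UK > "$data"

# In-memory approach: one array entry per distinct line, so memory
# grows with the number of distinct keys, not the number of lines.
hash_counts=$(awk -F, '{ count[$1 " " $2]++ }
  END { for (k in count) print count[k], k }' "$data" | sort -k2)

# Sort-based approach: sort may spill to temporary files in the -T
# directory; the final awk only strips the padding that uniq -c adds.
sort_counts=$(sed -e 's/,/ /' < "$data" | sort -T "${TMPDIR:-/tmp}" |
  uniq -c | awk '{ print $1, $2, $3 }' | sort -k2)

printf '%s\n' "$hash_counts"
rm -f "$data"
```

Both variables should end up holding the same four count lines; which one is faster or lighter on your real data is something to measure rather than assume.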

Update: Given the updated requirements, the shell one-liner is no longer appropriate. If memory is the constraint, look at something like DBM::Deep.

--MidLifeXis


Replies are listed 'Best First'.
Re^2: Sorting and counting 5 Million lines by runrig (Abbot) on Mar 17, 2011 at 15:03 UTC
It's a long "one-liner" (broken up for readability), but I think this meets the requirements:

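sort <<EOT | tr ',' ' ' | uniq -c | awk '{
  # Input lines look like: count tag country
  if ( tag != $2 ) {
    # New tag: print the finished group, then start a new one
    if ( tot > 0 ) { print tot, tag, countries }
    tot=$1; tag=$2; countries=$3;
    next;
  }
  # Same tag: add to the total and append the country
  tot += $1;
  countries = countries "," $3;
}
END { print tot, tag, countries }'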
abcd,US
abee,UK
abcd,US
adee,US
adee,UK
EOT
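To run the same pipeline against a file instead of a here-document, it could be wrapped in a shell function. This is only a sketch: the function name `count_by_tag`, the file argument, and the guard in the END block (so empty input prints nothing) are my assumptions and not part of the post above.

```shell
# count_by_tag FILE: hypothetical wrapper around the pipeline above.
# Expects "tag,country" lines; prints "total tag country1,country2,...".
count_by_tag() {
  sort "$1" | tr ',' ' ' | uniq -c | awk '
    tag != $2 {                       # first line of a new tag
      if ( tot > 0 ) print tot, tag, countries
      tot = $1; tag = $2; countries = $3
      next
    }
    { tot += $1; countries = countries "," $3 }   # same tag, next country
    END { if ( tot > 0 ) print tot, tag, countries }'
}

# Same sample records as the here-document, written to a temp file.
sample=$(mktemp)
printf '%s\n' abcd,US abee,UK abcd,US adee,US adee,UK > "$sample"
result=$(count_by_tag "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

For the five sample records this prints `2 abcd US`, `1 abee UK` and `2 adee UK,US`, one per line.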