In reply to: speed up one-line "sort|uniq -c" perl code
In some fairly crude testing on a file of 500,000 randomly generated lines (18MB) matching the pattern of data you described, this seems to run about four times as fast: 40 seconds rather than 160.
    no warnings; $^W = 0;

    open my $in, $ARGV[0] or die "Couldn't open $ARGV[0]: $!";

    my( $buffer, %h ) = '';
    keys %h = 1024 * 500;    # preallocate hash buckets up front

    while( sysread( $in, $buffer, 16384, length $buffer ) ) {
        # Count occurrences of the 10th pipe-delimited field on each
        # complete line currently in the buffer.
        $h{ $1 }++ while $buffer =~ m[^(?:.+?\|){9}([^|]+)\|]mg;

        # Keep only the trailing partial line (from the last newline on)
        # so the next sysread can complete it.
        $buffer = substr( $buffer, rindex( $buffer, "\n" ) );
    }

    print scalar keys %h;
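For reference, the regex assumes pipe-delimited records and captures the tenth field, requiring a pipe after it. A made-up line of the shape it matches (the real field contents are whatever your data holds) would be:

    f1|f2|f3|f4|f5|f6|f7|f8|f9|key-to-count|anything else...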
You may need to adjust the regex, and you might squeeze some more out by playing with the read size and/or the hash preallocation. In my tests, results from both varied, but the differences were within the bounds of run-to-run error. That goes especially for the latter, as I am not really sure how the preallocated bucket count relates to the number of keys.
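If you want to experiment with those two knobs, a throwaway harness along these lines would do. This is only a sketch under my own assumptions (arbitrary block sizes and bucket counts, same file and regex as above); note that the first pass warms the OS file cache, so repeat the runs before trusting the numbers:

    use strict;
    use Time::HiRes qw( time );

    for my $block ( 4096, 16384, 65536 ) {
        for my $buckets ( 0, 1024 * 500 ) {
            open my $in, '<', $ARGV[0] or die "Couldn't open $ARGV[0]: $!";
            my( $buffer, %h ) = '';
            keys( %h ) = $buckets if $buckets;    # optional preallocation
            my $t0 = time;
            while( sysread( $in, $buffer, $block, length $buffer ) ) {
                $h{ $1 }++ while $buffer =~ m[^(?:.+?\|){9}([^|]+)\|]mg;
                $buffer = substr( $buffer, rindex( $buffer, "\n" ) );
            }
            printf "block=%6d buckets=%7d keys=%d time=%.2fs\n",
                $block, $buckets, scalar keys %h, time() - $t0;
            close $in;
        }
    }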
Re^2: speed up one-line "sort|uniq -c" perl code
by relaxed137 (Acolyte) on Apr 16, 2003 at 00:28 UTC