in reply to The need for speed
First of all, kudos to you for providing some data and especially for having taken the time to sanitise it. ++ for that alone.
So, you're sorting 1 000 000 records in 30 seconds? That isn't too shabby, you know? Especially given their length. Depending on how much RAM your machine has, you may be paging out to disk. As I mentioned in a response to a similar question at Fast/Efficient Sort for Large Files, for that kind of volume you want to evaluate replacing Perl's sort with a call to the external sort command.
Sorting textual IP addresses is non-trivial: you'll have to write a preprocessor that reads the log, isolates the IP address on each line, packs it with inet_aton, prepends the packed form to the line and writes the result out. You can then sort the file with no special command line switches. Consider it a meta-Guttman-Rosler Transform if you will.
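A minimal sketch of such a preprocessor, under a few assumptions of mine: the first dotted quad on a line is the address to sort by, and the names ip_sort_key, decorate and undecorate are invented for illustration. One deviation from the description above: the raw bytes from inet_aton can include a newline (any octet equal to 10), which would split the record for a line-oriented sort, so the sketch hex-encodes the packed address. Fixed-width hex compares in the same order as the raw bytes.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Key used for lines without a usable address, so that they sort last.
my $NO_IP = 'ffffffff';

# Return an 8-character hex key for the first dotted quad on the line.
sub ip_sort_key {
    my ($line) = @_;
    my ($ip) = $line =~ /\b(\d{1,3}(?:\.\d{1,3}){3})\b/;
    return $NO_IP unless defined $ip;
    # Reject octets above 255 so inet_aton never falls back to a name lookup.
    return $NO_IP if grep { $_ > 255 } split /\./, $ip;
    return unpack 'H8', inet_aton($ip);
}

# Prepend the key plus a tab; strip it again after sorting.
sub decorate   { my ($line) = @_; return ip_sort_key($line) . "\t" . $line }
sub undecorate { my ($line) = @_; $line =~ s/^[0-9a-f]{8}\t//; return $line }

# Small in-memory demonstration (made-up records). A plain string
# comparison of the keys is all the ordering logic that remains.
my @log = (
    "192.168.1.20 delivered",
    "10.0.0.5 bounced",
    "9.255.0.1 delivered",
);
my @sorted = map { undecorate($_) }
             sort { $a cmp $b }
             map { decorate($_) } @log;
print "$_\n" for @sorted;
```

For a real log you would print the decorated lines and let the external command do the work, something along the lines of `perl decorate.pl < mail.log | LC_ALL=C sort | cut -f2-` (the script and file names are hypothetical). Setting LC_ALL=C is a precaution so that sort compares bytes rather than applying locale collation rules.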
You've switched off output buffering on the currently selected filehandle with $| = 1, so every print is flushed immediately. You ought to remove that line and see if it makes a difference.
Other than that, there's nothing really glaring that you seem to be doing wrong. I'm not too sure about your choice of variable names: with something like $tmp_c, it's hard to guess the purpose. I also have a hard time fathoming what all those grep chains are for.
I'm not sure if the records lend themselves to the following approach, but if a record can only be counted in one single way, you could write a splitter filter that takes the one input file, opens up as many output files as there are categories, and writes the record to the correct category file. Then you write a series of filters to deal with the separate files.
There are a number of advantages to this approach. All the consistency checking goes in the splitter. The downstream filters need less error checking in them as they're dealing with a restricted range of records. If you find a bug in one category, you only have to fix its reporting script and rerun it, rather than rerunning the whole batch. And smaller files put less of a strain on the system, so you might pick up a few seconds here and there.
Categories might be number of recipients / inbound / outbound / garbage. Note that depending on which dimensions you choose, a record could be written to more than one category file, for example a message sent to both internal and external recipients.
When you split out to different files, you should strip out all extraneous data you don't want to play with. This means that in the reporting scripts you won't have as large a dataset flowing across your bus. Less I/O will improve your score.
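Here is a rough sketch of the splitter idea. I don't know your real record layout, so everything about the format is an assumption: tab-separated fields of direction, sender, recipient count, then noise the reports don't need. The names categorize and split_records and the category file names are made up too. It handles the simple case where each record lands in exactly one category.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical layout: direction <TAB> sender <TAB> recipient count <TAB> noise
# Returns (category, trimmed record). All consistency checks live here.
sub categorize {
    my ($line) = @_;
    my ($direction, $sender, $rcpts) = split /\t/, $line;
    return ('garbage', $line)
        unless defined $rcpts
            && $direction =~ /^(?:inbound|outbound)$/
            && $rcpts =~ /^\d+$/;
    # Keep only the fields the reporting scripts will need.
    return ($direction, join "\t", $sender, $rcpts);
}

# Read records from $in and write each one to <category>.txt in $dir.
sub split_records {
    my ($in, $dir) = @_;
    my (%out, %count);
    while (my $line = <$in>) {
        chomp $line;
        my ($category, $record) = categorize($line);
        # Open each category file the first time it is needed.
        unless ($out{$category}) {
            my $path = File::Spec->catfile($dir, "$category.txt");
            open my $fh, '>', $path or die "Cannot write $path: $!";
            $out{$category} = $fh;
        }
        print { $out{$category} } "$record\n";
        $count{$category}++;
    }
    for my $fh (values %out) {
        close $fh or die "close failed: $!";
    }
    return \%count;
}

# Demonstration on three made-up records read from an in-memory filehandle.
my $sample = join '',
    "inbound\talice\@example.com\t1\tqueue-id=abc\n",
    "outbound\tbob\@example.org\t3\tqueue-id=def\n",
    "???\n";
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
my $dir    = tempdir(CLEANUP => 1);
my $counts = split_records($in, $dir);
printf "%-8s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Each reporting script then reads only its own category file and can assume the fields are already validated. If you pick dimensions where a record belongs to several categories, categorize would have to return a list of categories and the loop would print the record once per entry.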