ok has asked for the wisdom of the Perl Monks concerning the following question:

Trying to write some code that will sort a bunch of log entries by date. It will run like this:
cat crap | mysort > sorted_crap
Since there are many log entries per second, the plan is to push each entry onto an array keyed by the Unix time, then print each entry for (sort keys %bighash). Here's what I have:
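use Time::ParseDate;    # assumed source of parsedate()

my $log_entries = {};
while ( <STDIN> ) {
    # Date string looks like: "06/Apr/2001:08:30:05"
    next unless m{(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})};
    my $secs = parsedate($1);
    push @{ $log_entries->{$secs} }, $_;
}
# Sort the keys numerically; sorting %$log_entries itself would mix the values in with the keys.
foreach my $key ( sort { $a <=> $b } keys %$log_entries ) {
    foreach my $entry ( @{ $log_entries->{$key} } ) {
        print $entry;
    }
}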
This all goes to hell when the size of crap gets really big, i.e., "Out of memory!". Can anyone suggest a more efficient solution?

Replies are listed 'Best First'.
(jcwren) Re: memory monster
by jcwren (Prior) on Apr 07, 2001 at 01:11 UTC

    Rather than re-inventing the wheel, why not use already written tools? Like 'sort', that comes with every *nix system? And if you're on Windows, there are *plenty* of aftermarket sort tools. See TUCOWS

    'sort' supports multiple keys, so all you need to do is set up your key order for year, month, day, then time. You *may* need to munge the month into a number with a front end script, but log files should be written with sortable dates, to start with. '06/Apr/2001' is *wrong*. '2001/04/06' is correct.

    Update: Seems that sort knows about abbreviated month names. See the -M option.
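
The front-end script mentioned above could look something like the sketch below. It is only an illustration, assuming the timestamp format from the question ("06/Apr/2001:08:30:05"); the names sortable_key and add_keys are made up for this example. It prefixes each line with a fixed-width YYYYMMDDHHMMSS key and a tab, so that an external sort can order the lines without holding them in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map abbreviated month names to two-digit numbers.
my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} =
    map { sprintf '%02d', $_ } 1 .. 12;

# Return a fixed-width key (YYYYMMDDHHMMSS) for a log line.
# Lines without a recognizable timestamp get an all-zero key (an assumption
# made for this sketch), so they sort ahead of every dated line.
sub sortable_key {
    my ($line) = @_;
    if ( $line =~ m{(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})}
        && exists $MONTH{$2} )
    {
        return "$3$MONTH{$2}$1$4$5$6";
    }
    return '0' x 14;
}

# Copy $in to $out, prefixing every line with its key and a tab.
sub add_keys {
    my ( $in, $out ) = @_;
    while ( my $line = <$in> ) {
        print {$out} sortable_key($line), "\t", $line;
    }
}

# Process every file named on the command line.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Cannot open $file: $!";
    add_keys( $in, \*STDOUT );
    close $in;
}
```

Saved under a hypothetical name such as addkey.pl, it could be run as `perl addkey.pl crap | sort | cut -f2- > sorted_crap`, where cut strips the key again after sorting.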

    --Chris

Re: memory monster
by arturo (Vicar) on Apr 07, 2001 at 01:05 UTC

    Do a search on merge sort or look it up in a CS book. Basically, the technique involves sorting chunks of the file at a time, and then ... merging the results.
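
A rough sketch of that idea in Perl, for illustration only: it sorts fixed-size chunks in memory, writes each sorted chunk (a "run") to a temporary file, and then merges the runs by repeatedly taking the smallest head line. The helper names (sortable_key, write_runs, merge_runs) and the chunk size are assumptions of this example, and the timestamp format is the one from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} =
    map { sprintf '%02d', $_ } 1 .. 12;

# Fixed-width YYYYMMDDHHMMSS key; all zeros for undated lines (an assumption).
sub sortable_key {
    my ($line) = @_;
    if ( $line =~ m{(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})}
        && exists $MONTH{$2} )
    {
        return "$3$MONTH{$2}$1$4$5$6";
    }
    return '0' x 14;
}

# Phase 1: sort chunks of at most $chunk_size lines and write each one to
# its own temporary file as "key<TAB>line". Returns the rewound filehandles.
sub write_runs {
    my ( $in, $chunk_size ) = @_;
    my ( @runs, @chunk );
    my $flush = sub {
        return unless @chunk;
        my ($fh) = tempfile( UNLINK => 1 );
        print {$fh} map { "$_->[0]\t$_->[1]" }
            sort { $a->[0] cmp $b->[0] } @chunk;
        seek $fh, 0, 0 or die "seek: $!";
        push @runs, $fh;
        @chunk = ();
    };
    while ( my $line = <$in> ) {
        $line .= "\n" unless $line =~ /\n\z/;    # guard an unterminated last line
        push @chunk, [ sortable_key($line), $line ];
        $flush->() if @chunk >= $chunk_size;
    }
    $flush->();
    return @runs;
}

# Phase 2: merge the runs. Only one line per run is held in memory.
sub merge_runs {
    my ( $out, @runs ) = @_;
    my $next = sub {
        my ($fh) = @_;
        my $rec = <$fh>;
        return unless defined $rec;
        my ( $key, $line ) = split /\t/, $rec, 2;
        return [ $key, $line, $fh ];
    };
    my @heads = map { $next->($_) } @runs;
    while (@heads) {
        # A linear scan for the smallest key is fine for a modest number of runs.
        my $min = 0;
        for my $i ( 1 .. $#heads ) {
            $min = $i if $heads[$i][0] lt $heads[$min][0];
        }
        print {$out} $heads[$min][1];
        my $new = $next->( $heads[$min][2] );
        if   ($new) { $heads[$min] = $new }
        else        { splice @heads, $min, 1 }
    }
}

# Sort the file named on the command line, if one was given.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    merge_runs( \*STDOUT, write_runs( $in, 100_000 ) );
}
```

Every run keeps a filehandle open during the merge, so a very large input with a small chunk size could hit the per-process limit on open files; a larger chunk size or a multi-pass merge would avoid that.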

    But RAM is cheap at the moment, and everybody loves to solve problems by throwing money at them =)

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: memory monster
by stephen (Priest) on Apr 07, 2001 at 13:16 UTC
    Just an idea off the top of my head: if 'crap' is seekable, you could take a two-phased approach-- instead of storing the log entry itself in memory, just store the date and the position of the line (a byte offset from tell is what seek needs, rather than a line number). At the end, you'd have a hash of parsed dates and line positions.

    After you'd built this up, you could sort by date, get each position, seek to the line, read and print it, and move to the next one. You'd never be storing the contents of the log file entries in memory, so your memory needs should drop a great deal.
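
A minimal sketch of this two-phase approach, under some assumptions: it records byte offsets (from tell) rather than line numbers, because seek works on byte positions, and it keeps the index as a flat array of "key offset" strings instead of a hash to keep the per-line overhead small. The names sortable_key and print_sorted are made up for the example, and the input must be a real, seekable file rather than a pipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} =
    map { sprintf '%02d', $_ } 1 .. 12;

# Fixed-width YYYYMMDDHHMMSS key; all zeros for undated lines (an assumption).
sub sortable_key {
    my ($line) = @_;
    if ( $line =~ m{(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})}
        && exists $MONTH{$2} )
    {
        return "$3$MONTH{$2}$1$4$5$6";
    }
    return '0' x 14;
}

sub print_sorted {
    my ( $in, $out ) = @_;

    # Phase 1: remember only the key and the starting offset of each line.
    my @index;
    while (1) {
        my $pos  = tell $in;
        my $line = <$in>;
        last unless defined $line;
        push @index, sprintf '%s %015d', sortable_key($line), $pos;
    }

    # Phase 2: visit the lines in key order. Both fields are fixed width,
    # so a plain string sort orders by key first, then by offset.
    for my $entry ( sort @index ) {
        my $pos = 0 + substr $entry, 15;
        seek $in, $pos, 0 or die "seek: $!";
        my $line = <$in>;
        print {$out} $line;
    }
}

# Sort the file named on the command line, if one was given.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print_sorted( $in, \*STDOUT );
}
```

The trade-off is one seek per line in the second phase, which may be slow on a very large file, but memory use grows with the number of lines rather than with their total size.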

    Also, I'd investigate the Tie::* modules on CPAN. There are a number of tied hashes which handle disk-based storage.

    stephen