These five lines were whipped up to sort a massive file on a date field formatted as MM/DD/YY. It's a cross between a GRT (Guttman-Rosler Transform) and an ST (Schwartzian Transform), with my own little enhancement. Rather than loading the entire file into memory and sorting references to in-memory elements, the program records each line's position in the file, sorts those positions by date, and then prints the file by seeking to each position in turn.

Hope you find it useful,
-v

Update: jdporter reminded me it's "GRT" not "GSR". Thanks!

# An evil collection of one-liners to make this five-line
# script which sorts a file by a date field. It's designed
# for massive files, and indexes directly into the file.
open(FD,"data") || die;
my ($loc,@a) = (tell(FD));
push(@a,[$loc,(split /\|/)[4]]),$loc = tell(FD) while(<FD>); # 4 is the datefield
seek(FD,$_,0), $_ = <FD>, print for
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] }
    map  { [$_->[0], sprintf "%04d%02d%02d",(split /\//,$_->[1])[2,0,1]] } @a;
close FD;

Replies are listed 'Best First'.
Re: Sort Large Files
by Limbic~Region (Chancellor) on Jan 05, 2005 at 21:22 UTC
    Velaki,
    In your description, you say the date is in the format MM/DD/YY, but in your code it looks like it is a 4 digit year. Out of idle curiosity, does the following run any faster?
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $file = $ARGV[0] || 'bigfile.dat';
    open (INPUT, '<', $file) or die "Unable to open $file for reading : $!";

    my %date;
    my $pos = tell INPUT;
    while ( <INPUT> ) {
        my @field = split /\|/;
        push @{ $date{ join '', (split m|/|, $field[4])[2,0,1] } }, $pos;
        $pos = tell INPUT;
    }
    for ( sort { $a <=> $b } keys %date ) {
        for ( @{ $date{ $_ } } ) {
            seek INPUT, $_, 0;
            print scalar <INPUT>;
        }
    }
    I am also a little curious if they generate the same output. My version should preserve the order when two or more dates are the same. For anyone interested in micro-optimization, unpack 'x6A4X10A2xA2', $field[4] might be faster than the split/slice combo but is untested.

    Cheers - L~R

      Nifty! I'll give it a try! Eliminating the sprintf should speed it up. I just whipped it up ad hoc in about 15 minutes to solve a quick problem. I like what you've done with it.

      Thanks!
      -v
      "Perl. There is no substitute."
Re: Sort Large Files
by Anonymous Monk on Jan 06, 2005 at 12:06 UTC
    If it's a really large file, I'd let the shell sort it:
    # This is *NOT* a useless use of cat.
    cat $file |\
    perl -ne 'printf "%04d%02d%02d %s", (split m{/})[2,0,1], $_' |\
    sort |\
    cut -d ' ' -f 2-
    This also uses a GRT ;-)
      # This is *NOT* a useless use of cat.
      cat $file |\
      In spite of the tag, it is in fact a useless use of cat. So, I wonder why the original poster went out of their way to try to say it wasn't? {sigh}

      Also, those backslashes at the end of lines give me the willies. The shells that I use don't need them.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        It of course isn't.

        You cannot always replace

        cat $file | prog
        with
        prog < $file
        A few cases on which it fails:
        file=""
        file="data1 data2"
        file="-s squeeze-my-blanks"
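The second case can be tried out with a short script. This is a sketch, not a claim about every shell: `data1` and `data2` are throwaway files created just for the demo, and `wc -l` stands in for `prog`.

```shell
#!/bin/sh
# Show how an unquoted $file holding two names behaves differently
# under `cat $file | prog` and `prog < $file`.
# Assumption: data1 and data2 are throwaway files created here for the demo.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'one\ntwo\n' > data1
printf 'three\n' > data2

file="data1 data2"

# Word splitting hands cat two arguments, so both files are read.
via_cat=$(cat $file | wc -l | tr -d ' ')

# The redirection is either rejected as ambiguous (bash) or looks for a
# single file literally named "data1 data2" (POSIX sh); both fail here.
if { wc -l < $file; } >/dev/null 2>&1; then
  redirect_status=ok
else
  redirect_status=failed
fi

echo "cat pipeline counted: $via_cat lines"
echo "redirection: $redirect_status"

cd / && rm -rf "$dir"
```

With these files the cat pipeline counts 3 lines while the redirection fails. The empty-string and leading-switch cases can be explored the same way, bearing in mind that `cat` with no arguments reads standard input.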

        In this particular pipeline, $file could have been placed after the perl command (if you can assume $file doesn't have a switch for cat). However, that would place the data to act on somewhere in the middle of the pipeline. Which I find harder to understand. Flow should go from right to left, left to right, top to bottom, or bottom to top. But not middle, left, right. cat is short, just three letters, which places the data nearly at the beginning. Placing the entire pipeline in parens, and putting < $file at the end places the data at the end, but you can't do that because of the reasons listed earlier.

        That's two reasons why the use of cat wasn't useless.

        So, I wonder why the original poster went out of their way to try to say it wasn't?

        Because this is Perlmonks, and this is where Dr. Pavlov would have a field day if he were still alive. The original poster had hoped that by saying the use of cat wasn't useless, people would stop and think before reacting reflexively - but I guess the cerebral cortex was once again victorious over the brains.

        Also, those backslashes at the end of lines give me the willies. The shells that I use don't need them.

        Good for you. My preferred shells don't need them either, but I wasn't going to spend time figuring out which shells need them and which ones don't (as I don't know which shells the readers are using), so I just used a syntax that should work regardless of whether the shell needs them or not. A bit of portability at the cost of three keystrokes, not bad, is it?