Sorting big text lists

Infinity has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Sorting big text lists by DamnDirtyApe (Curate) on Jul 30, 2002 at 18:10 UTC
Functions of particular interest to you will be split and sort. Here's a code example to get you on your way. #! /usr/bin/perl use strict ; use warnings ; $\|++ ; # Read the data into a 2d array chomp( my @lines = <DATA> ) ; my @data = () ; foreach ( @lines ) { my @cells = split ; push @data, \@cells ; } # Sort the list by user. my @data_by_name = sort { $a->[0] cmp $b->[0] } @data ; # Sort the list by date & time. my @data_by_date = sort { $a->[3] <=> $b->[3] \|\| $a->[4] <=> $b->[4] } @data ; __DATA__ u12345 x10 qwerty 20020725 1421 u12357 x11 asdf;; 20020727 1524 u12245 x12 perl 20020722 1941 u12145 x13 python 20020725 1825 u12945 x14 /bin/sh 20020724 1331 u12545 x15 grep 20020721 1921 [download] _______________ D a m n D i r t y A p e Home Node \| Email	[reply] [d/l]
Re: Re: Sorting big text lists by Chady (Priest) on Jul 31, 2002 at 13:55 UTC
`foreach ( @lines ) { my @cells = split ; push @data, \@cells ; }` [download] I wouldn't exactly be showing this to a newbie. As the original poster pointed out, he has "never even programmed", I have been using perl for about two years, and I still have some problems with references.. maybe building a hash with original lines as keys and specific column as values would be easier to understand (but maybe harder on the memory, depends on how big is the list.) don't you agree? He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life. Chady \| http://chady.net/	[reply] [d/l]
Re: Re: Re: Sorting big text lists by DamnDirtyApe (Curate) on Jul 31, 2002 at 15:16 UTC
++, though I tend to disagree regarding references. Your suggestion was to build a hash with original lines as keys, and a specific column as values. To me, that seems inefficient, and no easier to understand. The data is already in a tabular format. The operations we want to do (sorting by a specific column) are easily realized with tabular data. Why not, then, continue to treat the data as a table once it's loaded into memory? The only thing I might do differently here would be to use an AoH (array of hashes) instead of an AoA (array of arrays), thus giving the columns names instead of numbers. If I am wrong, and the code I presented is actually way over the original poster's head, forgive me. Here are some resources that will help: perlreftut, perlref perldsc, perldata perllol _______________ D a m n D i r t y A p e Home Node \| Email	[reply]
Re: Sorting big text lists by krujos (Curate) on Jul 30, 2002 at 18:15 UTC
Perl has a sort built in to do things like this. You will also want to checkout split. Depending on how the files are set up (delimiter) you are going to want to split on that and sort by the correct element. An example (untested) for something like this would be the following. We are splitting on commas. #!/usr/bin/perl -w use strict; #always use strict my %sortKey; open (INFILE, "filename"); #open the file while <INFILE> { #iterate through the file line by line my @tmp = split /,/; #split the file on the comma #for the sake of the example we will say that the first #element of the file is what you want to sort on #so we will make it the key to our hash $sortKey{$tmp[0]}=$_; } open (OUTFILE ">outfile"); #this will take the keys, shove them in an array, sort #them and set $key equal to the current sorted key foreach my $key (sort(keys(%sortKey))) { print OUTFILE " %sortKey{$key}; } [download] Hopefully this will give you a pretty good idea how to do the first part, and enough info on how to do the second part. Feel free to /msg me if you have questions. UPDATE: fixed a syntax error	[reply] [d/l]
Re: Sorting big text lists by grinder (Bishop) on Jul 30, 2002 at 18:32 UTC
One monk's big is another monk's small. How big are these files? 10 thousand records? 40 million records? It may well be that the file sizes are small enough that you can safely sort within Perl: it won't take too much memory, and it won't take too long. But there is a threshold to be aware of: when the file reaches a certain size with respect to free RAM available on your computer it is more efficient to use the sort utility that comes with the operating system. (Unless you happen to be stuck on Windows, although Cygwin can can help you out there). A sufficiently full-featured sort utility will be able to sort your file on userID and by time within userID (ascending or descending) in a single run, and it will probably be faster than Perl could do it by an order of magnitude or two. Once it is sorted it will be a snap to write a simple Perl script to walk down the file and split it out into new files when the userID changes, and those new files will already be sorted. Back in the '60s, someone (Knuth? Hoare? Dijkstra?) observed that 50% of CPU time is spent sorting. In this age of GUIs, that proportion has no doubt decreased, but you can be sure that the sort utility that comes with your OS has had an awful lot of time spent on it making sure it runs as fast as possible (especially when the files exceed the amount of available RAM). Know when to use it. <update> Given a datafile as follows (I'm assuming your data really are separated by dashes): u213-alpha-r-2002/03/19-00:09 u213-alpha-q-2002/03/19-00:08 u213-alpha-j-2002/03/19-00:01 u214-bravo-k-2002/03/19-00:02 u214-bravo-l-2002/03/19-00:03 u214-bravo-o-2002/03/19-00:06 u214-bravo-n-2002/03/19-00:05 u214-bravo-t-2002/03/19-00:11 u214-bravo-u-2002/03/19-00:12 u212-charlie-m-2002/03/19-00:04 u212-charlie-v-2002/03/19-00:13 u212-charlie-p-2002/03/19-00:07 u212-charlie-w-2002/03/19-00:14 u213-delta-s-2002/03/19-00:10 You can sort this using - as a delimiter, on the first column and then on the 4th column (descending) with the following command: `sort -t- -k1 -r -k4 file.dat >file.sorted` Hope this helps. </update> print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'	[reply] [d/l]
Re: Re: Sorting big text lists by Infinity (Initiate) on Jul 30, 2002 at 23:02 UTC
In this case my big is about 590,000 entries. I have tried the above two scripts and both have generated errors and I have no idea how to correct them. I am using Red Hat 7.3. I'm not sure what sort utilities it has that I can use. I am a little familiar with the bash shell so if there's anything I can do using that it might help. Thanks.	[reply]
Database? by BorgCopyeditor (Friar) on Jul 30, 2002 at 19:36 UTC
Your differing potential sort requirements make this sound like a problem that could be usefully approached by means of a database. Perl has lots of nifty ways to access information stored in databases, and even to treat simple files as databases. I know it sounds like learning yet another technology, but databases are pretty straightforward and easy to talk to. You could look at the DBI modules, for starters. BCE --Your punctuation skills are insufficient!	[reply]
Re: Sorting big text lists by Cine (Friar) on Jul 30, 2002 at 19:58 UTC
Why use perl for this trivial piece of work? The usual sort command in *nix will do the job by default (or if UserID is numeric and -n is needed) T I M T O W T D I	[reply]
Re: Re: Sorting big text lists by KristiePi (Initiate) on Jul 30, 2002 at 23:49 UTC
I'm working on the same project with Infinity. How would we do this sort command in *nix? I think we may have already tried this.	[reply]
Re: Re: Re: Sorting big text lists by Cine (Friar) on Jul 31, 2002 at 08:01 UTC
man sort T I M T O W T D I	[reply]