fiddler42 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, All,

I have a gigantic array named @AllPatterns, and each entry in the array consists of exactly 5 numbers, formatted like so:

1 2 1 1 8

1 6 5 12 12

1 1 1 1 1

1 1 1 1 1

etc. In other words: a decimal number always followed by exactly one blank space. This is guaranteed.

I need to do a couple of important sorts, and I am not sure how to pull them off. The sorts need to go like this:

1. Sort all data by the first, then second, then third, then fourth, then fifth column.

2. After the sort completes, collapse the array until it possesses nothing but unique combinations of numbers.

3. When collapsing the table to its unique entries, keep track of all the occurrences that got collapsed.

Based on the simple example above, the end result would be:

2 - 1 1 1 1 1

1 - 1 2 1 1 8

1 - 1 6 5 12 12

...where the first number is the number of collapsed entries.

I think I know how to do #1, but I am not sure how to do #2 and #3. Any suggestions?

Thanks,

-fiddler42

Re: Need to sort large table of data...
by igelkott (Priest) on Feb 18, 2008 at 23:22 UTC
    Just a hack but how about the following to get started:
    # The @AllPatterns array is probably fed elsewhere, this is just for the test
    @AllPatterns = ([qw(1 2 1 1 8)], [qw(1 6 5 12 12)], [qw(1 1 1 1 1)], [qw(1 1 1 1 1)]);
    for $v (@AllPatterns) {
        ++$h{join(' ',sort {$a <=> $b} @$v)};
    }
    for $k (sort keys %h) {
        print "$h{$k} - $k\n";
    }

    This will probably choke (run out of memory) on a "gigantic array" so you might need to write to disk after each step you outlined.
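One way to ease the memory pressure, sketched here under the assumption that the patterns can be read one line at a time from a file instead of being loaded into @AllPatterns first: count while reading, so that only the distinct patterns are held in memory. The in-memory filehandle and the names $data, $in and %count are stand-ins for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a large input file: one pattern per line.
my $data = join("\n", '1 2 1 1 8', '1 6 5 12 12', '1 1 1 1 1', '1 1 1 1 1') . "\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my %count;
while (my $line = <$in>) {
    chomp $line;
    next unless length $line;   # skip blank lines
    $count{$line}++;            # memory grows with distinct patterns only
}
close $in;

printf "%d distinct patterns\n", scalar keys %count;
```

With a real file, the open call would take a file name instead of the scalar reference. If nearly every pattern is distinct, this saves little and an external sort is probably the better route.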

    Update: It just struck me that you may have a simple array of strings rather than a 2D array, as I had assumed. It may be obvious how to adapt the hack, but here it is, just to get it out of my head:

    @AllPatterns = ('1 2 1 1 8', '1 6 5 12 12', '1 1 1 1 1', '1 1 1 1 1');
    for $v (@AllPatterns) {
        ++$h{join(' ',sort {$a <=> $b} split(' ',$v))};
    }
    and then print as before.
      Hi, igelkott,

      Your suggestion worked smashingly on the first try. Good job!

      Thanks,

      -fiddler42

Re: Need to sort large table of data...
by Zen (Deacon) on Feb 18, 2008 at 23:02 UTC
    Look up Perl hashes. Each key is unique (one of your requirements), and its value can hold how many times that key occurred. Then you can print the entries that meet some condition.
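A minimal sketch of the hash idea, assuming @AllPatterns holds plain strings as in the original question. The helper name by_columns and the "repeated" filter are hypothetical, chosen only to illustrate counting, column-wise numeric ordering, and printing on a condition.

```perl
use strict;
use warnings;

my @AllPatterns = ('1 2 1 1 8', '1 6 5 12 12', '1 1 1 1 1', '1 1 1 1 1');

# Key = pattern (unique by construction), value = number of occurrences.
my %count;
$count{$_}++ for @AllPatterns;

# Compare two patterns column by column, numerically (hypothetical helper).
sub by_columns {
    my @x = split ' ', $a;
    my @y = split ' ', $b;
    for my $i (0 .. $#x) {
        my $cmp = $x[$i] <=> $y[$i];
        return $cmp if $cmp;
    }
    return 0;
}

# Full report in "count - pattern" form, ordered by columns 1 to 5.
my @report = map { "$count{$_} - $_" } sort by_columns keys %count;

# Example condition: only the patterns that occurred more than once.
my @repeated = grep { $count{$_} > 1 } sort by_columns keys %count;

print "$_\n" for @report;
print "repeated: $_\n" for @repeated;
```

The order within each pattern is left untouched; only the order of the patterns relative to each other is sorted.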
Re: Need to sort large table of data...
by quester (Vicar) on Feb 19, 2008 at 05:39 UTC
    Incidentally, if your array is really "gigantic" - so much so that you need to process it externally to Perl - you may want to write it to a file and sort it with the Unix "sort" utility. The following pipeline will do everything you want except for the hyphen after the count:
    sort -k1n -k2n -k3n -k4n -k5n < inputfilename | uniq -c > outputfilename
    and the output will look like this:
    2 1 1 1 1 1
    1 1 2 1 1 8
    1 1 6 5 12 12
      Be careful to double-check your locale.

      On Linux with GNU sort I've been bitten in the past by the fact that the default locale does not consider some characters (in that case tabs) to be part of the collating operation.

      For that reason I find it a good precaution to set $LC_ALL to C before using GNU sort.
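The external sort and the locale precaution can also be driven from Perl. This is a sketch that assumes a Unix-like system with sort and uniq on the PATH; the temporary file stands in for the real input, and the variable names are hypothetical. The keys are written as -k1,1n and so on, so that each key covers exactly one column.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input file standing in for the real data.
my ($fh, $input) = tempfile(UNLINK => 1);
print {$fh} "$_\n" for '1 2 1 1 8', '1 6 5 12 12', '1 1 1 1 1', '1 1 1 1 1';
close $fh or die "Cannot close $input: $!";

my @lines;
{
    local $ENV{LC_ALL} = 'C';   # child processes inherit the C locale
    my $cmd = "sort -k1,1n -k2,2n -k3,3n -k4,4n -k5,5n '$input' | uniq -c";
    open my $pipe, '-|', $cmd or die "Cannot start pipeline: $!";
    while (my $line = <$pipe>) {
        # uniq -c emits "<padding><count> <pattern>"; add the hyphen here.
        push @lines, "$1 - $2" if $line =~ /^\s*(\d+)\s+(.*)$/;
    }
    close $pipe or die "sort/uniq pipeline failed";
}

print "$_\n" for @lines;
```

Setting LC_ALL through a localized %ENV entry keeps the change confined to the block, so the rest of the program still runs under the user's own locale.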

Re: Need to sort large table of data...
by Anonymous Monk on Feb 19, 2008 at 03:24 UTC
    The solutions provided by igelkott both produce the same output for the example input data:
    2 - 1 1 1 1 1
    1 - 1 1 1 2 8
    1 - 1 5 6 12 12

    However, this output does not conform to the example output given by the OP:
    2 - 1 1 1 1 1
    1 - 1 2 1 1 8
    1 - 1 6 5 12 12

    The following code produces output that seems to conform to the OP's exemplified requirement (the two added values of '0 0 0 0 0' test for proper handling of leading-zero removal):

    2 - 0 0 0 0 0
    2 - 1 1 1 1 1
    1 - 1 2 1 1 8
    1 - 1 6 5 12 12

    Code:

    use warnings;
    use strict;

    my @AllPatterns = (
        '1 2 1 1 8',
        '0 0 0 0 0',
        '1 6 5 12 12',
        '0 0 0 0 0',
        '1 1 1 1 1',
        '1 1 1 1 1',
    );

    {   # limit scope of array processing variables

        my $biggest = 10;   # biggest number (most decimal digits)
        my $fmt     = join ' ', ("%0${biggest}ld") x 5;
        my $lead0s  = qr{ \b 0{1,@{[ $biggest - 1 ]}} }xms;
        my %seen;           # unique-ifing hash

        @AllPatterns =                        # 7. save to original array
            map  { "$seen{$_} - $_" }         # 6. make into final format
            grep { not $seen{$_}++ }          # 5. only patterns not seen yet
            map  { s{$lead0s}{}xmsg; $_ }     # 4. remove padding
            sort                              # 3. lexicographic ascending sort
            # map { print "'$_' \n"; $_ }     # 2b. FOR DEBUG
            map  { sprintf $fmt, split }      # 2. pad fields to constant widths
            @AllPatterns                      # 1. for all patterns...
            ;

    }   # end scope of array processing variables

    print join("\n", @AllPatterns), "\n";
Re: Need to sort large table of data...
by tilly (Archbishop) on Feb 19, 2008 at 07:35 UTC
    Two notes.

    The first is that I always have to wonder what people mean when they say "gigantic". I've seen people say that about data sets from a few thousand lines to billions. Usually people mean something more like the former, which is a much easier question to answer than the latter.

    The second is that if you're regularly faced with this kind of problem, you should get familiar with how to use databases. With a database you could give your columns real names, and the sorting and grouping functionality that you asked for is standard. (Along with many more complicated features.)