Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to sort on the first two columns of data using the sub below:
    sub RowSort {
        my($aa) = $a =~ /(\d+)\,(\d+)/;
        my($bb) = $b =~ /(\d+)\,(\d+)/;
        $aa <=> $bb;
    }
Here is my current data:
    1,64,1.4.5,1.4.6,44642850,44642850,0,27348,10028,59188,1488095,761904.64
    1,128,1.4.5,1.4.6,25337850,25337850,0,19236,10276,28196,844595,864865.28
    1,256,1.4.5,1.4.6,13489200,13489200,0,17792,11372,17832,449640,920862.72
    1,512,1.4.5,1.4.6,6996270,6996270,0,18084,16744,19124,233209,955224.064
    1,1024,1.4.5,1.4.6,3557880,3557880,0,31528,20488,35188,118596,971538.432
    2,64,1.4.5,1.4.6,44642850,44642850,0,25828,9548,40128,1488095,761904.64
    2,128,1.4.5,1.4.6,25337850,25337850,0,27936,10796,28696,844595,864865.28
    2,256,1.4.5,1.4.6,13489200,13489200,0,12852,10692,13332,449640,920862.72
    2,512,1.4.5,1.4.6,6996270,6996270,0,17184,15904,18844,233209,955224.064
    2,1024,1.4.5,1.4.6,3557880,3557880,0,34068,17948,36628,118596,971538.432
And here is my expected output:
    # result should be sorted:
    1,64,1.4.5,1.4.6,44642850,44642850,0,27348,10028,59188,1488095,761904.64
    2,64,1.4.5,1.4.6,44642850,44642850,0,25828,9548,40128,1488095,761904.64
    1,128,1.4.5,1.4.6,25337850,25337850,0,19236,10276,28196,844595,864865.28
    2,128,1.4.5,1.4.6,25337850,25337850,0,27936,10796,28696,844595,864865.28
    1,256,1.4.5,1.4.6,13489200,13489200,0,17792,11372,17832,449640,920862.72
    2,256,1.4.5,1.4.6,13489200,13489200,0,12852,10692,13332,449640,920862.72
    1,512,1.4.5,1.4.6,6996270,6996270,0,18084,16744,19124,233209,955224.064
    2,512,1.4.5,1.4.6,6996270,6996270,0,17184,15904,18844,233209,955224.064
    1,1024,1.4.5,1.4.6,3557880,3557880,0,31528,20488,35188,118596,971538.432
    2,1024,1.4.5,1.4.6,3557880,3557880,0,34068,17948,36628,118596,971538.432
Any thoughts?

Replies are listed 'Best First'.
Re: numeric sort on substring
by kennethk (Abbot) on Jan 06, 2011 at 16:36 UTC
    The issue with your regular expression is that $aa and $bb receive the first captured number, not the second. You could get your expected result by modifying your regular expression so that it does not capture the first digits:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @data = grep $_, <DATA>;
    print sort RowSort @data;

    sub RowSort {
        my($aa) = $a =~ /\d+,(\d+)/;
        my($bb) = $b =~ /\d+,(\d+)/;
        $aa <=> $bb;
    }

    __DATA__
    1,64,1.4.5,1.4.6,44642850,44642850,0,27348,10028,59188,1488095,761904.64
    1,128,1.4.5,1.4.6,25337850,25337850,0,19236,10276,28196,844595,864865.28
    1,256,1.4.5,1.4.6,13489200,13489200,0,17792,11372,17832,449640,920862.72
    1,512,1.4.5,1.4.6,6996270,6996270,0,18084,16744,19124,233209,955224.064
    1,1024,1.4.5,1.4.6,3557880,3557880,0,31528,20488,35188,118596,971538.432
    2,64,1.4.5,1.4.6,44642850,44642850,0,25828,9548,40128,1488095,761904.64
    2,128,1.4.5,1.4.6,25337850,25337850,0,27936,10796,28696,844595,864865.28
    2,256,1.4.5,1.4.6,13489200,13489200,0,12852,10692,13332,449640,920862.72
    2,512,1.4.5,1.4.6,6996270,6996270,0,17184,15904,18844,233209,955224.064
    2,1024,1.4.5,1.4.6,3557880,3557880,0,34068,17948,36628,118596,971538.432

    Using YAPE::Regex::Explain to parse the regex:

    The regular expression:

    (?-imsx:\d+,(\d+))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times (matching
                               the most amount possible))
    ----------------------------------------------------------------------
      ,                        ','
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        \d+                      digits (0-9) (1 or more times (matching
                                 the most amount possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    See perlretut.
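
    As a small illustration of the point above (the sample row and variable names are made up for this sketch, not taken from the thread): a match in list context returns all of its captures, and assigning that list to a single variable in parentheses keeps only the first one.

```perl
use strict;
use warnings;

# Illustrative row, shortened to the first few columns.
my $row = "1,128,1.4.5,1.4.6";

# Two capture groups but one variable: only the first capture is kept.
my ($first_only) = $row =~ /(\d+),(\d+)/;

# A single capture group around the second number.
my ($second) = $row =~ /\d+,(\d+)/;

# Two variables receive both captures.
my ($col1, $col2) = $row =~ /(\d+),(\d+)/;

print "first_only=$first_only second=$second col1=$col1 col2=$col2\n";
```

    Here $first_only holds the first column, which is why the original sub ended up comparing the wrong values.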

      Or do you need:

      sub RowSort {
          my ($a1, $a2) = $a =~ /(\d+)\,(\d+)/;
          my ($b1, $b2) = $b =~ /(\d+)\,(\d+)/;
          my $am = $a1.$a2;
          my $bm = $b1.$b2;
          $am <=> $bm;
      }


      Be Appropriate && Follow Your Curiosity
Re: numeric sort on substring
by Anonyrnous Monk (Hermit) on Jan 06, 2011 at 16:41 UTC
    I'm trying to sort on the first two columns
    sub RowSort {
        my($a1, $a2) = $a =~ /(\d+),(\d+)/;
        my($b1, $b2) = $b =~ /(\d+),(\d+)/;
        $a2 <=> $b2 or $a1 <=> $b1;
    }

    Chaining comparisons with 'or' means that when the first one reports 'equal' (<=> yields 0), the next comparison is evaluated, and so on down the chain.
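
    A minimal sketch of how the chain behaves, using a hypothetical helper (by_second_then_first is not from the thread) that takes two [col1, col2] pairs. It uses || rather than 'or' only because it sits behind an explicit return, where the low-precedence 'or' would be evaluated after the return has already happened.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two [col1, col2] pairs by col2, then col1.
# <=> yields -1, 0 or 1; a 0 (tie) is false, so || moves on to the next test.
sub by_second_then_first {
    my ($x, $y) = @_;
    return $x->[1] <=> $y->[1] || $x->[0] <=> $y->[0];
}

print by_second_then_first([1, 64], [2, 64]),  "\n";  # second columns tie, first column decides
print by_second_then_first([2, 64], [1, 128]), "\n";  # second column decides on its own
print by_second_then_first([1, 64], [1, 64]),  "\n";  # both columns tie
```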

Re: numeric sort on substring
by moritz (Cardinal) on Jan 06, 2011 at 16:58 UTC

      As that would sort by the second column only, it would fail to yield the desired output in case the input was sorted differently.  For example, if all the rows with "2" in the first column came first in the input, the output would be

      2,64,1.4.5,1.4.6,44642850,44642850,0,25828,9548,40128,1488095,761904.64
      1,64,1.4.5,1.4.6,44642850,44642850,0,27348,10028,59188,1488095,761904.64
      2,128,1.4.5,1.4.6,25337850,25337850,0,27936,10796,28696,844595,864865.28
      1,128,1.4.5,1.4.6,25337850,25337850,0,19236,10276,28196,844595,864865.28
      ...
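
      A rough way to check this yourself, assuming rows shortened to their first two columns and a hypothetical helper keys_of (neither is from the thread): sort the same reordered input once on the second column only and once on both columns. Only the two-key result is independent of the input order.

```perl
use strict;
use warnings;

# Shortened, illustrative rows; the "2" rows come first in the input.
my @rows = ("2,64", "2,128", "1,64", "1,128");

# Hypothetical helper: return the two leading numbers of a row.
sub keys_of {
    my ($row) = @_;
    return $row =~ /^(\d+),(\d+)/;
}

# Second column only: rows that tie are left in whatever order sort gives them.
my @one_key = sort { (keys_of($a))[1] <=> (keys_of($b))[1] } @rows;

# Second column, then first: ties are resolved explicitly.
my @two_keys = sort {
    my ($a1, $a2) = keys_of($a);
    my ($b1, $b2) = keys_of($b);
    $a2 <=> $b2 or $a1 <=> $b1;
} @rows;

print "one key:  @one_key\n";
print "two keys: @two_keys\n";
```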
Re: numeric sort on substring
by Jim (Curate) on Jan 07, 2011 at 00:24 UTC

    Here's a way to do it using split within a Schwartzian Transform:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @data = <DATA>;

    # Schwartzian Transform
    print
        map  { $_->[0] }
        sort { $a->[1][1] <=> $b->[1][1] or $a->[1][0] <=> $b->[1][0] }
        map  { [ $_, [ (split m/,/, $_, 3)[0, 1] ] ] }
        @data;

    __DATA__
    1,64,1.4.5,1.4.6,44642850,44642850,0,27348,10028,59188,1488095,761904.64
    1,128,1.4.5,1.4.6,25337850,25337850,0,19236,10276,28196,844595,864865.28
    1,256,1.4.5,1.4.6,13489200,13489200,0,17792,11372,17832,449640,920862.72
    1,512,1.4.5,1.4.6,6996270,6996270,0,18084,16744,19124,233209,955224.064
    1,1024,1.4.5,1.4.6,3557880,3557880,0,31528,20488,35188,118596,971538.432
    2,64,1.4.5,1.4.6,44642850,44642850,0,25828,9548,40128,1488095,761904.64
    2,128,1.4.5,1.4.6,25337850,25337850,0,27936,10796,28696,844595,864865.28
    2,256,1.4.5,1.4.6,13489200,13489200,0,12852,10692,13332,449640,920862.72
    2,512,1.4.5,1.4.6,6996270,6996270,0,17184,15904,18844,233209,955224.064
    2,1024,1.4.5,1.4.6,3557880,3557880,0,34068,17948,36628,118596,971538.432

    UPDATE: If you prefer regular expression pattern matching to splitting in this case, just replace the initial map with this:

    map { [ $_, [ m/^(\d+),(\d+)/ ] ] }
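
    For readers new to the idiom, here is the same transform unrolled into three named steps. This is a sketch only: the three short rows and the intermediate variable names are invented for illustration, and the keys are stored flat next to the line instead of in an inner array.

```perl
use strict;
use warnings;

# Illustrative rows, not the thread's data.
my @data = ("2,64,x\n", "1,128,y\n", "1,64,z\n");

# 1. Decorate: pair each line with its two extracted sort keys.
my @decorated = map { [ $_, /^(\d+),(\d+)/ ] } @data;

# 2. Sort on the precomputed keys: second column, then first.
my @sorted = sort { $a->[2] <=> $b->[2] or $a->[1] <=> $b->[1] } @decorated;

# 3. Undecorate: keep only the original lines.
my @result = map { $_->[0] } @sorted;

print @result;
```

    The point of decorating first is that each line is parsed once, rather than every time the comparator is called.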

      I'm wondering why you add the complication of an inner anonymous array and a three-argument split. I think neither is necessary and, since split defaults to operating on $_, one argument suffices.

      print for
          map  { $_->[ 0 ] }
          sort { $a->[ 1 ] <=> $b->[ 1 ] || $a->[ 2 ] <=> $b->[ 2 ] }
          map  { [ $_ , ( split m{,} )[ 1, 0 ] ] }
          <DATA>;

      You could also use a Guttman Rosler transform.

      print for
          map  { substr $_, 8 }
          sort
          map  { pack q{NNA*}, ( split m{,} )[ 1, 0 ], $_ }
          <DATA>;
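
      A sketch of what the transform relies on, with invented sample rows and intermediate variables: each 'N' packs a key as a 4-byte big-endian unsigned integer, so a plain string sort of the packed records orders them numerically, and the original line starts at offset 8. It assumes both columns are non-negative integers that fit in 32 bits.

```perl
use strict;
use warnings;

# Illustrative rows, not the thread's data.
my @data = ("2,64,x\n", "1,1024,y\n", "1,64,z\n");

# Prefix each line with its keys: 4 bytes for column 2, 4 bytes for column 1.
my @packed = map { pack 'NNA*', (split /,/)[1, 0], $_ } @data;

# Default sort compares strings; big-endian keys make that a numeric order.
my @sorted = sort @packed;

# Drop the 8 key bytes to recover the original lines.
my @result = map { substr $_, 8 } @sorted;

print @result;
```

      Without the packing, a string sort would place "1024" before "64", which is why the keys are not simply prepended as text.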

      I hope this is of interest.

      Cheers,

      JohnGG

        In hindsight, the complication of the inner anonymous array is needless. It reflects how my mind reckoned the data structure at the moment I wrote the transform.

        The three-argument split is just a habit. The habit is based on the documentation, which states: "In time critical applications it behooves you not to split into more fields than you really need." I don't know if the OP's application is time-critical or not. I went with the more conservative assumption. Like I said: habit.
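
        For anyone unfamiliar with the third argument, a quick sketch (the shortened sample row is only illustrative): the limit caps the number of fields, and the last field keeps the rest of the string unsplit.

```perl
use strict;
use warnings;

# Illustrative row, shortened from the thread's data.
my $row = "1,64,1.4.5,1.4.6,44642850";

my @all     = split /,/, $row;      # every field
my @limited = split /,/, $row, 3;   # at most 3 fields; the third holds the unsplit rest

print scalar(@all), " fields without a limit\n";
print scalar(@limited), " fields with a limit of 3, last one: $limited[2]\n";
```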

        I like the regular expression pattern matching version better anyway.