DeusVult has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I need some advise on good sorting modules/commands/techniques. In the interest of honesty, I will admit that I have not done as much research on my own before asking this question as I normally would, but there are some time restraints. Any and all help would be appreciated.

The problem domain: I have a text file, which could be looked at like as a table, or a really cheesy DB. Each line is an entry, and the fields are space-separated. I want to sort this file based upon one of the fields.

Each of these fields is formatted in the following manner:

  1. A string (any string)
  2. Followed by a two digit number
  3. Followed by another string (again, any string)
  4. Followed by either _$string1 or _$string2 (there are two possible endings for each entry. These endings are not related in any way to the strings from 1. and 3.)

In case you are wondering, I am not the one responsible for this craziness.

Now here is the interesting part: The sort doesn't key off alphabetical order, or anything sane like that. Oh, no. Instead, I must sort this file according to the following little rules, listed in order of precedence:

  1. Entries ending in "_$string1" come before entries ending in "_$string2".
  2. Those entries with a higher two-digit number must come before those with a lower number.
  3. Some of the entries don't follow the above format. These entries must come after any that do.
  4. The strings should be in otherwise alphabetical order (both the string before the two digit number and the string after it, with the string before it given priority).

Could someone give me a shove in the general direction I should be looking? Does perl have any built-in or module-based sorting tools more complex than the sort command? Do any of them come close to handling things even remotely this weird?

Some people drink from the fountain of knowledge, others just gargle.

Replies are listed 'Best First'.
Re: Rule-based sorting
by stephen (Priest) on Apr 10, 2001 at 03:30 UTC

    Actually, it sounds to me like the sort() command is precisely what you need. Don't forget, you can define your own sorting methods.

    In fact, this looks like a job for the Schwartzian Transform. First, you parse out all your strings at once, like so:

    $string1 = 'foo'; $string2 = 'bar'; @in_strings = ( 'blue15midget_bar', 'this is an exception', 'blue17midget_foo', 'deep13gizmonic_bar', 'another exception', 'red12dwarf_foo', 'red12dwarf_bar', 'exception containing _foo', ); %string_table = (); foreach my $in_string (@in_strings) { if ( $in_string =~ m{(.+) (\d{2,2}) (.+) _ ($string1 | $string2 ) +$ }x \ ) { $string_table{$in_string} = { initial => $1, number => $2 + 0, another => $3, ending => $4, }; } else { $string_table{$in_string} = 'EXCEPTION'; } }
    Next, you write a sorting routine that checks these parsed strings.
    %end_order = ( $string1 => 0, $string2 => 1, ); sub funky_sort { my $p_a = $string_table{$a}; my $p_b = $string_table{$b}; if ( ($p_a eq 'EXCEPTION') && ($p_b eq 'EXCEPTION') ) { return $p_a cmp $p_b; } elsif ( $p_a eq 'EXCEPTION' ) { return 1; } elsif ( $p_b eq 'EXCEPTION' ) { return -1; } else { $p_a->{ending} eq $p_b->{ending} or return ( $end_order{ $p_a-> +{ending} } <=> $end_order{ $p_b->{ending} } ); $p_a->{number} == $p_b->{number} or return ( -1 * ( $p_a->{number} <=> $p_b->{number} ) ); $p_a->{initial} eq $p_b->{initial} or return ( $p_a->{initial} cmp $p_b->{initial} ); $p_a->{another} eq $p_b->{another} or return ( $p_a->{another} cmp $p_b->{another} ); # If all of these are true, then the two are equal. return 0; } }
    Then, finally, you can use sort() to sort these guys.
    foreach my $string (sort funky_sort @in_strings) { print $string, "\n"; }
    And your stuff is sorted! See the manual entry on sort().

    Disclaimer: I've tested this code to make sure that it runs, but haven't run extensive tests on it. Double-checks are appreciated and recommended.

    stephen

      You're my hero :)

      I've tested this out, and it seems to work very well. However, I am really confused about that, because it doesn't seem like it should even compile, much less work.

      First off, there are these lines:

      my $p_a = $string_table{$a}; my $p_b = $string_table{$b};

      Now, I checked, and those are the only times $a and $b are used. They are never declared (no my's anywhere!) and they are never given a value. It seems like use strict should hop all over this like a rapid chef with a meat cleaver, but it doesn't. Even still, it seems like you're assigning to $p_a and $p_b from empty variables, but yet they clearly receive values. So my response is "huh?" Are $a and $b some sort of special variables that result from sort, or is something even funkier going on?

      Also, &funky_sort only sorts two things. So when you call funky_sort @in_strings, it would seem that it would only sort the first two things, and leave the rest of the array alone. Of course, it doesn't do that at all, but why? Is it a consequence of calling sort first?

      So I really want to thank you for your help, and this truly cool scrap o' code. However, you really made my head hurt :)

      Some people drink from the fountain of knowledge, others just gargle.

        I'm not sure whay you're saying about funky_sort(), but as for $a and $b these are special variables in the context of doing a sort. see sort for a full explanation and a rather explicit warning about trying to declare them lexically. =)
        "A man's maturity -- consists in having found again the seriousness one had as a child, at play." --Nietzsche
(tye)Re: Rule-based sorting
by tye (Sage) on Apr 10, 2001 at 03:36 UTC

    I tend to prefer to compute strings that can be sorted using the "default" sort even for problems as complex as this:

    chomp( my @strs= <DATA> ); my @sorts= map { if( ! /^(.*)(\d\d)(.*)_(\Q$str1\E|\Q$str2\E)$/ ) { "3".$_; # Non-matching lines come last } else { ( $4 eq $str1 ? "1" : "2" ) # Sort on _$str[12] first . sprintf("%02d",99-$2) # Next on digits (reverse order) . $1 . $3; # Lastly sort on middle strings } } @strs; return @strs[ sort { $sorts[$a] cmp $sorts[$b] } 0..$#sorts ];
    This may be significantly faster than writing a complex "compare" routine to pass to sort if you have a lot of items to sort.

            - tye (but my friends call me "Tye")
Re: Rule-based sorting
by MeowChow (Vicar) on Apr 10, 2001 at 02:38 UTC
    What would you need a module for? Simply encode your ruleset into a comparitor subroutine, and use the sort built-in:
    sub compare_lines { # return a 1,0,or -1 depending on which of $a or $b is larger } my @lines = <FILE>; sort compare_lines @lines;
    Remember, sort is an abstraction of sorts :) If you can express, in code, whether $a is >, <, or == $b, you can use sort.
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print
      Rather than creating a compare function, I'd create a sort key generation function, and use the map sort map (Schwartzian transform) or the Guttman/Rossler method - (warning: untested, and if your 'any string' clauses contain two digits, then your hosed unless you come up with more constraints):
      # You may want to wrap '\Q' and '\E' around each of $str1 and $str2 my $re = qr/^.*(\d\d).*_($str1|$str2)$/; my @sorted = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [ $_, sort_key($_, $re, $str1) ] } @list; sub sort_key { my ($str, $re, $str1) = @_; my ($rule1, $rule2, $rule3) = ("1","00","1"); if ($str =~ $re) { $rule3 = "0"; my ($num, $s) = ($1, $2); $rule1 = "0" if $s eq $str1; $rule2 = sprintf("%2d", 99-$num); } return "$rule3$rule1$rule2$str"; }
Re: Rule-based sorting
by I0 (Priest) on Apr 10, 2001 at 03:59 UTC
    sub ST(&@){ my $metric=shift; map {$_->[0]} sort {$a->[1] cmp $b->[1]} map {[$_,&{$metric}]} @_ } print ST{ if( /(.*)(\d\d)(.*)_(\Q$string1\E|(\Q$string2\E))$/ ){ (0+defined($5)) . (200-$2) . $1 . $3 }else{ "2$_" } } <>;
Re: Rule-based sorting
by princepawn (Parson) on Apr 10, 2001 at 02:41 UTC
    This will be an easy task if you read the sorting section in the book "Data Munging with Perl."
Re: Rule-based sorting
by 2501 (Pilgrim) on Apr 10, 2001 at 03:49 UTC
    I would have liked to lean towards a map sort map option but i can't think of how i would handle criteria #3 with that solution. If somebody could think of a clean way to handle #3 that might be a nice way to go.

      Borrowing from I0, I offer this alternative:

      return map { if( s/^2// ) { $_; } elsif( /^([01])(\d\d\d)(.*)\Q$;\E(.*)$/ ) { $3 . substr(300-$2,-2) . $4 . "_" . ($str1,$str2)[$1] } } sort map { if( ! /(.*)(\d\d)(.*)_(?:\Q$string1\E|(\Q$string2\E))$/ ){ 2.$_; }else{ (0+defined($4)) . (200-$2) . $1 . $; . $3; } } @strs;

      which may be the fastest yet once you get enough elements to sort (and you have to set $; to something that won't be in your strings).

              - tye (but my friends call me "Tye")
      I would have liked to lean towards a map sort map option but i can't think of how i would handle criteria #3 with that solution.

      #3 is the rule where differently-formatted lines (i.e., not conforming to the spec) come after all other lines, and are sorted alphabetically.

      To handle these, you could do two passes, and append the differently formatted output to the other output. Actually, I prefer to separate the code for handling the two cases, since conceptually you may need to treat them quite differently.

      Also, if they're in a different formats, your specs for the sort may diverge more in the future, so any combined sort would have to be split then anyway.

        I know I'm five years late for this party but I came across this node when Googling for Guttman/Rosler transforms and thought the problem was interesting. Criteria #3 can be handled as part of a transform by recording whether the match succeeded or failed as part of the list emitted by the first map. I insert a second map before sorting to either transform string1/string2 into 0/1 for matching lines to aid sorting or, for non-matching lines, transform the undefs from the failed match into zero or "" to silence warnings from the sort.
        #!/usr/local/bin/perl -w # use strict; # Set up the two strings that we want to match and sort to the top # of our list. # our $string1 = shift or die "No strings\n"; our $string2 = shift or die "Only one string\n"; # Print out our data lines sorted using a ST but with an extra map bef +ore # the sort to tweak the results of the regular expression match done i +n # first map. # print map {$_->[0]} # map out original line for print sort { $b->[1] <=> $a->[1] || # elem. 1 is success (1) or failure (0 +) of # match so descending numeric puts g +ood # lines first $a->[5] <=> $b->[5] || # elem. 5 is our string1/string2 ranki +ng so # ascending numeric gets string1s fi +rst $b->[3] <=> $a->[3] || # elem. 3 is two-digit number, highs f +irst $a->[2] cmp $b->[2] || # ascending alpha for string at beginn +ing ... $a->[4] cmp $b->[4] # ... and the one following the digits } map { # If match below succeeded replace elem. 5 of anon. list, which + is # either string1 or string2, with a zero if string1 or a one if # string2 so we can do a numeric ascending sort to get all the # string1s and string2s in the right place. # if($_->[1]) { splice @$_, 5, 1, $_->[5] eq $string1 ? 0 : 1; } # Else match did not succeed so replace elems. 2 thru 5, which +will # contain undef, with appropriate empty strings or zeros so tha +t the # sort does not complain about "Use of uninitialized value ..." + if # warnings are switched on. # else { splice @$_, 2, 4, q(), 0, q(), 0; } # Pass anon. list out of map to sort above. # $_; } map { [ # anon. list constructor $_, # elem. 0 is original line /(?x) # do match with extended syntax ^ # anchor to beginning of line (\S+)\s+ # capture first string, then space(s) (\d\d)\s+ # capture the two digits, then space(s +) (\S+)\s+_ # capture string after digits, then sp +ace(s) # followed by underscore ($string1|$string2) # capture either string1 or string2 $ # anchor to end of line / ? 1 : 0, # elem. 1 is success or failure of mat +ch $1, $2, $3, $4 # elems. 2 thru 5 are our captives ] # close anon. list constructor, list # passed to next map above } <DATA>; # Read data a line at a time into firs +t # map of transform above exit; __END__ shirt 47 cotton,white _menswear blouse 88 cotton,pink _womenswear saucepan 82 s/steel,20cm _household singlet 83 cotton,grey _menswear duff_line 56 lots of extra fields so match fails mixer 59 multi_function,dough_hook,blender,black _white_good +s skirt 39 tweed,brown _womenswear chefs_knife 11 french_style,8inch _household trousers 27 jean,blue _menswear pepper_mill 51 beech_wood,ceramic _household shirt 15 polyester,lemon _menswear shirt 47 cotton,yellow _menswear duff_line 45 more extra fields so duff again shirt 47 cotton,blue _menswear socks 76 wool,blue,pack_of_5 _menswear microwave 40 850W,turntable,white _white_good +s boxers 84 cotton,tartan _menswear trousers 47 cotton,blue _menswear skirt 56 velvet,black_BUT_DUFF_COS_TRAILING_SPACE _womenswear + duff_line no_numbers_and_only_one_further_field t-shirt 29 nylon,black _menswear television 83 wide_screen,black _white_good +s tie 39 silk,blue _menswear butchers_block 45 beech _household deep_fat_fryer 27 white,non_stick _white_good +s blouse 55 silk,bronze_green _womenswear
        Running the script with arguments of menswear and household gives:-
        boxers 84 cotton,tartan _menswear singlet 83 cotton,grey _menswear socks 76 wool,blue,pack_of_5 _menswear shirt 47 cotton,blue _menswear shirt 47 cotton,white _menswear shirt 47 cotton,yellow _menswear trousers 47 cotton,blue _menswear tie 39 silk,blue _menswear t-shirt 29 nylon,black _menswear trousers 27 jean,blue _menswear shirt 15 polyester,lemon _menswear saucepan 82 s/steel,20cm _household pepper_mill 51 beech_wood,ceramic _household butchers_block 45 beech _household chefs_knife 11 french_style,8inch _household duff_line 56 lots of extra fields so match fails mixer 59 multi_function,dough_hook,blender,black _white_good +s microwave 40 850W,turntable,white _white_good +s duff_line 45 more extra fields so duff again duff_line no_numbers_and_only_one_further_field skirt 56 velvet,black_BUT_DUFF_COS_TRAILING_SPACE _womenswear + television 83 wide_screen,black _white_good +s skirt 39 tweed,brown _womenswear blouse 88 cotton,pink _womenswear deep_fat_fryer 27 white,non_stick _white_good +s blouse 55 silk,bronze_green _womenswear
        I formatted the data to make it obvious whether all aspects of the sort were working (which I think they are). I don't know if there would be a performance hit if sorting large amounts of data but this is a technique I have used a lot when sorting a few hundreds of items.
        @sort = map{...} sort{...} map{...} map{...} @list;
        or
        @sort = map{...} sort{...} grep{...} map{...} @list;
        or even
        @sort = map{...} sort{...} map{...} grep{...} map{...} @list;
        Cheers,

        JohnGG

Re: Rule-based sorting
by ton (Friar) on Apr 10, 2001 at 03:11 UTC
    Umm... I assume that the "any strings" of fields 1 and 3 cannot contain spaces? Otherwise we are in real trouble... how can you distinguish between spaces that are record seperators and those that are in the string?

    Example:

    Foo 23 45 Bar _$string1
    This could be parsed into the array ('Foo', 23, '45 Bar', '_$string1') or into the array ('Foo 23', 45, 'Bar', '_$string1')...