Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

complex sort

by mkmcconn (Chaplain)
on Dec 25, 2001 at 02:27 UTC ( [id://134235]=perlquestion: print w/replies, xml ) Need Help??

mkmcconn has asked for the wisdom of the Perl Monks concerning the following question:

I was given the following problem at work. A DATA sample and my solution follows the READMORE tag.

Problem: Print a directory of engineering drawings in the order determined by drawing number data contained in the filename:

 
 TYPE  Part  (Detail) Revision (ALT) layer
 1D     1              R1       B1    .vec
 18D    20     A       R0             .ras

Sort order:

  1. layer,
  2. reverse(ALT),
  3. TYPE,
  4. Part,
  5. Revision,
  6. Detail

ALT, if it exists, must be queued in reverse order (e.g. 3,2,1,'')
  • How would you have done it?
  • How does your solution scale better or work more reliably?
  • How would you make the solution easier to understand and maintain?

#!/usr/bin/perl -wl use strict; for (sort{ # TYPE: $t0_$t1_ # Part: $p0_ # Det : $pda # Rev : $r_ # ALT : $alt_ # Layer $l_ my ($t0a,$t1a,$p0a,$pda,$ra,$alta,$la) = map(m!(\d+)(\D+)(\d+)(\D*)R(\d+)B?(\d*)\.(\w+)$!,$a); my ($t0b,$t1b,$p0b,$pdb,$rb,$altb,$lb) = map(m!(\d+)(\D+)(\d+)(\D*)R(\d+)B?(\d*)\.(\w+)$!,$b); $la cmp $lb || ($altb eq ''? 0 :$altb) <=> ($alta eq '' ? 0 : $alta) || $t0a <=> $t0b || $t1a cmp $t1b || $p0a <=> $p0b || $ra cmp $rb || $pda cmp $pdb; }<DATA>){ chomp; print; } __DATA__ 18DD13AR0B2.vec 5A344R2B15.ras 5A344R2B15.vec 5A35SR0.ras 113A28R1.vec 113A28AR0.vec 113A29R0.ras 5A32R0.ras 113A29R1.vec 18AD13XR0B3.ras 18AD13XR0B3.vec 113A22R1.ras 5A35R0.vec 5A34R0.ras 113A22R1.vec 18DD12YR0B2.vec 18DD13AR0B2.ras 113A29AR0.ras 113A29AR2.vec 113A28R0.vec 5A33R0.vec 5A34R0.vec 5A35R0.ras 113A28AR2.vec 113B28R2.ras 113A28R2.vec 13B29AR0B1.ras 113B29AR0B1.vec 5A32R0.vec 5A33R0.ras

Explanation:

  • separate alpha and numeric values to sort separately,
  • compare in order of highest to lowest priority,
  • compare lower priority values only if higher order comparison returns '0'; mkmcconn
  • Replies are listed 'Best First'.
    (Ovid) Re: complex sort
    by Ovid (Cardinal) on Dec 25, 2001 at 03:58 UTC

      Using the same regex twice in a row suggests to me that there are some efficiency gains here. I used a Schwartzian transform, reordered the data (not really necessary), and did benchmarks. Complete code is below. The ST appears to run about twice as fast.

      Also, you have a string comparison where you needed a numberic comparison:

      $ra cmp $rb || # this is number. Shouldn't this be <= +> ?

      Also, at the end, I write out the arrays to a file so you can verify the sort order.

      #!/usr/bin/perl -wl use strict; use Benchmark; use vars qw/ @data /; use vars qw/ @results1 @results2 /; @data = qw/ 18DD13AR0B2.vec 5A344R2B15.ras 5A344R2B15.vec 5A35SR0.ras 113A28R1.vec 113A28AR0.vec 113A29R0.ras 5A32R0.ras 113A29R1.vec 18AD13XR0B3.ras 18AD13XR0B3.vec 113A22R1.ras 5A35R0.vec 5A34R0.ras 113A22R1.vec 18DD12YR0B2.vec 18DD13AR0B2.ras 113A29AR0.ras 113A29AR2.vec 113A28R0.vec 5A33R0.vec 5A34R0.vec 5A35R0.ras 113A28AR2.vec 113B28R2.ras 113A28R2.vec 13B29AR0B1.ras 113B29AR0B1.vec 5A32R0.vec 5A33R0.ras /; timethese(5000, { 'mkmcconn' => ' @results1 = (); for (sort{ my ($t0a,$t1a,$p0a,$pda,$ra,$alta,$la) = map(m!(\d+)(\D+)(\d+)(\D*)R(\d+)B?(\d*)\.(\w+)$!,$a); my ($t0b,$t1b,$p0b,$pdb,$rb,$altb,$lb) = map(m!(\d+)(\D+)(\d+)(\D*)R(\d+)B?(\d*)\.(\w+)$!,$b); $la cmp $lb || ($altb eq ""? 0 :$altb) <=> ($alta eq "" ? 0 : $alta) || $t0a <=> $t0b || $t1a cmp $t1b || $p0a <=> $p0b || $ra cmp $rb || $pda cmp $pdb; } @data ) { push @main::results1, $_; } ', 'Ovid' => ' @main::results2 = (); @main::results2 = map { $_->[0] } sort { $a->[1][0] cmp $b->[1][0] || $b->[1][1] <=> $a->[1][1] || $a->[1][2] <=> $b->[1][2] || $a->[1][3] cmp $b->[1][3] || $a->[1][4] <=> $b->[1][4] || $a->[1][5] <=> $b->[1][5] || $a->[1][6] cmp $b->[1][6] } map { [$_, main::get_data( $_ )] } @main::data; ', }); use Data::Dumper; open T, "> test.txt" or die $!; print T Dumper \@results1, \@results2; close T; sub get_data { local $_ = shift; my @data = ( m!(\d+)(\D+)(\d+)(\D*)R(\d+)B?(\d*)\.(\w+)$! ); # add default, if necessary $data[3] ||= ''; $data[4] ||= 0; $data[5] ||= 0; # reorder the data for sorting @data = @data[ 6, 5, 0, 1, 2, 4, 3 ]; return \@data; }

      Results:

      C:\>perl test.pl Benchmark: timing 5000 iterations of mkmcconn, Ovid ... mkmcconn: 69 wallclock secs (68.61 usr + 0.00 sys = 68.61 CPU) @ + 72.88/s (n=5000) Ovid: 32 wallclock secs (32.68 usr + 0.00 sys = 32.68 CPU) @ 15 +3.00/s (n=5000)

      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

    Re: complex sort
    by Juerd (Abbot) on Dec 25, 2001 at 03:06 UTC
      You understand sort well, so there's not much efficiency to be gained, I think.
      However, I would have used some complex data structure for readability. This also removes the need for temporary variables. (but adds an array)
      my @data = sort { $a->[7] cmp $b->[7] || $b->[6] <=> $a->[6] || # '' == 0 $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] || $a->[4] cmp $b->[4] || $a->[3] cmp $b->[3] } map { [ /^(\d+)(\D+)(\d+)(\D*)R(\d+)(B?)(\d*)\.(\w+)$/ ] } <DATA>; print "$_->[0]$_->[1]$_->[2]$_->[3]R$_->[4]$_->[5]$_->[6].$_->[7]\n" f +or @data;
      I've made some mistake in there, but I leave it up to you to find and fix it. For some reason, my version puts 113B28R2.ras two places too high (top==0) in the array.

      I think my version is more readable, and threrefore more reliable and maintainable (if the bug is fixed, that is ;))

      2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$


        a couple points,

        The bug is that you sort the second type position numerically instead of lexicographically. However, i don't like the way you leave the elements in @data as array references when you're through sorting. You should have finished up the Schwartz Transform so that the original data was still there, just reorganized. If we make those corrections we come up with:

        my @data = map { join '', @$_[0..3],'R',@$_[4..6],'.',$_->[7], "\n" } sort { $a->[7] cmp $b->[7] || $b->[6] <=> $a->[6] || $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] || $a->[2] <=> $b->[2] || $a->[4] <=> $b->[4] || # update: changed (see 1) $a->[3] cmp $b->[3] } map { [ /^(\d+)(\D+)(\d+)(\D*)R(\d+)(B?)(\d*)\.(\w+)$/ ] } <DATA>;
        jynx

        update: Unfortunately, before i even got to post this, Ovid posted a better solution that makes sure all the values are defined before being sorted as well.

        1: this was changed due to Ovid's note...

    Re: complex sort
    by grinder (Bishop) on Dec 25, 2001 at 03:09 UTC
      I haven't taken the time to actually download and run this code, but it looks pretty good to me. A few suggestions:

      • You can chomp <DATA> straight away before passing it to the sort routine. The fact that you do so afterwards, and don't match whitespace before the $ anchor makes me wonder whether the regexp really matches anything.
        sort {...} chomp <DATA>;

        Juerd correctly points out that you can't chomp DATA. In actual fact, you don't even need to chomp at all. You're not actually doing anything to the records. Get rid of the chomp and the -l switch and be done with it.

      • The sort function looks sufficiently complicated to merit using a Schwartz Transform to perform the split only once (split to a list and sort on the different elements).
        The data you want to sort lend themselves beautifully to the Guttman Rosler Transform which is faster than the Schwartz Transform. (aside: the previous sentence shows clearly why the correct terminology is 'Schwartz Transform', not 'Schwartzian Transform'). Here is the code to do just that:
        print map { substr( $_, 16 ) } sort map { /^(\d+) # digits of type (\D+) # type character (\d+) # part (\D*) # detail (optional) R(\d+) # revision count B?(\d*) # alternate count (optional) \.(vec|ras) # file extension \s+$ # trailing whitespace /x ? sprintf( '%3s%03d%03d%2s%03d%1d%1s', # numbers add up to + 16 $7, 999 - ($6 || 0), $1, $2, $3, $5, $4, ) . $_ : ('x' x 16) . $_ } <DATA>;

        The idea is that you add a prefix to the data you want to sort, in order to be able employ a bare sort. Once the array hits the sort code, you are running a C speed until the sort is done. No more perl op-codes for this baby. At the other end of the sort, you throw away the prefix.

        Note how I create the inverse of the alt count so that the normal compare still works. (The sprintf may have to be tailored to suit). Also note how I create a dummy prefix in case the regexp fails. For debugging, comment out the map that strips off the prefix.

      • On the question of readability/maintainability, I would use the extended regexp syntax in order to comment what you're looking for.

      • If you can use the // idiom to represent a regexp, then do so. Using m!! is unsettling.

      • For a sort subroutine that big, name it. I.e.,
        sub part_sort { ... } sort part_sort <DATA>;
        At least that way you can then set a breakpoint easily with b part_sort to see why the silly thing isn't working.

      • You don't need the for block at all
        print sort part_sort <DATA>;
        will do the job just as well.

      • I don't presume to understand your job, but looking at the results, is the type subordinate to the layer or is it the other way around? I guess it's one of weirder naming schemes I have come across.

      • As for your documentation, simply refer to http://www.perlmonks.org/index.pl?node_id=134235 in the comments. :)
      --
      g r i n d e r
      just another bofh

      print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u';
        You can chomp <DATA> straight away before passing it to the sort routine. [...]
        [...]
        sort part_sort chomp <DATA>;
        <DATA> is immutable, so you cannot chomp it (chomp actually tries to modify what it gets - it returns the number of removed characters, not a list of chomped strings).

        #!/usr/bin/perl print chomp <DATA>; __DATA__ This piece of code will trigger the following compilation error: Can't modify <HANDLE> in chomp at - line 2, near "<DATA>;" Execution of test aborted due to compilation errors.

        2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

    Re: complex sort
    by mkmcconn (Chaplain) on Dec 25, 2001 at 06:16 UTC

      Thank you all very much for your helpful suggestions. I combined them to form the following (which even (accidentally) managed to beat Ovid's by a hair):
      mkmcconn: 4 wallclock secs ( 4.66 usr + 0.00 sys = 4.66 CPU) @ 1073.88/s (n=5000)
      ovid: 7 wallclock secs ( 7.10 usr + 0.00 sys = 7.10 CPU) @ 704.13/s (n=5000)

      @result1 = map{join '',@$_[0..4,6,8]} sort{ $a->[8] cmp $b->[8] || ($b->[7] eq '' ? 0 : $b->[7]) <=> ($a->[7] eq '' ? 0 : $a->[7] +) # didn't manage to eliminate this without warnings || $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] || $a->[2] <=> $b->[2] || $a->[5] <=> $b->[5] || $a->[3] cmp $b->[3] } map {[m/(\d+)(\D+)(\d+)(\D*)(R(\d+))(B?(\d*))(\.\w+)\s*$/] } <DATA +>;

      mkmcconn
      corrected order

        You simplify the second sort condition from using ?: to using ||:

        # this: ($b->[7] eq '' ? 0 : $b->[7]) <=> ($a->[7] eq '' ? 0 : $a->[7] # to this: ($b->[7]||0) <=> ($a->[7]||0)
    Re: complex sort
    by George_Sherston (Vicar) on Dec 25, 2001 at 03:34 UTC
      "Use a nested hash", I thought. It's one of those thoughts that sounds like it offers a neater solution than it does. This does the job, but it's a little contrived. Having said that, once you've got the stuff into the data structure you may find there are lots of other great things you can do with it, so there may be some merit here that I haven't spotted. See how you like:
      my %sort; for (<DATA>) { m!(\d+)(\D+)(\d+)(\D*)R(\d+)B?(\d*)\.(\w+)$!; $sort{$7}{$6}{"$1$2"}{$3}{$5}{$4} = $_; } for my $first (sort keys %sort) { for my $second (reverse keys %{$sort{$first}}) { for my $third (sort keys %{$sort{$first}{$second}}) { for my $fourth (sort keys %{$sort{$first}{$second}{$third} +}) { for my $fifth (sort keys %{$sort{$first}{$second}{$thi +rd}{$fourth}}) { for my $sixth (sort keys %{$sort{$first}{$second}{ +$third}{$fourth}{$fifth}}) { print $sort{$first}{$second}{$third}{$fourth}{ +$fifth}{$sixth}; } } } } } }
      This prints out
      18AD13XR0B3.ras 18DD13AR0B2.ras 5A344R2B15.ras 13B29AR0B1.ras 113A22R1.ras 113A29R0.ras 113A29AR0.ras 113B28R2.ras 5A32R0.ras 5A33R0.ras 5A34R0.ras 5A35R0.ras 5A35SR0.ras 18AD13XR0B3.vec 18DD12YR0B2.vec 18DD13AR0B2.vec 5A344R2B15.vec 113B29AR0B1.vec 113A22R1.vec 113A28R0.vec 113A28AR0.vec 113A28R1.vec 113A28R2.vec 113A28AR2.vec 113A29R1.vec 113A29AR2.vec 5A32R0.vec 5A33R0.vec 5A34R0.vec 5A35R0.vec
      ... if that's not exactly what you wanted, you can probably tweak the code. And maybe some greater monk than I can make the recursive unpacking of the hash a bit shorter.

      Merry Christmas!

      § George Sherston

    Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Domain Nodelet?
    Node Status?
    node history
    Node Type: perlquestion [id://134235]
    Approved by root
    help
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this?Last hourOther CB clients
    Other Users?
    Others cooling their heels in the Monastery: (5)
    As of 2024-03-29 15:06 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      No recent polls found