Inconsistent column Schwartzian Transform attempt

gryphon has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all,

I had an interesting problem come up recently, and I thought, "This would be a great opportunity to really sit down and mentally absorb the coolness of the Schwartzian Transform." What I need to do is take a series of strings that can be split into columns and sort them based on each column. The problem is that the "columns" from each string will not be consistent.

What I've come up with thus far seems to work with all the test cases I've thrown against it. However, I have some serious concerns about certain parts, and I'm also a little worried about efficiency.

#!/usr/bin/perl -w
use strict;

my @unordered = (
  '12 Corinthians 5:10',
  'Hebrews 11:15',
  '1 Corinthians 13:23',
  '2 Corinthians 1:3',
  'John 3:16',
  '1 Corinthians 12:10',
  '1 Corinthians 2:10',
  '1 Corinthians 2:10'
);

my @ordered = map { $_->[0] }
sort {
  for (my $x=1; $x<$#{$a}+1; $x++) {
    if (($a->[$x] =~ /\D/) || ($b->[$x] =~ /\D/)) {
      my $compare = $a->[$x] cmp $b->[$x];
      return $compare if ($compare != 0);
    } else {
      my $compare = $a->[$x] <=> $b->[$x];
      return $compare if ($compare != 0);
    }
  }
  return 0;
}
map { [ $_, split(/[ :]/) ] }
@unordered;

print join "\n", @ordered;
exit;
[download]

One worry is with the upper limit to $x in the for loop. It only deals with $a. Then there's the inside of the for loop. I have this feeling that it's totally not efficient. Is there a nifty way to run the dual compare types without having to build the if? I added the $compare step so as to avoid having to run the compare twice in each case, but is there a better way? Ultimately, I'd like this to work with strings like "X 2 4 S w r y" where case may be important as well.

-gryphon
code('Perl') || die;

Comment on Inconsistent column Schwartzian Transform attempt Download Code

Replies are listed 'Best First'.
Re: Inconsistent column Schwartzian Transform attempt by dragonchild (Archbishop) on Mar 15, 2002 at 21:03 UTC
I have absolutely no idea what you're doing in your sort. I'm assuming that you want to sort on the following (in order): Book name Chapter Verse In that case, you don't want to do a split, but, rather, a regex. `map { [ $_, /(?:(\d+)\s+)?(\w)\s+(\d+):(\d+)/ ] }` [download] Then, in your sort, you now compare in the following: `sort { $a->[2] cmp $b->[2] \|\| ( defined $a->[1] && defined $b->[1] && $a->[1] <=> $b->[1]) \|\| $a->[2] <=> $b->[2] \|\| $a->[3] <=> $b->[3] }` [download] The reason for both defined checks is that you need to make sure that both sides have a number. You might have '1 Hebrews' vs. 'Hebrews'. You don't want to try and make that comparison. If you wanted, you could fake 'Hebrews' into '0 Hebrews' by improving the sort a bit to say `sort { $a->[2] cmp $b->[2] \|\| ($a->[1] \|\| 0) <=> ($b->[1] \|\| 0) \|\| $a->[2] <=> $b->[2] \|\| $a->[3] <=> $b->[3] }` [download] Putting it all together, you end up with: `@sorted = map { $_-[0] } sort { $a->[2] cmp $b->[2] \|\| ($a->[1] \|\| 0) <=> ($b->[1] \|\| 0) \|\| $a->[2] <=> $b->[2] \|\| $a->[3] <=> $b->[3] } map { [ $_, /(?:(\d+)\s+)?(\w)\s+(\d+):(\d+)/ ] } @unsorted;` [download] (Note - this is untested code. Caveat Emptor!) ------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.	[reply] [d/l] [select]
Re: Inconsistent column Schwartzian Transform attempt by abstracts (Hermit) on Mar 15, 2002 at 21:01 UTC
Hello Your columns are kind of consistent: `my @unordered = ( '12 Corinthians 5:10', 'Hebrews 11:15', '1 Corinthians 13:23', '2 Corinthians 1:3', 'John 3:16', '1 Corinthians 12:10', '1 Corinthians 2:10', '1 Corinthians 2:10' );` [download] Basically, all your lines match `/^(\d+ )?(\D+) (\d+):(\d+)$/.` [download] Here is how it can be done assuming you want to sort by the first number, then by the word, then by the number before the ':', then finally by the number after the ':'. #!/usr/bin/perl -w my @unordered = ( '12 Corinthians 5:10', 'Hebrews 11:15', '1 Corinthians 13:23', '2 Corinthians 1:3', 'John 3:16', '1 Corinthians 12:10', '1 Corinthians 2:10', '1 Corinthians 2:10' ); my @sorted = map { $_->[0] } sort { ($a->[1]\|\|0) <=> ($b->[1]\|\|0) \|\| $a->[2] cmp $b->[2] \|\| $a->[3] <=> $b->[3] \|\| $a->[4] <=> $b->[4] } map { [$_, /^(\d+ )?([A-Za-z]+) (\d+):(\d+)$/] } @unordered; print "$_\n" for @sorted; # Hebrews 11:15 # John 3:16 # 1 Corinthians 2:10 # 1 Corinthians 2:10 # 1 Corinthians 12:10 # 1 Corinthians 13:23 # 2 Corinthians 1:3 # 12 Corinthians 5:10 [download] Hope this helps,,, Aziz,,,	[reply] [d/l] [select]
Re: Inconsistent column Schwartzian Transform attempt by RMGir (Prior) on Mar 15, 2002 at 21:01 UTC
It gets simpler if you normalize the data in the first step. `#!/usr/bin/perl -w use strict; my @unordered = ( '12 Corinthians 5:10', 'Hebrews 11:15', '1 Corinthians 13:23', '2 Corinthians 1:3', 'John 3:16', '1 Corinthians 12:10', '1 Corinthians 2:10', '1 Corinthians 2:10' ); my @ordered = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [ $_, sprintf("%05d %-20s %04d %04d",/^\d/?():(99999), split(/[ +:]/)) ] } @unordered; print join "\n", @ordered; exit;` [download] By adding the `/^\d/?():(99999)` [download] and sprintf'ing the string so it compares lexicographically, you're in good shape, I think. If you had a huge set of data to work with, you might want to do it this way instead, which I think might be faster (and might be the GR transform, I'm not sure): `# ... my @ordered = map { (split /\t/)[1] } sort { $a cmp $b } map { sprintf("%05d %-20s %04d %04d", /^\d/?():(99999), split(/[ :]/)) ."\t$_" } @unordered; # ...` [download] Make sense? -- Mike (Edit: removed debugging prints from sort block)	[reply] [d/l] [select]
And lexicographically means... by RMGir (Prior) on Mar 16, 2002 at 16:36 UTC
...make sure you sprintf numbers with %0xxd, where the xx is enough space for the widest number in that column, and make sure you format strings with %-yys where yy is enough space for the widest string in that column. That way, the fields will "cmp" correctly. If you need keys sorted in a different order, just sprintf them differently. For descending numeric keys, sprintf 9999999-$fld. For descending string keys... Hmmm, don't have an easy answer to that one. Of course, as long as you don't need some string keys desc and others asc, you can just reverse the order of the cmp call in the sort block. Hope this helps! -- Mike	[reply]
Re: And lexicographically means... by Juerd (Abbot) on Mar 16, 2002 at 18:17 UTC
If you need keys sorted in a different order, just sprintf them differently. For descending numeric keys, sprintf 9999999-$fld. For descending string keys... Hmmm, don't have an easy answer to that one. `# Ascending sort { $a cmp $b } LIST sort { $a <=> $b } LIST # Descending sort { $b cmp $a } LIST sort { $b <=> $a } LIST` [download] I think using 9999999-$foo is an ugly way to do a descending sort, and I think sprintfing to sort numbers using cmp is evil, because there's already the <=> operator and you can use every routine you want for sorting - just reverse it if you want a descending sort. U28geW91IGNhbiBhbGwgcm90MTMgY W5kIHBhY2soKS4gQnV0IGRvIHlvdS ByZWNvZ25pc2UgQmFzZTY0IHdoZW4 geW91IHNlZSBpdD8gIC0tIEp1ZXJk	[reply] [d/l]
It's an efficiency thing, not an elegance thing by RMGir (Prior) on Mar 16, 2002 at 19:41 UTC
Re: And lexicographically means... by I0 (Priest) on Mar 16, 2002 at 18:46 UTC
For descending string keys sprintf ~$fld	[reply]