Re: Sorting Numbers & Text
by davido (Cardinal) on Jul 12, 2012 at 23:24 UTC
|
It's not as easy as it used to be... and actually it never worked all that well. We've always had to deal with the problem of "Rabbit" being sorted above "apple" because of case issues (that could be dealt with by normalizing the case inside of the sort routine, and then falling back to case-sensitive sorting in case of equality). And we've always had the issue of string comparisons of numbers numerically incorrectly. But now we also have to worry about Unicode issues.
Additionally, you do want numbers to be sorted numerically, and alpha characters to be sorted alphabetically, I believe. I think that most dictionary style sorts put numbers ahead of alpha characters as well (you seemed to be looking for the opposite, but I'll ignore that for a moment).
Here's one way to do it that should be Unicode safe:
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use Scalar::Util 'looks_like_number';
use feature qw/say unicode_strings/;
binmode STDOUT, ':utf8';
my @unsorted = qw(
041351920234
Rabbit
0343120
041271024500
000000343119
0430870
Apple
041460301399
);
my $collator = Unicode::Collate::Locale->new(locale => 'en');
my @sorted = sort {
( looks_like_number($a) && looks_like_number($b) && $a <=> $b )
|| $collator->getSortKey($a) cmp $collator->getSortKey($b)
} @unsorted;
say for @sorted;
Pretty ugly, and still falls back to alphabetical sorting when a given string has non-numeric characters embedded within it.
I'd love for someone to come along and show a better way to do it, as this just seems messy.
| [reply] [d/l] |
Re: Sorting Numbers & Text
by johngg (Canon) on Jul 12, 2012 at 21:21 UTC
|
Do the sort using a Schwartzian Transform by storing an indicator on whether the value is alpha or numeric as well as the value itself. The outer sort is alpha then numeric and the inner sort comparator is chosen in a ternary depending on the indicator.
knoppix@Microknoppix:~$ perl -E '
> my @values = qw{
> 041351920234
> Rabbit
> 0343120
> 041271024500
> 0430870
> Apple
> 041460301399
> };
>
> say for
> map { $_->[ 0 ] }
> sort {
> $a->[ 1 ] <=> $b->[ 1 ]
> ||
> $a->[ 1 ]
> ? $a->[ 0 ] <=> $b->[ 0 ]
> : $a->[ 0 ] cmp $b->[ 0 ]
> }
> map { [ $_, m{^\d} ? 1 : 0 ] }
> @values;'
Apple
Rabbit
0343120
0430870
041271024500
041351920234
041460301399
knoppix@Microknoppix:~$
I hope this is helpful.
| [reply] [d/l] |
|
|
I get the same result with:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Printer;
my @unsorted = qw( 041351920234 Rabbit 0343120 041271024500 0430870 Ap
+ple 041460301399);
my @sorted = sort { $a <=> $b || $a cmp $b } @unsorted;
p @unsorted;
p @sorted;
Is your way more efficient memory-wise? Also, both of our approaches are giving me warnings like: Argument "Rabbit" isn't numeric in numeric comparison (<=>) at.... | [reply] [d/l] [select] |
|
|
knoppix@Microknoppix:~$ perl -Mstrict -Mwarnings -E '
> my @values = qw{
> 041351920234
> Rabbit
> 0343120
> 041271024500
> 0430870
> Apple
> 041460301399
> };
>
> say for
> map { $_->[ 0 ] }
> sort {
> ( $a->[ 1 ] <=> $b->[ 1 ] )
> ||
> (
> $a->[ 1 ]
> ? $a->[ 0 ] <=> $b->[ 0 ]
> : $a->[ 0 ] cmp $b->[ 0 ]
> )
> }
> map { [ $_, m{^\d} ? 1 : 0 ] }
> @values;'
Apple
Rabbit
0343120
0430870
041271024500
041351920234
041460301399
knoppix@Microknoppix:~$
Thanks for pointing it out.
| [reply] [d/l] |
|
|
| [reply] |
|
|
Another approach might be to read chunks of your very large database, perhaps 100k to 500k records at a time, and sort each chunk into its own temporary file. Once you have read and sorted all of the data, do a sort/merge of the temporary files into a final sorted file. My gut feeling is that this would be more efficient than the "two at a time" approach you are taking.
| [reply] |
|
|
Thinking about it further, read a chunk of your database then sort and print to two temporary files, one for letters, one for numbers. Then the sort/merge of the temporary files will be simpler keeping the two categories separate. Finally you can concatenate the letters and numbers merged files for your results file. In this code I am writing to in-memory scalars rather than disk files just to keep things tidy.
knoppix@Microknoppix:~$ perl -Mstrict -Mwarnings -E '
> my @values = qw{
> 041351920234
> Rabbit
> 0343120
> 041271024500
> 0430870
> Apple
> 041460301399
> };
>
> my $rsLets = do { \ my $lets };
> open my $letsFH, q{>}, $rsLets or die $!;
> my $rsNums = do { \ my $nums };
> open my $numsFH, q{>}, $rsNums or die $!;
>
> say { $_->[ 1 ] ? $numsFH : $letsFH } $_->[ 0 ] for
> sort {
> ( $a->[ 1 ] <=> $b->[ 1 ] )
> ||
> (
> $a->[ 1 ]
> ? $a->[ 0 ] <=> $b->[ 0 ]
> : $a->[ 0 ] cmp $b->[ 0 ]
> )
> }
> map { [ $_, m{^\d} ? 1 : 0 ] }
> @values;
>
> say ${ $rsLets }, q{-----------------};
> say ${ $rsNums }, q{-----------------};'
Apple
Rabbit
-----------------
0343120
0430870
041271024500
041351920234
041460301399
-----------------
knoppix@Microknoppix:~$
I hope this is of interest.
| [reply] [d/l] |
Re: Sorting Numbers & Text
by dasgar (Priest) on Jul 12, 2012 at 21:29 UTC
|
...because I'm using a
very large database and don't want to have "not-enough"
ram/memory problems.
Since you're using data from a "very large database", it would seem that you're doing some kind of SQL query to retrieve the data from that database. If that's true, would it be possible to add a "sort by" clause to the SQL query to get the database to doing the sorting for you?
| [reply] |
|
|
| [reply] |
|
|
| [reply] |
|
|
First of all index your database if not done yet, (or vacuum, analyze and refresh your existent indexes if necessary) and then ask your database to sort your results as indicated by dasgar.
if i knew how to do sql, i probably wouldn't have half the problems i get...
Don't panic. Take a look to the sql keyword: "order by"
| [reply] |
Re: Sorting Numbers & Text
by kcott (Archbishop) on Jul 13, 2012 at 06:18 UTC
|
$ perl -Mstrict -Mwarnings -MScalar::Util=looks_like_number -E '
my @values = qw{041351920234 Rabbit 0343120 041271024500 0430870 Apple
+ 041460301399};
say for sort {
looks_like_number($a) && looks_like_number($b) ? $a <=> $b
: looks_like_number($a) ? 1
: looks_like_number($b) ? -1
: $a cmp $b
} @values;
'
Apple
Rabbit
0343120
0430870
041271024500
041351920234
041460301399
| [reply] [d/l] |
Re: Sorting Numbers & Text
by aaron_baugher (Curate) on Jul 13, 2012 at 01:50 UTC
|
This seems a little like cheating, but one way around the conflict would be to split your database lookup in twain:
select myid from mytable where myid rlike '^[A-Z].*' order by myid;
select myid from mytable where myid rlike '^[0-9].*' order by abs(myid
+);
Process them in that order, and you'll have your alpha ones ahead of your numerical ones, and both already sorted -- case-insensitive string sort on the first command, and numeric sort on the second. If you need to do any further sorting on either batch, it should be simpler. Also, letting the database engine order them should be faster than doing it in Perl after fetching them, especially if you put an index on the column in question.
Aaron B.
Available for small or large Perl jobs; see my home node.
| [reply] [d/l] |