Sorting Numbers & Text

PriNet has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Sorting Numbers & Text by davido (Cardinal) on Jul 12, 2012 at 23:24 UTC
It's not as easy as it used to be... and actually it never worked all that well. We've always had to deal with the problem of "Rabbit" being sorted above "apple" because of case issues (that could be dealt with by normalizing the case inside of the sort routine, and then falling back to case-sensitive sorting in case of equality). And we've always had the issue of string comparisons of numbers numerically incorrectly. But now we also have to worry about Unicode issues. Additionally, you do want numbers to be sorted numerically, and alpha characters to be sorted alphabetically, I believe. I think that most dictionary style sorts put numbers ahead of alpha characters as well (you seemed to be looking for the opposite, but I'll ignore that for a moment). Here's one way to do it that should be Unicode safe: use strict; use warnings; use utf8; use Unicode::Collate::Locale; use Scalar::Util 'looks_like_number'; use feature qw/say unicode_strings/; binmode STDOUT, ':utf8'; my @unsorted = qw( 041351920234 Rabbit 0343120 041271024500 000000343119 0430870 Apple 041460301399 ); my $collator = Unicode::Collate::Locale->new(locale => 'en'); my @sorted = sort { ( looks_like_number($a) && looks_like_number($b) && $a <=> $b ) \|\| $collator->getSortKey($a) cmp $collator->getSortKey($b) } @unsorted; say for @sorted; [download] Pretty ugly, and still falls back to alphabetical sorting when a given string has non-numeric characters embedded within it. I'd love for someone to come along and show a better way to do it, as this just seems messy. Dave	[reply] [d/l]
Re: Sorting Numbers & Text by johngg (Canon) on Jul 12, 2012 at 21:21 UTC
Do the sort using a Schwartzian Transform by storing an indicator on whether the value is alpha or numeric as well as the value itself. The outer sort is alpha then numeric and the inner sort comparator is chosen in a ternary depending on the indicator. `knoppix@Microknoppix:~$ perl -E ' > my @values = qw{ > 041351920234 > Rabbit > 0343120 > 041271024500 > 0430870 > Apple > 041460301399 > }; > > say for > map { $_->[ 0 ] } > sort { > $a->[ 1 ] <=> $b->[ 1 ] > \|\| > $a->[ 1 ] > ? $a->[ 0 ] <=> $b->[ 0 ] > : $a->[ 0 ] cmp $b->[ 0 ] > } > map { [ $_, m{^\d} ? 1 : 0 ] } > @values;' Apple Rabbit 0343120 0430870 041271024500 041351920234 041460301399 knoppix@Microknoppix:~$` [download] I hope this is helpful. Cheers, JohnGG	[reply] [d/l]
Re^2: Sorting Numbers & Text by frozenwithjoy (Priest) on Jul 12, 2012 at 21:36 UTC
I get the same result with: `#!/usr/bin/env perl use strict; use warnings; use Data::Printer; my @unsorted = qw( 041351920234 Rabbit 0343120 041271024500 0430870 Ap +ple 041460301399); my @sorted = sort { $a <=> $b \|\| $a cmp $b } @unsorted; p @unsorted; p @sorted;` [download] Is your way more efficient memory-wise? Also, both of our approaches are giving me warnings like: `Argument "Rabbit" isn't numeric in numeric comparison (<=>) at...`.	[reply] [d/l] [select]
Re^3: Sorting Numbers & Text by johngg (Canon) on Jul 12, 2012 at 22:10 UTC
It seems that the warnings are due to a precedence problem that can be fixed with parentheses, `knoppix@Microknoppix:~$ perl -Mstrict -Mwarnings -E ' > my @values = qw{ > 041351920234 > Rabbit > 0343120 > 041271024500 > 0430870 > Apple > 041460301399 > }; > > say for > map { $_->[ 0 ] } > sort { > ( $a->[ 1 ] <=> $b->[ 1 ] ) > \|\| > ( > $a->[ 1 ] > ? $a->[ 0 ] <=> $b->[ 0 ] > : $a->[ 0 ] cmp $b->[ 0 ] > ) > } > map { [ $_, m{^\d} ? 1 : 0 ] } > @values;' Apple Rabbit 0343120 0430870 041271024500 041351920234 041460301399 knoppix@Microknoppix:~$` [download] Thanks for pointing it out. Cheers, JohnGG	[reply] [d/l]
Re^2: Sorting Numbers & Text by PriNet (Monk) on Jul 12, 2012 at 21:40 UTC
I start with a "new" value that isn't in the "list(file)" yet (either alpha or numeric), then start reading out the values one at a time from the sourcefile, compare the two, print the "smallest" back to a tempfile while retaining the largest for the next read... and so on... (kind-of-a-bubble-sort) then swap the files back at the end. its the "comparing" of mixed types that i get stuck on. and i'm worried about running out of memory if i just load into array and <=> sort function (which i may end up doing heh) but you have given me an idea about "flagging" the first character as alpha or numeric ... hmmmm I did try re-inventing the wheel... But it kept getting stuck on the corners	[reply]
Re^3: Sorting Numbers & Text by johngg (Canon) on Jul 12, 2012 at 22:25 UTC
Another approach might be to read chunks of your very large database, perhaps 100k to 500k records at a time, and sort each chunk into its own temporary file. Once you have read and sorted all of the data, do a sort/merge of the temporary files into a final sorted file. My gut feeling is that this would be more efficient than the "two at a time" approach you are taking. Cheers, JohnGG	[reply]
Re^3: Sorting Numbers & Text by johngg (Canon) on Jul 13, 2012 at 18:41 UTC
Thinking about it further, read a chunk of your database then sort and print to two temporary files, one for letters, one for numbers. Then the sort/merge of the temporary files will be simpler keeping the two categories separate. Finally you can concatenate the letters and numbers merged files for your results file. In this code I am writing to in-memory scalars rather than disk files just to keep things tidy. knoppix@Microknoppix:~$ perl -Mstrict -Mwarnings -E ' > my @values = qw{ > 041351920234 > Rabbit > 0343120 > 041271024500 > 0430870 > Apple > 041460301399 > }; > > my $rsLets = do { \ my $lets }; > open my $letsFH, q{>}, $rsLets or die $!; > my $rsNums = do { \ my $nums }; > open my $numsFH, q{>}, $rsNums or die $!; > > say { $_->[ 1 ] ? $numsFH : $letsFH } $_->[ 0 ] for > sort { > ( $a->[ 1 ] <=> $b->[ 1 ] ) > \|\| > ( > $a->[ 1 ] > ? $a->[ 0 ] <=> $b->[ 0 ] > : $a->[ 0 ] cmp $b->[ 0 ] > ) > } > map { [ $_, m{^\d} ? 1 : 0 ] } > @values; > > say ${ $rsLets }, q{-----------------}; > say ${ $rsNums }, q{-----------------};' Apple Rabbit ----------------- 0343120 0430870 041271024500 041351920234 041460301399 ----------------- knoppix@Microknoppix:~$ [download] I hope this is of interest. Cheers, JohnGG	[reply] [d/l]
Re: Sorting Numbers & Text by dasgar (Priest) on Jul 12, 2012 at 21:29 UTC
...because I'm using a very large database and don't want to have "not-enough" ram/memory problems. Since you're using data from a "very large database", it would seem that you're doing some kind of SQL query to retrieve the data from that database. If that's true, would it be possible to add a "sort by" clause to the SQL query to get the database to doing the sorting for you?	[reply]
Re^2: Sorting Numbers & Text by PriNet (Monk) on Jul 12, 2012 at 21:50 UTC
uhhmmm.... you know? if i knew how to do sql, i probably wouldn't have half the problems i get heh, but alas, i have a brain block everytime i try to learn it... kinda like learning linux, which also would fix alot of my problems heh I did try re-inventing the wheel... But it kept getting stuck on the corners	[reply]
Re^3: Sorting Numbers & Text by PriNet (Monk) on Jul 12, 2012 at 22:50 UTC
ok, heres one to really twist your coaxial pair... if (($OldItem > $NewItem) \|\| (($OldItem == $NewItem) && ($OldItem gt $NewItem))) {} the first catches and "sorts" numbers... \|\| ... the second checks to see if "alpha" and sorts texts... watcha think? i know there "has" to be a flaw...heh I did try re-inventing the wheel... But it kept getting stuck on the corners	[reply]
Re^2: Sorting Numbers & Text by pvaldes (Chaplain) on Jul 13, 2012 at 15:33 UTC
First of all index your database if not done yet, (or vacuum, analyze and refresh your existent indexes if necessary) and then ask your database to sort your results as indicated by dasgar. if i knew how to do sql, i probably wouldn't have half the problems i get... Don't panic. Take a look to the sql keyword: "order by"	[reply]
Re: Sorting Numbers & Text by kcott (Archbishop) on Jul 13, 2012 at 06:18 UTC
This gives the output you wanted: `$ perl -Mstrict -Mwarnings -MScalar::Util=looks_like_number -E ' my @values = qw{041351920234 Rabbit 0343120 041271024500 0430870 Apple + 041460301399}; say for sort { looks_like_number($a) && looks_like_number($b) ? $a <=> $b : looks_like_number($a) ? 1 : looks_like_number($b) ? -1 : $a cmp $b } @values; ' Apple Rabbit 0343120 0430870 041271024500 041351920234 041460301399` [download] -- Ken	[reply] [d/l]
Re: Sorting Numbers & Text by aaron_baugher (Curate) on Jul 13, 2012 at 01:50 UTC
This seems a little like cheating, but one way around the conflict would be to split your database lookup in twain: `select myid from mytable where myid rlike '^[A-Z].' order by myid; select myid from mytable where myid rlike '^[0-9].' order by abs(myid +);` [download] Process them in that order, and you'll have your alpha ones ahead of your numerical ones, and both already sorted -- case-insensitive string sort on the first command, and numeric sort on the second. If you need to do any further sorting on either batch, it should be simpler. Also, letting the database engine order them should be faster than doing it in Perl after fetching them, especially if you put an index on the column in question. Aaron B. Available for small or large Perl jobs; see my home node.	[reply] [d/l]