Re: Speeding up sort routines

Replies are listed 'Best First'.
Re: Re: Speeding up sort routines by arhuman (Vicar) on Jul 02, 2001 at 21:30 UTC
Just beccause Davorg didn't provide the link when mentionning "Guttman-Rosler Transform"... You'll find here the BEST ARTICLE ever written on sorting with Perl... (about Schwartzian Transform, Orcish Maneuvers, packed-string sortkey...) You'll not only learn all the most efficient techniques you'll also understand why they're faster. And why bwana147 suggestion is a good idea which can often be generalized... I ALWAYS cite it when there's a thread about sorting, and I really DO believe it's a MUST READ. (and that it must be on the tutorials page or Q&A or something like that...) "Only Bad Coders Code Badly In Perl" (OBC2BIP)	[reply]
Re: Re: Re: Speeding up sort routines by petral (Curate) on Jul 02, 2001 at 23:19 UTC
Really wandering from the immediate question, but for some other variations see also Re: Substring Sort. p	[reply]
Re: Re: Speeding up sort routines by nysus (Parson) on Jul 02, 2001 at 19:45 UTC
davorg: Date format: Jun 28, 2001 Time format: 16:46:36 $PM = "Perl Monk's"; $MCF = "Most Clueless ~~Friar~~ Abbot"; $nysus = $PM . $MCF; Click here if you love Perl Monks	[reply]
Re: Re: Re: Speeding up sort routines by davorg (Chancellor) on Jul 02, 2001 at 19:59 UTC
In that case, changing the comparison operator has changed the way your sort works (effectively breaking it). To demonstrate, try this code: `($a, $b) = ( '16:46:36', '16:00:00'); print 'cmp : ', $a cmp $b, "\n"; print '<=> : ', $a <=> $b, "\n";` [download] This prints `cmp : 1 <=> : 0` [download] Which shows that Perl thinks that the two strings are the same when using `<=>`. When doing a string comparison, Perl uses all of the characters in the string, but when doing a numeric comparison, it can only use the digits at the start of the string. It stops when it finds the first non-digit. Had you been using `-w` then you would have got a warning explaining what you were doing wrong. Also, doing a string comparison on dates in the format "Jun 28, 2001" is a bad idea as it will do an alphabetical sort - not in date order. I'd recommend doing a first pass on the data and converting it to YYYYMMDDhhmm, which you can then do a numeric sort on. -- <http://www.dave.org.uk> Perl Training in the UK <http://www.iterative-software.com>	[reply] [d/l] [select]
Re: Re: Re: Re: Speeding up sort routines by Anonymous Monk on Jul 02, 2001 at 20:31 UTC
I'm pointing this out even though you hinted at it. The reason the sort is so much faster with <=> is because perl thinks the two strings are equal and with all elements in a list being equal no swaps have to be made when sorting and with perl's sort algorithm less comparisons have to be made.	[reply]
Re: Re: Re: Speeding up sort routines by lhoward (Vicar) on Jul 02, 2001 at 19:53 UTC
You're going to want to parse the date format into something nicer before you compare them. Comparing them as they are will not give the results you want: Comparing "Jan 15, 2000" to "Apr 12, 2001" with cmp will tell you that the April, 2001 date is before the Jan 2000 date. My preference would be to combine the date and timestamp and convert it to a unix timestamp (seconds since Jan 1, 1970). Then you can just compare the timestamps numericly. I like to use Time::ParseDate for these types of conversions. Also, using split instead of the regular expression for your IP #'s will probably be quicker.	[reply]
Re: Re: Re: Speeding up sort routines by Anonymous Monk on Jul 02, 2001 at 20:00 UTC
How did you actually want to sort your data then? If you wanted to sort in date order neither a <=> or a cmp will be appropriate for that date format. You could try using a yyyymmdd type format or convert the date and time into an epoch time (and sort that with <=>) or look at any functionality provided by any Date modules. (I'm not really familiar with any of these, so I'd be interested to hear about any date sorting they might provide)	[reply]
Re: Re: Re: Speeding up sort routines by bwana147 (Pilgrim) on Jul 02, 2001 at 20:07 UTC
Such a date format should never be used for anything but printing out. Prefer yyyymmdd, it's simpler to sort. Anyway, explicitly converting your dates is always a good idea: it ensures you don't depend on locales or on the American stupid middle-endian format. --bwana147	[reply]