in reply to Speeding up sort routines

What format are your date and time elements stored in?

This sounds like an occasion where a Schwartzian Transform or (better) a Guttman-Rosler Transform could get that time down quite a bit more. But I'd need more details before writing one.

--
<http://www.dave.org.uk>

Perl Training in the UK <http://www.iterative-software.com>

Replies are listed 'Best First'.
Re: Re: Speeding up sort routines
by arhuman (Vicar) on Jul 02, 2001 at 21:30 UTC
    Just beccause Davorg didn't provide the link when mentionning
    "Guttman-Rosler Transform"...
    You'll find here the BEST ARTICLE ever written on sorting with Perl...
    (about Schwartzian Transform, Orcish Maneuvers, packed-string sortkey...)

    You'll not only learn all the most efficient techniques you'll also understand why they're faster.
    And why bwana147 suggestion is a good idea which can often be generalized...

    I ALWAYS cite it when there's a thread about sorting,
    and I really DO believe it's a MUST READ.
    (and that it must be on the tutorials page or Q&A or something like that...)


    "Only Bad Coders Code Badly In Perl" (OBC2BIP)
      Really wandering from the immediate question, but for some other variations see also Re: Substring Sort.

        p

Re: Re: Speeding up sort routines
by nysus (Parson) on Jul 02, 2001 at 19:45 UTC

      In that case, changing the comparison operator has changed the way your sort works (effectively breaking it).

      To demonstrate, try this code:

      ($a, $b) = ( '16:46:36', '16:00:00'); print 'cmp : ', $a cmp $b, "\n"; print '<=> : ', $a <=> $b, "\n";

      This prints

      cmp : 1 <=> : 0

      Which shows that Perl thinks that the two strings are the same when using <=>. When doing a string comparison, Perl uses all of the characters in the string, but when doing a numeric comparison, it can only use the digits at the start of the string. It stops when it finds the first non-digit. Had you been using -w then you would have got a warning explaining what you were doing wrong.

      Also, doing a string comparison on dates in the format "Jun 28, 2001" is a bad idea as it will do an alphabetical sort - not in date order.

      I'd recommend doing a first pass on the data and converting it to YYYYMMDDhhmm, which you can then do a numeric sort on.

      --
      <http://www.dave.org.uk>

      Perl Training in the UK <http://www.iterative-software.com>

        I'm pointing this out even though you hinted at it. The reason the sort is so much faster with <=> is because perl thinks the two strings are equal and with all elements in a list being equal no swaps have to be made when sorting and with perl's sort algorithm less comparisons have to be made.
      You're going to want to parse the date format into something nicer before you compare them. Comparing them as they are will not give the results you want:

      Comparing "Jan 15, 2000" to "Apr 12, 2001" with cmp will tell you that the April, 2001 date is before the Jan 2000 date.

      My preference would be to combine the date and timestamp and convert it to a unix timestamp (seconds since Jan 1, 1970). Then you can just compare the timestamps numericly. I like to use Time::ParseDate for these types of conversions.

      Also, using split instead of the regular expression for your IP #'s will probably be quicker.

      How did you actually want to sort your data then? If you wanted to sort in date order neither a <=> or a cmp will be appropriate for that date format.
      You could try using a yyyymmdd type format or convert the date and time into an epoch time (and sort that with <=>) or look at any functionality provided by any Date modules. (I'm not really familiar with any of these, so I'd be interested to hear about any date sorting they might provide)

      Such a date format should never be used for anything but printing out. Prefer yyyymmdd, it's simpler to sort.

      Anyway, explicitly converting your dates is always a good idea: it ensures you don't depend on locales or on the American stupid middle-endian format.

      --bwana147