
Re: Slow at sorting?

by orbital (Scribe)
on Nov 22, 2001 at 01:31 UTC (#126863=note)

in reply to Slow at sorting?

clintp, sorry about just listing the file size; I should have known better. The log file I was testing on has a total of 1,401,986 lines, 1,401,950 of which match my criteria.

I gave a sample of the matching lines in my previous post, but in case you missed it, here you go.

CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".
CD2\01100809.pdf(1) - [Invoice Date] Indexed key "10/08/2001".
CD1\01100809.pdf(1) - [Customer Name] Indexed key "FOOBAR".
CD2\01100809.pdf(1) - [Contact Name] Indexed key "Dr. FOO".
CD4\01100809.pdf(20) - [Account Number] Indexed key "54356564".

If you really want the full log file, I could strip out all the sensitive data and put it somewhere you can grab it. If this is something you really want to play with, let me know...

Re: Re: Slow at sorting?
by dws (Chancellor) on Nov 22, 2001 at 01:47 UTC
    CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".

    According to an earlier post, the bold fields are the ones you're sorting on. If that's truly the case, and if the text within brackets comes from a limited dictionary, then this sort can be done entirely with numbers.

    First, combine the CD number with the file number, yielding (in this case)

        101100809

    Then map "Account Number" into its pre-determined sequence number. You're then left sorting something like

        [ 101100809, 47, seek-address ]

    on the first two fields, which are now numbers.
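    A minimal sketch of this numeric-key idea, assuming a small hand-built dictionary for the bracketed labels (the labels, sequence numbers, and sample line below are illustrative assumptions, not taken from the actual log):

    ```perl
    use strict;
    use warnings;

    # Hypothetical label -> pre-determined sequence number (assumed values)
    my %label_seq = (
        'Account Number' => 1,
        'Invoice Date'   => 2,
        'Customer Name'  => 3,
        'Contact Name'   => 4,
    );

    my $line = 'CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".';
    if ( $line =~ m/^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/ ) {
        my $combined = "$1$2";          # CD number . file number -> 101100809
        my $seq      = $label_seq{$4};  # label's sequence number
        # sort tuple: [ combined number, label sequence, seek-address ]
        print "[$combined, $seq, <offset>]\n";
    }
    ```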

      How about extending it to something like

          pack("n N n", $1, $2, $3) . $4

      That way we would sort CD# > 9 correctly, decrease the memory footprint, and allow for the fast intrinsic ASCIIbetical sort.
      Perhaps even to the point where fast hash lookups can be used.

      It's a pity the data is, in practice, unavailable.

      Here's what I would like to time ...

      my %sort_hash;
      my $offset = 0;
      while (<IN>) {
          if ( m/^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/ ) {
              $sort_hash{ pack("n N n", $1, $2, $3) . $4 } = $offset;
          }
          $offset = tell(IN);
      }
      foreach my $k (sort keys %sort_hash) {
          seek IN, $sort_hash{$k}, 0;
          print OUT scalar(<IN>);
      }
        And then, why not pack the offset onto the end of the key, and unpack the last four bytes with substr when you're done? Then you're down to a simple array sort, and the only problem is the original one: sooner or later you scale up until you're paging forever.
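        A sketch of that packed-offset variant, assuming the same regex as above and hypothetical input/output filenames (the label field is variable-length, so the four offset bytes simply ride along at the end of each key):

        ```perl
        use strict;
        use warnings;

        open my $in,  '<', 'indexing.log' or die $!;  # filename is an assumption
        open my $out, '>', 'sorted.log'   or die $!;

        my @keys;
        my $offset = 0;
        while (<$in>) {
            if (m/^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/) {
                # append the seek offset to the end of the sort key
                push @keys, pack("n N n", $1, $2, $3) . $4 . pack("N", $offset);
            }
            $offset = tell($in);
        }
        for my $k (sort @keys) {
            # the packed offset is the last four bytes of the key
            seek $in, unpack("N", substr($k, -4)), 0;
            print {$out} scalar(<$in>);
        }
        ```

        This trades the hash for a flat array, so the per-key bookkeeping overhead goes away, but the keys themselves still have to fit in memory.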

