Re: Slow at sorting?

clintp, sorry about just listing the file size; I should have known better. The log file I was testing on has 1,401,986 lines in total, 1,401,950 of which match my criteria.

I gave a sample of the matching lines in my previous post, but in case you missed it, here you go:

CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".
CD2\01100809.pdf(1) - [Invoice Date] Indexed key "10/08/2001".
CD1\01100809.pdf(1) - [Customer Name] Indexed key "FOOBAR".
CD2\01100809.pdf(1) - [Contact Name] Indexed key "Dr. FOO".
CD4\01100809.pdf(20) - [Account Number] Indexed key "54356564".

If you really want the full logfile, I could strip out all the sensitive data and put it somewhere you can grab it. If this is something you really want to play with, let me know...

Re: Re: Slow at sorting? by dws (Chancellor) on Nov 22, 2001 at 01:47 UTC
`CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".`

According to an earlier post, the bold fields are the ones you're sorting on. If that's truly the case, and if the text within brackets comes from a limited dictionary, then this sort can be done entirely with numbers. First, combine the CD number with the file number, yielding (in this case) `101100809`. Then map "Account Number" into its predetermined sequence number. You're then left sorting something like `[ 101100809, 47, seek-address ]` on the first two fields, which are now numbers.
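A minimal sketch of the all-numeric key described above, using the sample lines from the original post. The `%field_rank` dictionary and its ordering are hypothetical (the thread does not list the real field names or their sequence numbers), `numeric_key` is an illustrative helper name, and the running `$offset` stands in for `tell()` on a real file. Concatenating the CD number and file number only orders correctly if file numbers always have the same number of digits, which is assumed here.

```perl
use strict;
use warnings;

# Hypothetical dictionary: bracketed field name => sequence number.
# The real list and its ordering are not given in the thread.
my %field_rank = (
    'Account Number' => 1,
    'Contact Name'   => 2,
    'Customer Name'  => 3,
    'Invoice Date'   => 4,
);

# Build [ combined CD+file number, field rank, byte offset ] for one line.
# Returns an empty list if the line does not match or the field is unknown.
sub numeric_key {
    my ($line, $offset) = @_;
    return unless $line =~ /^CD(\d+)\\(\d+)\.pdf\(\d+\)\s-\s\[(.+?)\]/;
    my $rank = $field_rank{$3};
    return unless defined $rank;
    # "1" . "01100809" => 101100809 (assumes fixed-width file numbers)
    return [ "$1$2" + 0, $rank, $offset ];
}

my @sample = (
    'CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".',
    'CD2\01100809.pdf(1) - [Invoice Date] Indexed key "10/08/2001".',
    'CD1\01100809.pdf(1) - [Customer Name] Indexed key "FOOBAR".',
    'CD2\01100809.pdf(1) - [Contact Name] Indexed key "Dr. FOO".',
    'CD4\01100809.pdf(20) - [Account Number] Indexed key "54356564".',
);

my @records;
my $offset = 0;
for my $line (@sample) {
    my $rec = numeric_key($line, $offset);
    push @records, $rec if $rec;
    $offset += length($line) + 1;    # stand-in for tell() on a real file
}

# Two numeric comparisons; no string comparison is needed.
my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @records;

print join(' ', @$_), "\n" for @sorted;
```

The third element of each record is never compared; it is only carried along so the original line can be fetched with `seek` after sorting.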
Re: Slow at sorting? by guha (Priest) on Nov 22, 2001 at 02:42 UTC
How about extending it to something like pack("n N n", $1, $2, $3).$4? That way we would sort CD# > 9 correctly, decrease the memory footprint, and allow for the fast intrinsic ASCIIbetical sort, perhaps even to the point where fast hash lookups can be used. It's a pity the data is in practice unavailable. Here's what I would like to time ...

    # ... (IN and OUT are assumed to be open already)
    my %sort_hash;
    my $offset = 0;
    while (<IN>) {
        # Capture CD number, file number, page number, and bracketed field name
        if ( m/^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/ ) {
            # Big-endian packed numbers compare correctly as plain strings
            $sort_hash{ pack("n N n", $1, $2, $3) . $4 } = $offset;
        }
        $offset = tell(IN);
    }
    # The default sort is a string comparison, which suits the packed keys
    foreach my $k (sort keys %sort_hash) {
        seek IN, $sort_hash{$k}, 0;
        print OUT scalar(<IN>);
    }
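To illustrate why the packed key sorts CD# > 9 correctly, here is a small sketch comparing a plain string sort with a sort on `pack("n N n", ...)` keys. The triples in `@triples` are made-up values chosen so that CD 10 and CD 9 both appear; they are not taken from the log.

```perl
use strict;
use warnings;

# Hypothetical (CD number, file number, page) triples.
my @triples = (
    [ 10, 1100809, 1  ],
    [ 9,  1100809, 20 ],
    [ 9,  1100809, 3  ],
    [ 2,  1100810, 1  ],
);

# "n" and "N" are big-endian 16- and 32-bit unsigned integers, so the
# byte-wise order of the packed strings matches the numeric order.
my @keys   = map { pack("n N n", @$_) } @triples;
my @packed = sort @keys;
my @order  = map { join('/', unpack("n N n", $_)) } @packed;

# For comparison: sorting the decimal text puts "10/..." before "2/...".
my @text  = map { join('/', @$_) } @triples;
my @naive = sort @text;

print "packed: @order\n";
print "naive:  @naive\n";
```

Each packed prefix is a fixed 8 bytes regardless of how many digits the numbers have, which is where the smaller memory footprint comes from.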
Re: Re: Slow at sorting? by petral (Curate) on Nov 22, 2001 at 03:56 UTC
And then, why not pack the offset on the end, and unpack substr -4 when you're done? Then you're down to a simple array sort, and the only problem is the original one: sooner or later you scale up until you're paging forever.

p
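A sketch of that variation, under some assumptions: the log is replaced by an in-memory string built from the sample lines in the original post, the lexical handle `$in` stands in for `IN`, and a `"\0"` separator is added between the field name and the offset (this is not part of the suggestion above; it keeps the offset bytes from affecting the order when one field name is a prefix of another). Packing the offset with `"N"` limits it to files under 4 GB.

```perl
use strict;
use warnings;

# In-memory stand-in for the real log file (sample lines from the thread).
my $log = join '', map { "$_\n" } (
    'CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".',
    'CD2\01100809.pdf(1) - [Invoice Date] Indexed key "10/08/2001".',
    'CD1\01100809.pdf(1) - [Customer Name] Indexed key "FOOBAR".',
    'CD2\01100809.pdf(1) - [Contact Name] Indexed key "Dr. FOO".',
    'CD4\01100809.pdf(20) - [Account Number] Indexed key "54356564".',
);
open my $in, '<', \$log or die "cannot open in-memory log: $!";

my @keys;
my $offset = 0;
while (my $line = <$in>) {
    if ($line =~ /^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/) {
        # Sort key, then a NUL separator (an addition, see above),
        # then the line's byte offset as the last 4 bytes.
        push @keys, pack("n N n", $1, $2, $3) . $4 . "\0" . pack("N", $offset);
    }
    $offset = tell($in);
}

my @sorted_lines;
for my $key (sort @keys) {
    # Recover the offset from the last 4 bytes of the key.
    my $pos = unpack("N", substr($key, -4));
    seek $in, $pos, 0;
    push @sorted_lines, scalar <$in>;
}

print @sorted_lines;
```

Because the offset is part of each key, two lines with identical sort fields no longer overwrite each other as they would in a hash, and no separate lookup structure is kept in memory.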
Re: Re: Re: Slow at sorting? by guha (Priest) on Nov 22, 2001 at 04:08 UTC

