 
PerlMonks  

Slow at sorting?

by orbital (Scribe)
on Nov 21, 2001 at 23:59 UTC ( [id://126837] )

orbital has asked for the wisdom of the Perl Monks concerning the following question:

After posting my example of a Schwartzian Transform in Complex Sorting Optimization? and asking for optimization suggestions, I made a few more discoveries.

I need to thank tye for his sorting alternative: building a single sort key and then sorting on that, instead of or'ing my sort priorities together. I instantly noticed a 4x performance gain working on a 3MB file.

This enlightenment from tye made me wonder what other ways there are to sort, and how I can sort faster. That is when I found A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler. This paper is incredible; if you have any questions about your own sorting techniques, go read it now!!!

Taking my new sense of enlightenment, I used tye's code example to try to sort a very large log file. My test logfile was 100MB. My first problem was memory: loading the whole log into memory eventually consumed all my resources (382 MB of RAM), since I was also building sort arrays and then an index array. So I made a very slight modification: instead of storing each matching line, the array stores the line's file offset. This saved quite a bit of memory overhead, at least enough to run the script without hitting virtual memory. Here is the new version of the code snippet:

my $infile  = $ARGV[0];
my $outfile = $infile;
$outfile =~ s/.{3}$/out/i;

open(IN,  $infile )     || die "can't open logfile: $infile\n";
open(OUT, ">$outfile" ) || die "can't create datfile: $outfile\n";

my ($offset, @sorton, @lines);
$offset = 0;
while (<IN>) {
    if( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) {
        push @sorton, $1.pack("N",$2).$3;   # composite sort key
        push @lines,  $offset;              # offset of this line, not the line itself
    }
    $offset = tell(IN);
}

my @idx = sort { $sorton[$a] cmp $sorton[$b] } 0..$#sorton;

for( @lines[@idx] ) {
    seek IN, $_, 0;
    print OUT scalar(<IN>);
}

Yes, I know I should be putting an exclusive lock on my file, especially since I am keying off the line offsets.

What I found from running this script on the large file was horrific: it took over 12 hours to sort this file! With a few very crude checks I found that the bottleneck was the line where I sort into @idx. All the steps above it complete within minutes, and the write step takes even less time than that. I guess my frustration is that this code sample was ported into C++, using a lot of the same techniques, and the whole 100MB file ran in 2 minutes!!!! That's a huge difference.

Does Perl's built-in sort have that much overhead? Would I be better off not using the built-in sort at all?

Replies are listed 'Best First'.
Re (tilly) 1: Slow at sorting?
by tilly (Archbishop) on Nov 22, 2001 at 02:06 UTC
    My recommendation for data of this size, if you want to do your own sorting, is to use DB_File's BTREE method. Right now tilly's scratchpad has the following snippet:
    use strict;
    use DB_File;

    my %sorted;
    $DB_BTREE->{compare} = sub {$_[1] cmp $_[0]};
    tie (%sorted, 'DB_File', undef, O_RDWR|O_CREAT, 0640, $DB_BTREE)
        or die $!;
    %sorted = 1..30;
    print map "$_\n", keys %sorted;
    which shows how to create an in-memory BTREE with an arbitrary sort order. If you follow tye's suggestion you can skip the arbitrary compare, use a temporary file rather than memory, make the keys be your sort key, and the values be your data. Just insert into the hash and then loop over it using the each construct. (Not keys, because that would cause a slurp.)
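
    A minimal sketch of that file-backed variant, assuming the same log format and composite key as in the root node (the temp filename is made up; duplicate keys would need handling, e.g. with R_DUP or by appending the offset to the key):

    use strict;
    use DB_File;
    use Fcntl;

    # file-backed BTREE: keys come back in sorted (string) order when iterated
    my %sorted;
    tie(%sorted, 'DB_File', 'sortkeys.btree', O_RDWR|O_CREAT, 0640, $DB_BTREE)
        or die "can't tie BTREE: $!";

    open(IN,  $ARGV[0])        or die "can't open logfile: $!";
    open(OUT, ">$ARGV[0].out") or die "can't create outfile: $!";

    while (<IN>) {
        if ( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) {
            $sorted{ $1 . pack("N", $2) . $3 } = $_;   # key => whole line
        }
    }

    # each() streams the pairs in key order without slurping all the keys
    while (my ($key, $line) = each %sorted) {
        print OUT $line;
    }
    close OUT;
    unlink 'sortkeys.btree';   # done with the temp file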

    If your OS has large file support, your limit is available disk space. If it does not have large file support, your limit is about 2 GB for the temp file, which is a data structure for (considering the BTREE overhead) somewhat over a GB of data. (I would guestimate about 1.3 GB.)

    There is also a File::Sort out there. I prefer the arbitrary text of DB_File rather than having the assumption of lines imposed on me. And if you want to sort data structures rather than text, be sure to look at MLDBM (and expect a big slowdown).

Re: Slow at sorting?
by dws (Chancellor) on Nov 22, 2001 at 00:09 UTC
    My test logfile was 100MB, my first problem was memory, loading the whole log into memory eventually sucked all my resources (382 MB of ram)

    Whoa! Even after you've made one pass to build your sort key, chances are quite good that you're falling prey to virtual memory thrashing. If that's the case, then the C++ code might well run faster on the 100MB sample (depending on how you code, C++ is going to have a lower memory footprint), but at some point you're going to throw more data at the C++ code, and it'll fall over, too. Do you have any visibility onto virtual memory behavior (e.g., can you get a running page-fault count?)

    As I see it, you have a few choices if you want to stay with Perl:

    1. Add more physical memory (which these days is an inexpensive proposition), or
    2. Break the file into smaller pieces, sort the pieces, then merge them later (a rough sketch follows below), or
    3. Do both
    Depending on your production dataset size, you may have the same options with C++.
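
    A rough sketch of option 2, assuming plain whole-line ordering and made-up chunk filenames; a real version would build the composite sort key instead:

    use strict;

    my $infile      = shift;
    my $chunk_lines = 200_000;            # assumed chunk size; tune to fit in RAM

    # pass 1: split the input into individually sorted chunk files
    my (@chunks, @buf);
    open(IN, $infile) or die "can't open $infile: $!";
    while (<IN>) {
        push @buf, $_;
        if (@buf >= $chunk_lines || eof(IN)) {
            my $name = "chunk_" . scalar(@chunks) . ".tmp";
            open(CHUNK, ">$name") or die "can't create $name: $!";
            print CHUNK sort @buf;        # in-memory sort of one chunk only
            close CHUNK;
            push @chunks, $name;
            @buf = ();
        }
    }
    close IN;

    # pass 2: merge the sorted chunks, always emitting the smallest current line
    my (@fh, @head);
    for my $i (0 .. $#chunks) {
        open($fh[$i], $chunks[$i]) or die "can't reopen $chunks[$i]: $!";
        $head[$i] = readline($fh[$i]);
    }
    open(OUT, ">$infile.sorted") or die "can't create $infile.sorted: $!";
    while (1) {
        my $min;
        for my $i (0 .. $#chunks) {
            next unless defined $head[$i];
            $min = $i if !defined $min || $head[$i] lt $head[$min];
        }
        last unless defined $min;         # every chunk is exhausted
        print OUT $head[$min];
        $head[$min] = readline($fh[$min]);
    }
    close OUT;
    unlink @chunks;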

Re: Slow at sorting?
by shotgunefx (Parson) on Nov 22, 2001 at 01:13 UTC
    Perl's sorting algorithm is basically a quicksort. I don't know exactly what your data looks like, but quicksort usually behaves very badly with nearly sorted data. This is for a logfile, correct? It's possible that it's nearly sorted.

    Another possibility. When faced with sorting an 800 Meg Search index, I gave up using pure Perl and just used system to punt it off to GNU sort. It went from over 8 hours to around 30 minutes. I had no desire to recreate the functionality of handling files this size. I just needed it to work.
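
    For reference, the punt can be as small as this (assuming GNU sort is on the path, whole-line ordering is what you want, and /tmp has room for its spill files):

    # let the external sort do the work; it handles its own temp-file merging
    system("sort", "-T", "/tmp", "-o", "$infile.sorted", $infile) == 0
        or die "external sort failed: $?";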

    You also might want to check out Mastering Algorithms with Perl (O'Reilly). It has a really good chapter on sorting.

    -Lee

    "To be civilized is to deny one's nature."

      Perl's sorting algorithm is a modified quicksort which chooses elements in a pattern which makes it fast on sorted data sets.

      There are slow patterns, which is why Perl 5.8 will default to a mergesort. But the slow patterns are a lot harder to hit than simply handing it already-sorted data.

        Out of curiosity, why mergesort over (for example) heapsort?

        --
        :wq
        Thanks tilly. Didn't know that. Learn something new every day.

        -Lee

        "To be civilized is to deny one's nature."
Re: Slow at sorting?
by mirod (Canon) on Nov 22, 2001 at 02:54 UTC

    Just one comment on sort efficiency: contrary to what a couple of posters seem to think, Perl sorts strings faster than numbers. The whole point of Uri and Larry's paper is that since the string sort is the default, hard-coded sort, it does not have to call any Perl subroutine, and thus works _real_ fast. So the trick to fast sorting is to be able to do just a sort @array, not a sort {$a <=> $b} @array.
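
    A tiny illustration of the point, packing numbers to fixed-width strings so the default cmp sort gives numeric order (the values are made up):

    my @nums = (1200, 7, 42, 99999);

    # slow: every comparison calls back into a Perl block
    my @by_block = sort { $a <=> $b } @nums;

    # fast: pack to fixed-width big-endian strings, default sort, unpack
    my @by_string = map  { unpack "N", $_ }
                    sort
                    map  { pack "N", $_ } @nums;
    # both arrays now hold (7, 42, 1200, 99999)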

    And BTW if you are looking for a module to do this automagically, well it looks like File::Records is looking for a new owner.

Re: Slow at sorting?
by clintp (Curate) on Nov 22, 2001 at 01:11 UTC
    A few thoughts
    • Your sample data set wasn't very large. Would it be possible for you to make even smaller sort keys? Getting the memory footprint of @sorton a bit smaller might help. Since the sort seems to be the killer, and not the preparation time, you might want to spend more time prepping the data.
    • Can you do that by making a completely numeric sort key? Except for the bracketed field ([]), everything else seems to be numeric, so you wouldn't need an alphabetic comparison here. Packing things down into a smaller value might prove useful. Watch for overflows.
    • At the very least, can you get that bracketed field down to a couple of characters instead of a long string?
    • How many records _are_ we talking about? Not "100MB" but how many lines of data?
    • (tiny optimization) Can you pack the offset into the sort key as well? Your sort would then be a plain sort @sorton instead of needing the indirection there (a rough sketch follows below).
    Kinda makes me wish I had the actual file to try this out....
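
    A rough sketch of that last bullet, assuming the offset fits in 32 bits and leaving the bracketed field as text (mapping it to a small dictionary number, as suggested above, would shrink the keys further):

    my (@sorton, $offset);
    $offset = 0;
    while (<IN>) {
        if ( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) {
            # key . 4-byte offset: the offset rides along, so a plain sort suffices
            push @sorton, $1 . pack("N", $2) . $3 . pack("N", $offset);
        }
        $offset = tell(IN);
    }
    for (sort @sorton) {
        # recover the offset from the last four bytes of the key
        seek IN, unpack("N", substr($_, -4)), 0;
        print OUT scalar(<IN>);
    }
    # caveat: with variable-length keys the trailing offset bytes can affect
    # the order when one key is a prefix of another
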
Re: Slow at sorting?
by orbital (Scribe) on Nov 22, 2001 at 00:18 UTC
    I know it looks like a case of thrashing on the HD, but this is simply not the case. Running this on a W2K Pro box, I watched the task manager and even set up process logging on the script. Not once did it use virtual memory; I peaked at 178MB of memory usage, which still left me with almost 100MB free with everything else running.

    I know the C++ code is going to be faster, but this seems like far too big a difference.

Re: Slow at sorting?
by guha (Priest) on Nov 22, 2001 at 00:29 UTC
    Could you post a few lines of the input file? That would at least give me something more to go on.

    Are, for instance, $1 and $3 of constant length?

    Update

    Oops, just checked the previous post and found it! :-\

Re: Slow at sorting?
by orbital (Scribe) on Nov 22, 2001 at 01:31 UTC
    clintp, sorry about just listing the file size; I should have known better. The log file I was testing on is a total of 1,401,986 lines, with 1,401,950 matching my criteria.

    I gave a sample of the matching lines in my previous post, but in case you missed it, here you go.

    CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".
    CD2\01100809.pdf(1) - [Invoice Date] Indexed key "10/08/2001".
    CD1\01100809.pdf(1) - [Customer Name] Indexed key "FOOBAR".
    CD2\01100809.pdf(1) - [Contact Name] Indexed key "Dr. FOO".
    CD4\01100809.pdf(20) - [Account Number] Indexed key "54356564".

    If you really want the full logfile I could strip out all sensitive data and throw it out somewhere where you can grab it, if this is something you really want to play with let me know...

      CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".

      According to an earlier post, the bold fields are the ones you're sorting on. If that's truly the case, and if the text within brackets comes from a limited dictionary, then this sort can be done entirely with numbers.

      First, combine the CD number with the file number, yielding (in this case) 101100809. Then map "Account Number" into its pre-determined sequence number. You're then left sorting something like [ 101100809, 47, seek-address ] on the first two fields, which are now numbers.

        How about extending it to something like

        pack("n N n", $1, $2, $3).$4

        thereby we would sort CD# > 9 correctly, decrease the memory footprint and allow for the fast intrinsic ASCIIbetical sort.
        Perhaps even to the point where fast hash lookups can be used.

        It's a pity the data is in practice unavailable.

        Here's what I would like to time ...

        .
        .
        my (%sort_hash);
        my $offset = 0;
        while (<IN>) {
            if( m/^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/ ) {
                $sort_hash{pack("n N n", $1, $2, $3).$4} = $offset;
            }
            $offset = tell(IN);
        }
        foreach my $k (sort keys %sort_hash) {
            seek IN, $sort_hash{$k}, 0;
            print OUT scalar(<IN>);
        }
Re: Slow at sorting?
by kwoff (Friar) on Nov 22, 2001 at 02:17 UTC
    Is it possible to put the log into a database table? In that case, put an index on whatever columns need sorting and forget about Perl sorting. Unless that kind of thing amuses you. :)
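
    A minimal sketch of that route with DBI (the DSN, credentials, table, and column names are all made up):

    use DBI;

    # hypothetical table log_lines(file_name, file_num, field_name, line)
    my $dbh = DBI->connect("dbi:mysql:logs", "user", "secret", { RaiseError => 1 });
    $dbh->do("CREATE INDEX idx_sort ON log_lines (file_name, file_num, field_name)");

    my $sth = $dbh->prepare(
        "SELECT line FROM log_lines ORDER BY file_name, file_num, field_name");
    $sth->execute;
    while (my ($line) = $sth->fetchrow_array) {
        print OUT $line, "\n";
    }
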
Re: Slow at sorting?
by guha (Priest) on Nov 23, 2001 at 20:06 UTC
    I have been thinking about the pros and cons of using hash or array.

    So I made myself some data with a script, approximating something like orbital's example, using the rand() function on the CD#, the pdf name, the digits in parentheses, and the different strings in brackets.
    I know it's not the real thing, but for what I'm trying to find out it seems OK to me.

    The two approaches I've tested are either using a hash, which is straightforward and easy on the programmer (see my previous post), or using an array with the filepos tucked onto the right side of the key (see petral's post).

    The tests were run on a measly Pentium 133 with 64MB RAM, and the results are in the table below (YMWV).

    The empty cells indicate heavy use of virtual memory.
    Impatience being a virtue, I only gave it double the expected time before I hit Ctrl-C.

    kLines   Filesize (MB)   Hash (sec)   Array (sec)
       100        6.16           11           12
       150        9.24           16           18
       175       10.78           19           22
       190       11.704          22           24
       200       12.32           23           25
       210       12.936          24           26
       220       13.552          25           27
       230       14.168          62           29
       250       15.4                         31
       300       18.48                        37
       350       21.56                        44
       400       24.64                       107
       500       30.8

    My interpretation is that if you've got the memory, then the hash method is slightly faster, but the array method will take you about twice as far in terms of possible file sizes.

    Thinking about it, it seems rather logical, considering that the hash stores both key and value whilst the array stores the value only.

    It's also nice to see that both variations behave linearly with increasing volume until VM sets in, just as it's written in A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler.
