Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Complex Sorting Optimization?

by orbital (Scribe)
on Nov 21, 2001 at 00:04 UTC ( [id://126609]=perlquestion: print w/replies, xml ) Need Help??

orbital has asked for the wisdom of the Perl Monks concerning the following question:

I am working with some log files that are being generated by a multi-threaded application. So my log file is not in any kind of sorted order.

I was able to figure out how I wanted to parse my data and sort it, I have no problems there. My problem lies in the performance of the sort that I created, lets take a look at the code:

@sortedarray= map{ $_->[0] } sort{ $a->[1] cmp $b->[1] || $a->[2] <=> $b->[2] || $a->[3] cmp $b->[3] } map { if ( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) { my ($disc_file,$page,$key,$val) = ($1,$2,$3,$4); [$_,$disc_file,$page,$key]; } } <DATA>; foreach(@sortedarray){ print "$_\n"; } __END__ CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654". CD2\01100809.pdf(1) - [Invoice Date] Indexed key "10/08/2001". CD1\01100809.pdf(1) - [Customer Name] Indexed key "FOOBAR". CD2\01100809.pdf(1) - [Contact Name] Indexed key "Dr. FOO". CD4\01100809.pdf(20) - [Account Number] Indexed key "54356564". CD4\01100809.pdf(20) - [Invoice Date] Indexed key "10/08/2001". CD1\01100809.pdf(20) - [Customer Name] Indexed key "FOOBAR". CD1\01100809.pdf(20) - [Contact Name] Indexed key "Dr. FOO". CD1\01100814.pdf(33) - [Account Number] Indexed key "56357576537". CD3\01100814.pdf(33) - [Invoice Date] Indexed key "10/08/2001". CD3\01100814.pdf(33) - [Customer Name] Indexed key "FOOBAR". CD1\01100814.pdf(33) - [Contact Name] Indexed key "Dr. FOO". CD2\01100813.pdf(27) - [Account Number] Indexed key "73677576". CD3\01100813.pdf(27) - [Invoice Date] Indexed key "10/08/2001". CD1\01100813.pdf(27) - [Customer Name] Indexed key "FOOBAR". CD3\01100813.pdf(27) - [Contact Name] Indexed key "Dr. FOO".

This code does exactly what I want it to accomplish, it ignores lines that don't match and sorts by my CD\filename then by page number and then by the keys being indexed.

The problems I am running into is with the speed of this sort (seems very slow 16sec on a 2.3MB file on a PIII 766Mhz) can I speed this code at all?

My other issues is with file size, the larger the logfile the more memory perl hogs. What kind of techniques can I use for sorting a huge file without taking a bunch of RAM in the process.

Replies are listed 'Best First'.
(tye)Re: Complex Sorting Optimization?
by tye (Sage) on Nov 21, 2001 at 00:24 UTC

    First, you need an "else { () }" so that you return an empty list when you don't match (right now you are returning a scalar false value instead, which is roughly an empty string). And $a being "" means that $a->[0] is accessing the array that has an empty name (which means either you aren't useing strict or you have no unmatching lines).

    If that change alone doesn't speed things up enough, then you can certainly do sorting faster than a Schwartzian Transform (creating all of those tiny anonymous arrays does require a bit of time and memory), though the ST is less prone to some common coding errors.

    my @sorton; my @lines= grep { if( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) { push @sorton, $1.pack("N",$2).$3; 1; } } <DATA>; my @idx= sort { $sorton[$a] cmp $sorton[$b] } 0..$#sorton; foreach( @lines[@idx] ) { print $_; }

            - tye (but my friends call me "Tye")
Re: Complex Sorting Optimization?
by dragonchild (Archbishop) on Nov 21, 2001 at 00:21 UTC
    I'm going to go out on a limb and say that your sort is just fine - it's your map that's the problem. Instead of
    map { if ( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) { my ($disc_file,$page,$key,$val) = ($1,$2,$3,$4); [$_,$disc_file,$page,$key]; } }
    Maybe try something like:
    map { m/^(.+?)\((\d+)\)\s-\s\[(.+?)\]/ ? [$_,$1, $2, $3] : () }
    The big thing is that you're doing unnecessary matches and variable creation/assignments. map is naturally more obfuscated than other operations. If you want to de-obfuscate it, consider changing to a multi-operation sequence.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: Complex Sorting Optimization?
by dash2 (Hermit) on Nov 21, 2001 at 00:10 UTC
    Meta-reply:

    use Devel::DProf and dprofpp.pl to find out what is slowing you down. (If need be, put each line into a separate subroutine so that DProf will analyse it.)

    #perl -DDProf script.cgi

    #dprofpp.pl

    Second meta-reply

    Maybe it would be better if you read only the new bits of the logfile. I don't know if this logfile keeps on growing, but if so, keep a count of the number of lines you've already processed, and append the results of your processing to a file. dave hj~

Re: Complex Sorting Optimization?
by dws (Chancellor) on Nov 21, 2001 at 11:06 UTC
    If $disc_file looks like "CD1\01100809.pdf", and the n in CDn is in 1..9, and the filenames are uniformly 8 digits long, you can turn the initial string comparison into a numeric comparison by concatenating the CD number with the file number for purposes of sorting. For example,   CD1\01100809.pdf becomes   102200809 Numeric comparisons are faster than string comparisons. If the CD number can be larger than 9, zero pad that part as necessary so that it is a uniform width.

Re: Complex Sorting Optimization?
by frankus (Priest) on Nov 21, 2001 at 00:25 UTC

    Quicky, I'm off home... use grep instead of the second map, ditch the variables and use [$_,$1,$2,$3,$4] also try o on the regex.

    hope this helps.

    --

    Brother Frankus.

    ¤

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://126609]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (4)
As of 2024-04-20 03:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found