arivu198314 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I'm looking for someone to help optimize my code.

input file

picturesque liar fight nor fly 24 hours love life wrinkled a million dollars love life even plan you attempt things a million dollars many hardships many hardships married head many hardships this year secret shame present apple pie many hardships elephant careful peace many hardships apple pie good afternoon seven levels

The task is to count how often each item occurs in the input and print the five most frequent items. For this input, the output should be

many hardships
love life
apple pie
a million dollars
this year

I have written a script, but it's taking some time to finish this task.

#!/usr/bin/perl
open(INP, '<', $ARGV[0]);
while ($item=<INP>) {
    chomp($item);
    (exists $query{$item}) ? ($query{$item}=$query{$item}+1) : ($query{$item}=1);
}
print join("\n", (sort {$query{$b}<=>$query{$a}} (keys %query))[0..4])."\n";

Basically, I want to know how to optimize this code for ninja speed.

Replies are listed 'Best First'.
Re: Speed up my code
by toolic (Bishop) on Dec 14, 2011 at 13:22 UTC
    I doubt it will have much impact on speed, but
    (exists $query{$item}) ? ($query{$item}=$query{$item}+1) : ($query{$item}=1);
    is more simply written as
    $query{$item}++;
    Did you profile your code (see perldoc perlrun)?
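
    For reference, here is a minimal sketch of the original script with that simplification applied, plus strict, warnings and a lexical filehandle. The sub names count_lines and top_by_sort are mine, not the OP's, and the demo reads from an in-memory filehandle with made-up lines; a real script would open $ARGV[0] instead. As said above, don't expect this alone to change the run time much.

```perl
use strict;
use warnings;

# Count how often each line occurs on the given filehandle.
sub count_lines {
    my ($fh) = @_;
    my %count;
    while ( my $item = <$fh> ) {
        chomp $item;
        $count{$item}++;    # a missing key starts out as undef and becomes 1
    }
    return \%count;
}

# Return up to $n keys with the highest counts, best first.
# The order among keys with equal counts is unspecified.
sub top_by_sort {
    my ( $count, $n ) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} } keys %$count;
    $#sorted = $n - 1 if @sorted > $n;    # truncate to at most $n items
    return @sorted;
}

# Demo data (made up for illustration), read through an in-memory filehandle.
my $sample = join '', map { "$_\n" }
    'love life', 'many hardships', 'apple pie',
    'many hardships', 'love life', 'many hardships';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my $count = count_lines($fh);
print "$_\n" for top_by_sort( $count, 2 );
# prints:
# many hardships
# love life
```

    Unlike the [0..4] slice in the original, the truncation here does not produce undef entries when the input has fewer than $n distinct lines.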
Re: Speed up my code
by BrowserUk (Patriarch) on Dec 14, 2011 at 13:46 UTC
    but it's taking some time to finish this task.

    It won't make any discernible difference on such a small dataset as your sample, but avoiding the sort for much larger datasets should be a win:

    #! perl -slw
    use strict;

    my %hash;
    ++$hash{ <> } until eof();

    my @top5;
    for( keys %hash ) {
        for my $i ( 0 .. 4 ) {
            if( !defined( $top5[ $i ] ) or $hash{ $_ } > $hash{ $top5[ $i ] } ) {
                splice @top5, $i, 0, $_;
                pop @top5 if @top5 > 5;
                last;
            }
        }
    }

    print @top5;

    __END__
    many hardships
    a million dollars
    love life
    apple pie
    you attempt things

    Note: the difference between my output and your expected is due to there being no clear winner for 5th place.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Much shorter :)

      $ perl -ne'$x{$_}++}END{print"$x{$_}\t$_"for(sort{$x{$b}<=>$x{$a}}keys%x)[0..4]' test.txt
      5 many hardships
      2 a million dollars
      2 love life
      2 apple pie
      1 you attempt things

      Enjoy, Have FUN! H.Merijn

      Update: Ignore this. The optimisation broke it.



Re: Speed up my code
by jethro (Monsignor) on Dec 14, 2011 at 13:35 UTC

    I assume your input file is much bigger in reality; otherwise I can't imagine this script taking a lot of time. In that case, sorting the whole %query hash just to get the 5 highest counts might be slow.

    Instead you could use this:

    my @result=([0,''],[0,''],[0,''],[0,''],[0,'']);
    foreach (keys %query) {
        my $sw;
        my $n= [$query{$_},$_];    # [count, item]
        # Compare counts (element 0 of each pair), starting at the lowest rank
        if ($n->[0]>$result[4][0]) {
            if ($n->[0]>$result[3][0]) {
                if ($n->[0]>$result[2][0]) {
                    if ($n->[0]>$result[1][0]) {
                        if ($n->[0]>$result[0][0]) {
                            $sw=$result[0]; $result[0]=$n; $n=$sw;
                        }
                        $sw=$result[1]; $result[1]=$n; $n=$sw;
                    }
                    $sw=$result[2]; $result[2]=$n; $n=$sw;
                }
                $sw=$result[3]; $result[3]=$n; $n=$sw;
            }
            $sw=$result[4]; $result[4]=$n; $n=$sw;
        }
    }
    print join("\n",map{$_->[1]},@result);

    The speedup of this somewhat ugly contraption comes from the expectation that, for a big hash, the first 'if' will very seldom be true, so most keys cost only a single comparison.
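
    The same idea can be sketched as a loop that works for any list size instead of a fixed five. This is only an illustration: the sub name top_n and the demo counts are hypothetical, and I have not benchmarked it against the unrolled version.

```perl
use strict;
use warnings;

# Return up to $n keys of %$count with the highest values, best first,
# without sorting the whole key list.
sub top_n {
    my ( $count, $n ) = @_;
    return if $n < 1;

    my @top;    # kept in descending count order, never longer than $n
    for my $key ( keys %$count ) {
        my $c = $count->{$key};

        # Cheap rejection: the list is full and this key does not beat
        # the current last entry. For a big hash this is the common case.
        next if @top == $n && $c <= $count->{ $top[-1] };

        # Otherwise find the insertion point and drop any overflow.
        my $i = 0;
        $i++ while $i < @top && $count->{ $top[$i] } >= $c;
        splice @top, $i, 0, $key;
        pop @top if @top > $n;
    }
    return @top;
}

# Made-up counts, chosen without ties so the result is deterministic.
my %demo = (
    'many hardships'    => 5,
    'a million dollars' => 3,
    'love life'         => 2,
    'present'           => 1,
);
print join( "\n", top_n( \%demo, 2 ) ), "\n";
# prints:
# many hardships
# a million dollars
```

    When several keys share a count, which of them ends up in the list depends on hash iteration order, the same caveat as with the fifth place in the sample data.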

Re: Speed up my code
by RichardK (Parson) on Dec 14, 2011 at 13:47 UTC

    In your data there are lots of lines that only appear once, so why is 'this year' the correct choice to fill out your top five?

      That's not important for the last one, since all the other items occur only once.

Re: Speed up my code
by locked_user sundialsvc4 (Abbot) on Dec 14, 2011 at 17:57 UTC

    It seems to me that a simple hash would do nicely.   Perl has many shortcuts to simplify this sort of thing, for example:

    my $foo = "bar";
    my $counts = {};
    $$counts{$foo}++;    # or $counts->{$foo}++;
    # yields: $$counts{"bar"} == 1
    You don’t have to worry if the string is already in the hash; you don’t have to initialize the bucket to zero.   If the bucket isn’t in there, it is automagically initialized with the value of '1.'   If it is, it’s incremented.   It Just Works.™

    There are many ways to get the final tallies out of the structure, depending on your needs.
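
    As one possible way, sketched here with made-up data and a hypothetical sub name (group_by_count), the tallies can be inverted so that each count maps to the list of items that have it. That also makes ties easy to see.

```perl
use strict;
use warnings;

# Invert the tallies: returns { count => [ items with that count, sorted ] }.
sub group_by_count {
    my ($counts) = @_;
    my %by_count;
    push @{ $by_count{ $counts->{$_} } }, $_ for sort keys %$counts;
    return \%by_count;
}

# Made-up sample items, counted into a hash reference.
my $counts = {};
$counts->{$_}++
    for 'apple pie', 'love life', 'this year', 'apple pie', 'love life';

my $groups = group_by_count($counts);
for my $n ( sort { $b <=> $a } keys %$groups ) {
    print "$n: ", join( ', ', @{ $groups->{$n} } ), "\n";
}
# prints:
# 2: apple pie, love life
# 1: this year
```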

    This straight-ahead strategy works excellently for any data volume that can be reasonably expected to fit entirely in memory without incurring page-faults.   (Which, given the beefy size of computers these days, is a pretty safe bet.)

    The original choice of Perl was that the word was an acronym for P(ractical | ragmatic) Extraction and Reporting Language, otherwise known as The Swiss Army® Knife.   And this is one of the reasons why.   Text-handling tasks, that lots of folks have to do lots of ... those bread-and-butter tasks ... are easy to code and ruggedly implemented.

      Way I heard it, the original choice of Perl by Larry was because he wanted to name his creation Pearl, but there was already a 'pearl' application on the system on which he was developing. The individual letters meant nothing. Perl was then backronymed as, among others, P(ractical | ragmatic) Extraction and Reporting Language.