doubleqq has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a question about a script that loops one list to look for values in another list. I am happy to say the script produces the correct results, but I am surprised at how long it takes. The first list is 72,000 lines long and it is searching through another list that is only 200 lines lines long. To process a 1.2MB list with a 102KB list, takes about 23 minutes. If that’s how long it takes that’s ok. But I have a strong suspicion that I entered some faulty logic here.

This is my code that works but runs slow:

use strict; use warnings; ….. #Array constains 200 elements, each element is a name and a title tab +separated @flab = [ ‘Name’ and a ’Title’ ] #open LONG_LIST of 72,000 lines for reading line by line open (LONG_LIST, $longList) or die "Cannot open file"; print RESULT “Company"."\t”.”Name"."\t”.”Title"."\n"; while (my $sLine = <LONG_LIST>){ chomp $sLine; #split LONG_LIST lines into two fields ‘Company' and ’Name' my @sTab = split (/\t/,$sLine); foreach (@flab){ my $line = $_; #split each @flab line into two fields ‘Name’ and ’Title' my @match = split (/\t/,$line); #match ’Name’ in LONG_LIST to ‘Name' in @flab if (($match[0] =~/\Q$sTab[1]\E/) and (define $match[1])){ #print ‘Company', ‘Name', ’Title' print RESULT $sTab[0]."\t".$sTab[1]."\t".$match[1]."\n"; } } } ….. __END__

No issues with warnings or strict compilation errors and as I mentioned it works. It produces a list of all the names with their companies and their title. It just takes really long. I suspect my logic is doing an unnecessary loop or processing that delays completion but I don’t know what I can change. Any feedback or tips on restructuring my code would be most appreciated.

Thank you!

Replies are listed 'Best First'.
Re: Script Executes Very Slowly
by NetWallah (Canon) on Aug 04, 2014 at 19:08 UTC
    The only "efficiency" factor I see is that you are unpacking/splitting the @flab lines each time you read from <LONG_LIST>.

    You should get that outside the loop, and perhaps save the contents of the split @flab into a hash, with the 'name' as the key.

    This will enable you to lookup $hash{$sTab[1]} , instead of your current regex compare.

    Please do not post word-processor mutilated, un-compilable code. That makes identifying your intent difficult.

            Profanity is the one language all programmers know best.

      Just a quick update: read the array into a hash and made some changes to the lookup. Using Benchmark, the execute time is.... 4 seconds! Results verified and everything looks copacetic. If I could I would buy you a beer.
        If I could I would buy you a beer.

        Or maybe Donate? (hint, hint)

      That must be it! I couldn't tell where I was performing a redundant step. Also you're suggestion of a hash is top of my list to do now. Apologies for the terrible formatting, I tried salvage it best I could, but thank you for the feedback nonetheless.
Re: Script Executes Very Slowly
by Old_Gray_Bear (Bishop) on Aug 04, 2014 at 19:25 UTC
    Well, my first thought was "This is a job for the NYT Profiler. Then I took a look at the code.

    You say there are 200 name-title combinations in @flab and 72,000 lines in the match file. That means that you are executing your foreach loop 1.4 MILLION times (7200 lines x 200 comparisons).

    Rewrite your code. Put the 200 entries into a hash keyed by "$name---$title". Then the entire loop collapses in to a single hash-lookup after extracting the fields and building the key.

    If you don't want to do such a major restructuring, at least insert last after you print the results. Once you have found the name, you don't need to check any further. Over a large number of searches this will reduce your search time by roughly half. (Making assumptions about normally distributed data, so your results may vary.) Note: This assumes that the 'name' fields are unique. If there can be multiple 'title's attached to the same 'name' you really do have to search the entire @flab array for each line. Sigh.

    ----
    I Go Back to Sleep, Now.

    OGB

Re: Script Executes Very Slowly
by roboticus (Chancellor) on Aug 04, 2014 at 19:45 UTC

    doubleqq:

    I didn't bother trying to build sample data to test with, so I'm gonna make a few guesses. Perhaps one or more may be helpful.

    First, in your inner loop, you're splitting out the @match array and *then* checking for a regular expression match to decide whether to print a row. I'd suggest checking with the regular expression first, modifying it a little to ensure that it stops at the tab, and capturing the data after the tab for printing, something like this:

    foreach my $line (@flab) { if ($line =~ /[^\t]*\Q$sTab[1]\E\t([^\t]+)/) { print RESULT "$sTab[0]\t$sTab[1]\t$2\n"; } }

    Next, if you're actually checking for an exact name, rather than using a regular expression, you could just use the eq operator. Even better, if you are checking for an exact name, then you can convert @flab into a hash, and do the lookup by the name:

    while (my $sLine = <LONG_LIST>) { chomp $sLine; my ($company, $name) = split /\t/, $sLine; if (exists $flab{$name}) { print RESULT "$company\t$name\t$flab{$name}{TITLE}\n"; } }

    If you *do* need to use a regular expression, then you may be better off creating a composite one from your @flab entries. That way you can reduce the number of regular expression searches you actually perform. If you also convert @flab into a hash, then you could do it (something) like:

    my $tmp = join("|", keys %flab); my $regex = qr/(?:$tmp)/; while (my $sLine = <LONG_LIST>) { chomp $sLine; if ($sLine =~ $regex) { my ($company, $name) = split /\t/, $sLine; print RESULT "$company\t$name\t$flab{$name}\n"; } }

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Script Executes Very Slowly
by AppleFritter (Vicar) on Aug 04, 2014 at 20:51 UTC

    Howdy doubleqq, welcome back to the Monastery!

    This is my code that works but runs slow:

    Are you sure that's your actual code, your working code? I'm asking because of this:

    print RESULT “Company"."\t”.”Name"."\t”.”Title"."\n";

    and this:

    if (($match[0] =~/\Q$sTab[1]\E/) and (define $match[1])){

    and aren't valid quoting characters in Perl, and define isn't a (built-in) function: you most likely meant defined there, but the uncaught typo makes me wonder.

    That said, here's two tips:

    • Don't do things in loops that you can do outside loops. In particular, pre-process your data as much as possible.
    • Use data structures that are a natural fit for what you're trying to accomplish.

    In your code, you're not just needlessly reprocessing your list of names and titles, you're doing so for every single last line read from your $longList. Don't do that; preprocess it into an efficient data structure, e.g. a hash, and do so outside the main loop. Better yet, have whatever generates this list generate a hash in the first place, if that is an option.

    For example, take the following:

    #!/usr/bin/perl use strict; use warnings; use feature qw/say/; # I'm assuming you can't control how @flab is populated. my @flab; push @flab, "name_$_\ttitle_$_" for (1..200); # turn @flab into a hash my %titles = (); foreach (@flab) { my ($name, $title) = split /\t/; $titles{$name} = $title; } open my $LONG_LIST, "<", "longlist.txt" or die "Cannot open file: $!"; say "Company\tName\tTitle"; while (<$LONG_LIST>) { chomp; my ($company, $name) = split /\t/;; say "$company\t$name\t$titles{$name}" if($titles{$name}); }

    To test this, I generated a longlist.txt file as follows:

    $ for i in `seq 1 72000`; do echo -e company_${i}\\tname_${i} >>longli +st.txt ; done

    The above script then runs in 0.122s (wallclock time) on my system, compared to about 13s for the version you posted above, a speedup of a factor of more than 100. With bigger @flabs, you'll see even more of a speedup.

    Other notes:

    • Your script as posted will not work correctly when you've got names that are contained in other names. For instance, if you've got both Fred Larry Flintstone and Larry Flint in your @flab, and your $longList contains an entry for the latter, you'll get output for both, which I assume is not what you want.
    • Your output columns won't line up if your company names/names/titles get too long; if that's undesired, you could use Text::Table or so to get nicer output with a minimum of effort.
Re: Script Executes Very Slowly
by neilwatson (Priest) on Aug 04, 2014 at 18:51 UTC

    Try some profiling to find the slow parts of you code.

    • http://perl.com/pub/2004/06/25/profiling.html
    • http://perldoc.perl.org/Benchmark.html

    Neil Watson
    watson-wilson.ca

Re: Script Executes Very Slowly
by oakb (Scribe) on Aug 04, 2014 at 19:33 UTC
    The problem is that you are processing (splitting) @flab lines 14.4 million (200 @flab lines x 72,000 LONG_LIST lines = 14.4 million splits) times, when you should process them only 200 times. You should pre-process the @flab lines and put them into a usable data structure so that, once you've gotten a line from LONG_LIST ready, it's just a straight-up comparison.

    I'm not going to write your code for you, but you basically need to move

    #split each @flab line into two fields ‘Name’ and ’Title' my @match = split (/\t/,$line);

    so that it happens before you enter your while loop to work on LONG_LIST. I'll be very surprised if that doesn't make a huge difference in the time it takes to run your script. It would be very much appreciated if you would let all of us know your results after the change.
Re: Script Executes Very Slowly
by Laurent_R (Canon) on Aug 04, 2014 at 23:37 UTC
    Last year, I made a talk at the French yearly Perl Conference where my subject was: "How to use efficiently Perl data structures to improve performance".

    I had 3 examples. The first one was about caching known results (i.e. storing results in a hash or an array to avoid having to calculate them again), the second one was an analysis of a i-phone game (where I could reduce a potential 1,350 billion possible games to only 2.3 million actual optimal games) and the third one was an actual professional problem where I was able to divide by a factor of 1400 the runtime of the crucial part of a program through a very careful use of about a dozen Perl (mostly nested) hashes. The overall program runtime fell from about 60 days to 13 hours, but the small part of the program that created the problem ran 1400 faster (from about 59.5 days to an hour).

Re: Script Executes Very Slowly
by Anonymous Monk on Aug 04, 2014 at 19:21 UTC

    Move as many operations as you can outside of the loop. At the moment that's mainly my @match = split (/\t/,$line); - you can split @flab before the main loop and store the results as an array of arrays, e.g. @flab = map {[split /\t/]} @flab;

    Also, I am curious about "$match[0] =~/\Q$sTab[1]\E/": is your intent really to ask "is the string $sTab[1] contained anywhere in the string $match[0]"? If so, then you could try and see if index is faster. If not, and you're looking for an exact match instead, then perhaps you should instead put all of @flab into a hash with the name as a key, and use that for your lookup.

Re: Script Executes Very Slowly
by neilwatson (Priest) on Aug 04, 2014 at 18:27 UTC

    Perhaps format your code better to fit on the page. The lack of white space and line wrapping make it hard read. Monks may skip this node.

    Neil Watson
    watson-wilson.ca

      Thank you, that formatting was terrible. I tried to fix it somewhat, hopefully it is easier to read now.
Re: Script Executes Very Slowly
by wjw (Priest) on Aug 06, 2014 at 03:37 UTC

    If this is a regular activity for you, I agree with what an Anonymous Monk mentioned previously. SQLite might be a good solution. Fast (for that number of records), simple, and well supported.

    If that 72k line file changes (additions, deletions) regularly, the DB update can be easily automated as well.

    Am kind of partial to SQLite for what I perceive you may want to be doing on a regular basis. 'Tis worth checking out at any rate.

    If it is a one-off... well you already have a solution provided by others here.

    ...the majority is always wrong, and always the last to know about it...

    Insanity: Doing the same thing over and over again and expecting different results...

    A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is a facct

Re: Script Executes Very Slowly
by Anonymous Monk on Aug 05, 2014 at 13:03 UTC
    If this is something you do very often, SQLite database-files are very handy: just put the two lists into tables and do an inner join. No servers are required, the tool is entirely public/free, and, if you pay attention to the need to use transactions when writing, the process can be extremely fast. You might be able to avoid writing "a custom program" in any language at all to get a job done.