Script Executes Very Slowly

doubleqq has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Script Executes Very Slowly by NetWallah (Canon) on Aug 04, 2014 at 19:08 UTC
The only "efficiency" factor I see is that you are unpacking/splitting the @flab lines each time you read from <LONG_LIST>. You should get that outside the loop, and perhaps save the contents of the split @flab into a hash, with the 'name' as the key. This will enable you to lookup $hash{$sTab[1]} , instead of your current regex compare. Please do not post word-processor mutilated, un-compilable code. That makes identifying your intent difficult. Profanity is the one language all programmers know best.	[reply]
Re^2: Script Executes Very Slowly by doubleqq (Acolyte) on Aug 04, 2014 at 19:33 UTC
Just a quick update: read the array into a hash and made some changes to the lookup. Using Benchmark, the execute time is.... 4 seconds! Results verified and everything looks copacetic. If I could I would buy you a beer.	[reply]
Re^3: Script Executes Very Slowly by AnomalousMonk (Archbishop) on Aug 04, 2014 at 22:07 UTC
If I could I would buy you a beer. Or maybe Donate? (hint, hint)	[reply]
Re^2: Script Executes Very Slowly by doubleqq (Acolyte) on Aug 04, 2014 at 19:13 UTC
That must be it! I couldn't tell where I was performing a redundant step. Also you're suggestion of a hash is top of my list to do now. Apologies for the terrible formatting, I tried salvage it best I could, but thank you for the feedback nonetheless.	[reply]
Re: Script Executes Very Slowly by Old_Gray_Bear (Bishop) on Aug 04, 2014 at 19:25 UTC
Well, my first thought was "This is a job for the NYT Profiler. Then I took a look at the code. You say there are 200 name-title combinations in `@flab` and 72,000 lines in the match file. That means that you are executing your `foreach` loop 1.4 MILLION times (7200 lines x 200 comparisons). Rewrite your code. Put the 200 entries into a hash keyed by "$name---$title". Then the entire loop collapses in to a single hash-lookup after extracting the fields and building the key. If you don't want to do such a major restructuring, at least insert `last` after you print the results. Once you have found the name, you don't need to check any further. Over a large number of searches this will reduce your search time by roughly half. (Making assumptions about normally distributed data, so your results may vary.) Note: This assumes that the 'name' fields are unique. If there can be multiple 'title's attached to the same 'name' you really do have to search the entire `@flab` array for each line. Sigh. ---- I Go Back to Sleep, Now. OGB	[reply] [d/l] [select]
Re: Script Executes Very Slowly by roboticus (Chancellor) on Aug 04, 2014 at 19:45 UTC
doubleqq: I didn't bother trying to build sample data to test with, so I'm gonna make a few guesses. Perhaps one or more may be helpful. First, in your inner loop, you're splitting out the @match array and then checking for a regular expression match to decide whether to print a row. I'd suggest checking with the regular expression first, modifying it a little to ensure that it stops at the tab, and capturing the data after the tab for printing, something like this: `foreach my $line (@flab) { if ($line =~ /[^\t]\Q$sTab[1]\E\t([^\t]+)/) { print RESULT "$sTab[0]\t$sTab[1]\t$2\n"; } }` [download] Next, if you're actually checking for an exact name, rather than using a regular expression, you could just use the eq operator. Even better, if you are checking for an exact name, then you can convert `@flab` into a hash, and do the lookup by the name: `while (my $sLine = <LONG_LIST>) { chomp $sLine; my ($company, $name) = split /\t/, $sLine; if (exists $flab{$name}) { print RESULT "$company\t$name\t$flab{$name}{TITLE}\n"; } }` [download] If you do* need to use a regular expression, then you may be better off creating a composite one from your @flab entries. That way you can reduce the number of regular expression searches you actually perform. If you also convert `@flab` into a hash, then you could do it (something) like: `my $tmp = join("\|", keys %flab); my $regex = qr/(?:$tmp)/; while (my $sLine = <LONG_LIST>) { chomp $sLine; if ($sLine =~ $regex) { my ($company, $name) = split /\t/, $sLine; print RESULT "$company\t$name\t$flab{$name}\n"; } }` [download] ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l] [select]
Re: Script Executes Very Slowly by AppleFritter (Vicar) on Aug 04, 2014 at 20:51 UTC
Howdy doubleqq, welcome back to the Monastery! This is my code that works but runs slow: Are you sure that's your actual code, your working code? I'm asking because of this: `print RESULT “Company"."\t”.”Name"."\t”.”Title"."\n";` [download] and this: `if (($match[0] =~/\Q$sTab[1]\E/) and (define $match[1])){` [download] `“` and `”` aren't valid quoting characters in Perl, and `define` isn't a (built-in) function: you most likely meant `defined` there, but the uncaught typo makes me wonder. That said, here's two tips: Don't do things in loops that you can do outside loops. In particular, pre-process your data as much as possible. Use data structures that are a natural fit for what you're trying to accomplish. In your code, you're not just needlessly reprocessing your list of names and titles, you're doing so for every single last line read from your `$longList`. Don't do that; preprocess it into an efficient data structure, e.g. a hash, and do so outside the main loop. Better yet, have whatever generates this list generate a hash in the first place, if that is an option. For example, take the following: #!/usr/bin/perl use strict; use warnings; use feature qw/say/; # I'm assuming you can't control how @flab is populated. my @flab; push @flab, "name_$_\ttitle_$_" for (1..200); # turn @flab into a hash my %titles = (); foreach (@flab) { my ($name, $title) = split /\t/; $titles{$name} = $title; } open my $LONG_LIST, "<", "longlist.txt" or die "Cannot open file: $!"; say "Company\tName\tTitle"; while (<$LONG_LIST>) { chomp; my ($company, $name) = split /\t/;; say "$company\t$name\t$titles{$name}" if($titles{$name}); } [download] To test this, I generated a `longlist.txt` file as follows: $ for i in `seq 1 72000`; do echo -e company_${i}\\tname_${i} >>longli +st.txt ; done [download] The above script then runs in 0.122s (wallclock time) on my system, compared to about 13s for the version you posted above, a speedup of a factor of more than 100. With bigger `@flab`s, you'll see even more of a speedup. Other notes: Your script as posted will not work correctly when you've got names that are contained in other names. For instance, if you've got both `Fred Larry Flintstone` and `Larry Flint` in your `@flab`, and your `$longList` contains an entry for the latter, you'll get output for both, which I assume is not what you want. Your output columns won't line up if your company names/names/titles get too long; if that's undesired, you could use Text::Table or so to get nicer output with a minimum of effort.	[reply] [d/l] [select]
Re: Script Executes Very Slowly by neilwatson (Priest) on Aug 04, 2014 at 18:51 UTC
Try some profiling to find the slow parts of you code. http://perl.com/pub/2004/06/25/profiling.html http://perldoc.perl.org/Benchmark.html Neil Watson watson-wilson.ca	[reply]
Re: Script Executes Very Slowly by oakb (Scribe) on Aug 04, 2014 at 19:33 UTC
The problem is that you are processing (`split`ting) @flab lines 14.4 million (200 @flab lines x 72,000 LONG_LIST lines = 14.4 million `split`s) times, when you should process them only 200 times. You should pre-process the @flab lines and put them into a usable data structure so that, once you've gotten a line from LONG_LIST ready, it's just a straight-up comparison. I'm not going to write your code for you, but you basically need to move `#split each @flab line into two fields ‘Name’ and ’Title' my @match = split (/\t/,$line);` [download] so that it happens before you enter your `while` loop to work on LONG_LIST. I'll be very surprised if that doesn't make a huge difference in the time it takes to run your script. It would be very much appreciated if you would let all of us know your results after the change.	[reply] [d/l] [select]
Re: Script Executes Very Slowly by Laurent_R (Canon) on Aug 04, 2014 at 23:37 UTC
Last year, I made a talk at the French yearly Perl Conference where my subject was: "How to use efficiently Perl data structures to improve performance". I had 3 examples. The first one was about caching known results (i.e. storing results in a hash or an array to avoid having to calculate them again), the second one was an analysis of a i-phone game (where I could reduce a potential 1,350 billion possible games to only 2.3 million actual optimal games) and the third one was an actual professional problem where I was able to divide by a factor of 1400 the runtime of the crucial part of a program through a very careful use of about a dozen Perl (mostly nested) hashes. The overall program runtime fell from about 60 days to 13 hours, but the small part of the program that created the problem ran 1400 faster (from about 59.5 days to an hour).	[reply]
Re: Script Executes Very Slowly by Anonymous Monk on Aug 04, 2014 at 19:21 UTC
Move as many operations as you can outside of the loop. At the moment that's mainly `my @match = split (/\t/,$line);` - you can `split` `@flab` before the main loop and store the results as an array of arrays, e.g. `@flab = map {[split /\t/]} @flab;` Also, I am curious about "`$match[0] =~/\Q$sTab[1]\E/`": is your intent really to ask "is the string `$sTab[1]` contained anywhere in the string `$match[0]`"? If so, then you could try and see if index is faster. If not, and you're looking for an exact match instead, then perhaps you should instead put all of `@flab` into a hash with the name as a key, and use that for your lookup.	[reply] [d/l] [select]
Re: Script Executes Very Slowly by neilwatson (Priest) on Aug 04, 2014 at 18:27 UTC
Perhaps format your code better to fit on the page. The lack of white space and line wrapping make it hard read. Monks may skip this node. Neil Watson watson-wilson.ca	[reply]
Re^2: Script Executes Very Slowly by doubleqq (Acolyte) on Aug 04, 2014 at 18:44 UTC
Thank you, that formatting was terrible. I tried to fix it somewhat, hopefully it is easier to read now.	[reply]
Re: Script Executes Very Slowly by wjw (Priest) on Aug 06, 2014 at 03:37 UTC
If this is a regular activity for you, I agree with what an Anonymous Monk mentioned previously. SQLite might be a good solution. Fast (for that number of records), simple, and well supported. If that 72k line file changes (additions, deletions) regularly, the DB update can be easily automated as well. Am kind of partial to SQLite for what I perceive you may want to be doing on a regular basis. 'Tis worth checking out at any rate. If it is a one-off... well you already have a solution provided by others here. ...the majority is always wrong, and always the last to know about it... Insanity: Doing the same thing over and over again and expecting different results... A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is a facct wjw	[reply]
Re: Script Executes Very Slowly by Anonymous Monk on Aug 05, 2014 at 13:03 UTC
If this is something you do very often, SQLite database-files are very handy: just put the two lists into tables and do an inner join. No servers are required, the tool is entirely public/free, and, if you pay attention to the need to use transactions when writing, the process can be extremely fast. You might be able to avoid writing "a custom program" in any language at all to get a job done.	[reply]
Re^2: Script Executes Very Slowly by Anonymous Monk on Aug 05, 2014 at 15:25 UTC
If that's you, sundialsvc4, you've forgotten to log in again. here too	[reply]