biograd has asked for the wisdom of the Perl Monks concerning the following question:
I've been working on a project for my degree, and it's about time to be done with the thing. This script is just one part of it. (See my bio for more info.) Anyway, I feel like I've never had my code looked over by someone who knows what they're doing, so I posted the db-loading program on my scratchpad, along with a sample of the gene data input. If you feel like you have some spare time, I'd appreciate some constructive criticism.
(I didn't use a generic module because I knew that wouldn't be in the spirit of the project as my advisor saw it. If you'd like to advocate your favorite module, I'll try to study it before the next time I have to do something like this. :) )
My main concern is that the program takes too long to run (14-17 hours for 26,000+ records). There is indexing on several of the fields. I've heard that if I didn't index while inserting data, everything might go faster. I've also heard that the indexing makes the multiple db searches needed during the program go faster, so any gain from indexing at the end is lost to the extra search time within the script. erg. As for the script, I did pass array references to the subroutines instead of copying arrays, but what are some other practical ways that you would optimize it for speed? It blazes on the first thousand records or so, then gets slower as it goes on. I expect that to some degree, since it has to search ever-increasing tables as it progresses, but is 14-17 hours realistic? I am seeing some material on optimizing the many regexes in Programming Perl (pp 594-9), but I'm not sure of the best way to apply it. I have read that tr/// is faster than s///, which I could use in a couple of places.
Thanks for any comments you can make. If you have questions or need more information, I'll oblige.
-Jenn
Replies are listed 'Best First'.
Re: Up for Critique
by dragonchild (Archbishop) on Mar 22, 2002 at 21:15 UTC | |
To answer your question - yes, there are a number of places you could optimize for speed (and maintainability!). ------ Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement. | [reply] [d/l] [select] |
by biograd (Friar) on Mar 23, 2002 at 06:52 UTC | |
You'll notice the date on the script is from last year. There are some things I do now that I didn't do then, such as initializing the locally scoped variables ("my") where they are used instead of way up at the top. Most of the suggestions I see in your note and others I just wouldn't have thought of yet though. So, thanks to all for the ton of good suggestions. I'm not sure how many I'll have time to implement before my deadline, but this is a great help for improving my skill level. -Jenn | [reply] |
Re: Up for Critique
by tadman (Prior) on Mar 22, 2002 at 21:07 UTC | |
Now, instead of using an "array" of variables, use a hash. This simplifies things massively: Now, I've just typed this into the editor here, so this is really just theory and not tested, but the idea is that this can be applied to a few other areas where you do the same thing. Further, instead of passing arguments back and forth in a particular fixed order, why not return a HASH-ref? This makes it easy to add new data to the return value without breaking other functions which use it. | [reply] [d/l] [select] |
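(The code tadman typed in isn't shown above, so this is only a rough sketch of the two ideas - one hash per record, and a hashref as the return value - with invented field names.)

```perl
use strict;
use warnings;

# Build one hash per record instead of a parallel set of scalars;
# callers then refer to fields by name rather than by position,
# and new fields can be added later without breaking anything.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my %rec;
    @rec{qw(locus gene_name product)} = split /\t/, $line;
    return \%rec;                # return a hashref, not a fixed-order list
}

my $rec = parse_record("AT1G01010\tNAC001\thypothetical protein");
print "locus is $rec->{locus}\n";
```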
by biograd (Friar) on Mar 23, 2002 at 06:43 UTC | |
However, I can see the utility of hashes from many of these examples, yours and others below, so I know I'll have to bite this bullet soon. Thanks for the example, tested or not. :) -Jenn | [reply] |
by jeffa (Bishop) on Mar 23, 2002 at 17:31 UTC | |
If the difference in speed is small, i say drop the arrays and use hashes ... but i am lazy. I have this aversion to something called 'typing'. ;) But, arrays are not always faster than hashes. It all depends upon context - if you have to scan the entire array to find something, a hash implementation will probably be faster. But if you have to iterate over all elements - then an array is probably the better choice. Consider the following Benchmark code:
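(jeffa's benchmark isn't shown above; a minimal sketch of the kind of comparison he describes - direct array indexing, hash lookup, and scanning an array - might be:)

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @array = (0 .. 999);
my %hash  = map { $_ => $_ } 0 .. 999;

cmpthese(-2, {
    array_index => sub { my $x = $array[500] },                   # direct subscript
    hash_lookup => sub { my $x = $hash{500} },                    # key lookup
    array_scan  => sub { my ($x) = grep { $_ == 500 } @array },   # linear search
});
```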
jeffa L-LL-L--L-LL-L--L-LL-L-- -R--R-RR-R--R-RR-R--R-RR B--B--B--B--B--B--B--B-- H---H---H---H---H---H--- (the triplet paradiddle with high-hat) | [reply] [d/l] |
Re: Up for Critique
by scain (Curate) on Mar 22, 2002 at 21:21 UTC | |
These are just a few quick comments after looking at your code: Anyway, it largely looks like good work. You should be very irritated with your university for not letting you take database design classes. Your advisor should be going to bat for you here. If you want some back-up from someone in industry, /msg me. Scott | [reply] |
Re: Up for Critique
by derby (Abbot) on Mar 22, 2002 at 21:16 UTC | |
Use references to pass into and receive back from your subroutines. Don't bother initializing them in the calling block. Also, take a look at your regexes, and try to reduce them to a more reasonable number.
You get the idea. Running multiple regexes against a single line can take forever. Try to reduce the regexes to a simpler one (or a few) that will run once against the line, and also make sure you use the 'o' flag so you're not spinning your wheels always recompiling the regexes. But I think all of those kinds of changes are going to have a minimal impact. You really need to concentrate on your db's performance and what's going on there. I'm not sure about MySQL, but some databases will commit "transactions" every so many inserts (does the mysql/innodb table type do that?). You can set up your driver to make that number larger and reduce the number of commits. There may be other stuff, but that's just one person's opinion. It's kinda hard to tell without being able to run the stuff. Have you tried profiling the script to see where you're spending all your time? At a minimum, you could liberally sprinkle Benchmark timings before and after sections of the code and then use timediff to see how long each snippet took. Good luck. ++ for an interesting post. -derby update: ferrency is right about the optimizations. Sorry for the mislead. | [reply] [d/l] [select] |
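For example (the substitutions below are invented, not taken from the script), several passes over a line can often be collapsed into a single one:

```perl
my $line = qq{   "gene description"   };

# Instead of three separate passes:
#   $line =~ s/^\s+//;    # strip leading whitespace
#   $line =~ s/\s+$//;    # strip trailing whitespace
#   $line =~ s/"//g;      # drop quote characters
# do it in one pass with a single alternation:
$line =~ s/^\s+|\s+$|"//g;

print "[$line]\n";        # prints [gene description]
```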
by ferrency (Deacon) on Mar 22, 2002 at 21:37 UTC | |
This regex may not do exactly what you want it to. Specifically, it's not going to remove trailing spaces from the line. The (.*) is greedy, and the following \s* allows a match of 0 whitespace characters, so the whitespace will always end up being matched inside the ()'s. There are many ways to fix this, but the easiest might be to simply match (\S*) instead of the '.', which some people believe should be avoided at all costs anyway. However, I like derby's solution: replace all of the regexes with one more general regular expression. Update: Looking more closely at your code, your biggest bottleneck is almost definitely in the database and not in the Perl. If you can avoid interspersing selects and inserts by keeping a copy of the data you've already inserted, you'd be able to leave the table indices off until the table is built, and gain a lot of speed. Alternately, 26k records aren't really that many... you might consider preprocessing the records in memory into a list or hash, and then inserting them all at once instead of processing the data in the order it appears in the input file. Alan | [reply] [d/l] [select] |
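A small illustration of the greedy-capture point (the pattern here is a stand-in, not the one from the script):

```perl
my $line = "some value   ";

$line =~ /^(.*)\s*$/;
print "[$1]\n";   # prints [some value   ] -- the greedy (.*) keeps the spaces

$line =~ /^(.*?)\s*$/;
print "[$1]\n";   # prints [some value]    -- non-greedy lets \s* take them
```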
by biograd (Friar) on Mar 23, 2002 at 07:13 UTC | |
If I get a chance to do a major re-work I will almost surely sort everything into memory first, then do a flurry of inserts/updates. I graduate this May, but I'm interested in getting this thing as fast as possible. I leave it in the hands of my advisor and his future grads to maintain and use (maybe as a cron job, or whenever TIGR sends out an update) and I'm having pity on them. ;) -Jenn | [reply] |
by biograd (Friar) on Mar 23, 2002 at 07:03 UTC | |
The MySQL dbms doesn't do transactions in the version we have. I think the newest version is incorporating them. Tables are of type "MYISAM". For any interested persons... version info: mysql Ver 10.12 Distrib 3.23.25-beta, for pc-linux-gnu (i686) -Jenn | [reply] |
Re: Up for Critique
by webadept (Pilgrim) on Mar 23, 2002 at 03:39 UTC | |
I'm going to use a simple data structure as an example here. If you create individual indexes on id_num, email, url and sent, and then run a typical query against the table, you aren't going to see a significant amount of speed; in fact things will probably be a bit slower. Indexes should be designed sparingly and to the application's need. If there is a query I'm going to use often, the index should be set up to match that query, and that will improve it by a noticeable amount. Also, any data table should have a primary key; I do this whether it needs it or not.

Indexing will speed up a database incredibly, but it needs to be done right and sparingly. I've seen some programs with indexes on every field, and that just makes two data tables - what's the sense in that? :-) A table without an index of some sort is just a collection of unordered rows. To find a row or set of rows in such a table, the engine has to examine every single row. Even if it finds the answer at the top, unless you've put a limit on your query, it's going to search through the rest of them anyway. This is wasted time and CPU power. With an index, however, the values are sorted and keyed to the table. The engine knows to stop looking once it finds the answer, because it knows anything that follows will not equal the query set. That's a huge advantage. Looking at your queries and finding the common keys in the WHERE clauses will help you build suitable indexes for those tables. If you are going to use a join - any type of join - those fields should always be indexed, otherwise even on small datasets your program's going to crawl.

Now, the disadvantage shows up when writing to tables with indexes. Here is where indexes work against you, because they have to be inserted and updated as well. Some hints on indexing and loading tables follow, in no particular order, just off the top of my head. Bulk loading is faster and better on indexed tables. Use LOAD DATA instead of INSERT whenever you can: index flushing happens less often and the server needs to parse and interpret one statement, not several. If you must use INSERT, then use the form that allows multiple rows to be specified in a single insert statement. The more rows you can specify in one statement, the better. One trick is to create the table, load all your data, and then add the indexes; this comes in handy. If you can't do that, drop the indexes, load the data, and then recreate them. Do this only if you can separate your program into a single load phase and then a query phase. Since your program is structured the way it is, this could be an option; on other programs it may not be, and probably won't be, because other users are on the database at the same time, and you'll get a lot of phone calls from angry accountants and beavers. The main question to ask yourself is "can I do this only once?" If the answer is no, then don't do it - it will slow your program down.

Hope this is some use to you... there's a lot of good stuff in this node and some of it I'm writing down and playing with, so thanks to all the rest of you. That's my two bits... what?... I'm over extended?... damn.

---UPDATE--- Something I just read seemed applicable, so I thought I would post it in here. According to the O'Reilly Mastering Algorithms with Perl book (page 26), "Efficiency Tip: Hashes Versus Arrays: It's about 30% faster to store data in an array than in a hash. It's about 20% faster to retrieve data from an array than from a hash." ---end update--- Glenn H. | [reply] [d/l] [select] |
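(The SQL snippets webadept refers to aren't shown above; this is a hedged reconstruction of the two main points, reusing the column names from the post - the table name, sample data, and DBI connection details are assumptions.)

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:testdb', 'user', 'password',
                       { RaiseError => 1 });

# 1. One compound index matched to the WHERE clause of a query you run
#    often, rather than a separate index on every column:
$dbh->do('CREATE INDEX email_sent ON people (email, sent)');

# 2. The multi-row form of INSERT: one statement to parse, and the
#    indexes are flushed far less often than with row-at-a-time inserts.
my @rows = (
    [ 1, 'jenn@example.edu', 'http://example.edu/~jenn', 1 ],
    [ 2, 'monk@example.org', 'http://example.org/monk',  0 ],
);
my $sql = 'INSERT INTO people (id_num, email, url, sent) VALUES '
        . join(', ', ('(?, ?, ?, ?)') x @rows);
$dbh->do($sql, undef, map { @$_ } @rows);
```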
Re: Up for Critique
by admiralh (Sexton) on Mar 23, 2002 at 04:40 UTC | |
Here are a few coding comments I have: | [reply] [d/l] [select] |
by davorg (Chancellor) on Mar 23, 2002 at 16:55 UTC | |
"Use $INPUT_RECORD_SEPARATOR instead of $/" - I was wondering why you thought this was a good idea. IMO the aliases from English.pm can end up making your code harder to follow - purely because they are used so rarely and most people are far more used to the punctuation variables. Also, there used to be a performance problem with use English: it made Perl think that you'd used $`, $& and $', which would slow down all regex matches in your code. I think this has been fixed in recent Perls, but I'm not sure exactly when (or to what extent). -- <http://www.dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." | [reply] |
by admiralh (Sexton) on Mar 24, 2002 at 00:26 UTC | |
Seriously, since this code was written to be looked at by other (non-Perl) types, I figured the long name would help. Now if it seriously causes performance problems, then by all means use $/, but comment what you are doing. And using local to scope the change is a good idea. BTW, that was my first post to Perl Monks. | [reply] |
by biograd (Friar) on Mar 23, 2002 at 07:39 UTC | |
Question: Is 'monitor' similar to the *nix shell command 'time' ? I just discovered that one. -Jenn | [reply] |
Re: Up for Critique
by Anonymous Monk on Mar 23, 2002 at 03:12 UTC | |
One wild suggestion ;-): don't underestimate the speed of tied DB hashes. As an entry example: $hash{"$nodename"} = $data, $hash{"$nodename" . 'description'} = $description, $hash{"$nodename" . 'date'} = $date (if you don't need to split upon retrieval), $hash{"$nodename" . 'date' . 'month'} = $datemonth (saves a split upon retrieval), etc. By implementing a pseudo Java-like namespace scheme you can make a very fast data entry and retrieval scheme. Not tested on this amount of data, but it certainly tests faster on around 10,000 records with many entries (scaled down from your figures). There are, as ever, downsides, such as having to write your own retrieval routines and then making them interoperable, or file locking problems. I believe newer versions of the BSD-DB have transaction support (if your comp dept supports them - they are not free). However you can use any_DB if multiple platform coverage is essential. Plus all of the above (and possibly what comes below too). HLA-B?? rocks! | [reply] |
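A minimal sketch of the tied-hash approach described, using DB_File (the file name, sample values, and the '.' separator in the keys are arbitrary choices):

```perl
use strict;
use warnings;
use Fcntl;
use DB_File;

# Tie an on-disk Berkeley DB hash; entries persist between runs.
tie my %gene, 'DB_File', 'genes.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "Cannot tie genes.db: $!";

my ($nodename, $data, $description, $date)
    = ('AT1G01010', 'raw record', 'hypothetical protein', '2002-03-22');

# Flat, namespace-style keys instead of nested structures:
$gene{$nodename}               = $data;
$gene{"$nodename.description"} = $description;
$gene{"$nodename.date"}        = $date;

print $gene{"$nodename.description"}, "\n";

untie %gene;
```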
Re: Up for Critique
by davorg (Chancellor) on Mar 23, 2002 at 17:01 UTC | |
In a couple of places you change the value of $/ to make it easier to read in your input records. In general, when changing any of Perl's "special" variables, it's a good idea to get into the habit of localising that change as much as possible. For example:
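(davorg's snippet isn't shown here, so the record separator and file name below are only illustrative.)

```perl
{
    # "//" on a line by itself ends a record -- an illustrative choice
    local $/ = "//\n";
    open my $fh, '<', 'genes.txt' or die "Cannot open genes.txt: $!";
    while (my $record = <$fh>) {
        chomp $record;            # removes the trailing "//\n"
        # ... process $record ...
    }
    close $fh;
}
# outside the block, $/ automatically has its previous value again
```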
With the change localised like that, you don't run the risk of forgetting what value is in the variable later on in your program. This is particularly important when you're writing modules for other people to use, as you don't want to be changing the value of these variables in someone else's code :) -- <http://www.dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." | [reply] [d/l] |
When in doubt, sort
by Thelonius (Priest) on Mar 23, 2002 at 03:46 UTC | |
If the database is empty before your run (it's not clear from your description), then you don't need to build the indexes until you are done loading. And if you sort the input data on its key first, any duplicates will be in consecutive records, so you will know when there are duplicates without having to query the database for them. | [reply] |
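A sketch of how that might look (the file name, key position, and use of the system sort are all assumptions):

```perl
use strict;
use warnings;

# Sort the input on its key field first, so duplicates become adjacent;
# then no SELECT is needed to spot them during the load.
# (The "\t" below interpolates to a real tab for sort's -t option.)
open my $in, "sort -t '\t' -k1,1 genes.txt |" or die "Cannot run sort: $!";

my $prev_key = '';
while (my $line = <$in>) {
    my ($key) = split /\t/, $line, 2;
    if ($key eq $prev_key) {
        warn "duplicate record for $key, skipping\n";
        next;
    }
    # ... INSERT $line into the (still unindexed) table here ...
    $prev_key = $key;
}
close $in;
# build the indexes once, after all the rows are loaded
```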
Re: Up for Critique
by dws (Chancellor) on Mar 23, 2002 at 20:43 UTC | |
I agree that a big chunk of time is going to database overhead. Assuming you have fields correctly indexed, I can think of two things that might help:
| [reply] [d/l] |
Re: Up for Critique
by jpm (Novice) on Mar 23, 2002 at 17:16 UTC | |
If we use a character per nucleotide, this is only 26M bytes. If these figures are low, we could use 2 bits per nucleotide (as there are only 4 different ones), giving a four-fold space saving over using whole bytes (you'd have to write pack and unpack functions). So I'd download the data from the database(s), do my processing in memory, store any results back, and display the results in some kind of report. Please contact me if you want to get into details (non-Perl related!). | [reply] |
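A rough sketch of what those pack/unpack functions could look like (the particular 2-bit encoding is an arbitrary choice):

```perl
use strict;
use warnings;

my %code   = ( A => 0, C => 1, G => 2, T => 3 );
my @decode = qw(A C G T);

# Pack a nucleotide string into a bit string at 2 bits per base, using vec().
sub pack_seq {
    my ($seq) = @_;
    my $bits = '';
    my $i    = 0;
    for my $base ( split //, uc $seq ) {
        vec( $bits, $i++, 2 ) = $code{$base};
    }
    return $bits;
}

# Unpack it again; the original length must be carried separately,
# since the bit string is padded out to a whole byte.
sub unpack_seq {
    my ( $bits, $len ) = @_;
    return join '', map { $decode[ vec( $bits, $_, 2 ) ] } 0 .. $len - 1;
}

my $seq    = 'ACGTACGT';
my $packed = pack_seq($seq);
printf "%d bases packed into %d bytes\n", length($seq), length($packed);
print unpack_seq( $packed, length $seq ), "\n";    # prints ACGTACGT
```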
Experiment!
by Thelonius (Priest) on Mar 23, 2002 at 23:20 UTC | |
First, make a copy of your program without any database calls and run it. It should run in a matter of minutes. That will confirm whether it is really the database part that is slowing things down. Another quick experiment is to create a database without indexes, and create a version of your program that does only INSERTs, no SELECTs or UPDATEs. If this runs fast, then you should go ahead and use my sort suggestion, above. Besides the sorting idea, there are some MySQL-specific things you can look into. There's a (non-standard) REPLACE statement that I think you can use instead of using SELECT followed by UPDATE or INSERT. Depending on version, there is also a LOAD DATA INFILE statement. | [reply] |
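A hedged sketch of the REPLACE idea through DBI (the table and column names are invented; REPLACE needs a PRIMARY KEY or UNIQUE index on the table to decide what counts as a duplicate):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:thesisdb', 'user', 'password',
                       { RaiseError => 1 });

# One statement instead of SELECT followed by INSERT or UPDATE: if a row
# with the same unique key already exists it is replaced, otherwise a
# new row is inserted.
my $sth = $dbh->prepare(
    'REPLACE INTO gene (locus, product, ec_num) VALUES (?, ?, ?)'
);
$sth->execute('AT1G01010', 'hypothetical protein', undef);

$dbh->disconnect;
```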
Re: Up for Critique
by Anonymous Monk on Mar 23, 2002 at 05:06 UTC | |
If you can control the server, it's often a lot cheaper to upgrade the machine than to spend weeks squeezing better performance out of an older one. | [reply] |