Yeah, that's less code, but as you pointed out not much faster... :-(
Is this basically "as fast as it gets" when parsing this type of file? I have the option of creating some other type of datastore (as long as it can be stored in a file). Would a tied dbm file possibly be faster? (I plan on testing this, but I must admit I'm not overly optimistic.)
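Something like this is what I had in mind, using the core DB_File module (the filename and keys here are just placeholders):

    use strict;
    use warnings;
    use Fcntl;      # for O_CREAT, O_RDWR
    use DB_File;

    # Tie a hash to an on-disk dbm file; lookups and stores hit the
    # file directly instead of re-parsing a flat file on every run.
    tie my %db, 'DB_File', 'lookup.db', O_CREAT|O_RDWR, 0644, $DB_HASH
        or die "tie: $!";

    $db{some_key} = 'some_value';    # written straight to disk
    print $db{some_key}, "\n";       # keyed lookup, no parsing

    untie %db;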
- Greg
There are a couple of problems with your current "flat file" solution (and with any flat-file alternative). First, you are reading it in line by line. Sure, you could slurp the whole file instead, but even then the speed difference isn't huge, and you run into scalability problems as the file grows.
Second, you're splitting lines on various delimiters. split takes a regexp (or a regexp-like pattern), which means revving up the regexp engine; it's well optimized, but it still isn't as fast as the non-regexp alternatives. The problem is that your current file format doesn't lend itself well to non-regexp (and non-split) approaches. For concreteness, the sketch below stands in for the kind of loop I mean.
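(Your actual code isn't shown, so this assumes one record per line with tab-separated key/value pairs:)

    use strict;
    use warnings;

    my %data;
    open my $fh, '<', 'data.txt' or die "open: $!";
    while (my $line = <$fh>) {    # one pass, line by line
        chomp $line;
        # split fires up the regexp engine once per line
        my ($key, $value) = split /\t/, $line, 2;
        $data{$key} = $value;
    }
    close $fh;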
If you have control over the datasource, there are a few things you can do for better speed.
One suggestion: instead of "\n"-delimited records and "\t"-delimited key/value pairs, go with a more "regular" format. One possibility is fixed-width fields. With that sort of layout you can unpack each record, which is faster than splitting on a regexp. If each "line" (record) is of equal byte length, and each key/value within it is of fixed width, you can use seek and tell to jump around in the file and unpack to grab the keys/values from each record; see the sketch below. Within Perl, that's pretty hard to beat for speed.
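A minimal sketch, with made-up field widths (a 16-byte key and a 16-byte value per 32-byte record):

    use strict;
    use warnings;

    my $rec_len = 32;    # bytes per record (key + value, no "\n")

    open my $fh, '<', 'data.fixed' or die "open: $!";
    binmode $fh;

    # Jump straight to the Nth record -- no scanning, no regexps.
    my $n = 5;
    seek $fh, $n * $rec_len, 0 or die "seek: $!";
    read $fh, my $record, $rec_len;

    # 'A16 A16' pulls out both fields, stripping trailing spaces.
    my ($key, $value) = unpack 'A16 A16', $record;
    print "$key => $value\n";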
Another possibility is to abandon the flat file and go with a database. You mentioned that you want to keep all your data in a single file; no problem. Use DBD::SQLite. It's a fast database implementation that stores everything in one file. There is database overhead to consider, but scalability is good, and you don't have to be as careful about maintaining equal-byte-length records with fixed-width fields.
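For example (the table and column names here are placeholders, not anything from your code):

    use strict;
    use warnings;
    use DBI;

    # One file on disk, full SQL on top of it.
    my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)');

    # A prepared statement keeps repeated keyed lookups cheap.
    my $sth = $dbh->prepare('SELECT v FROM kv WHERE k = ?');
    $sth->execute('some_key');
    my ($value) = $sth->fetchrow_array;
    print defined $value ? "$value\n" : "not found\n";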
And yet another possibility is to use the Storable module to freeze and thaw your data structures. The module is written in XS (if I'm not mistaken) and already optimized for speed. It's not as scalable a solution, since the whole structure has to be thawed back into memory, but the speed is pretty good.
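Something along these lines (the filename is made up):

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    my %data = ( alpha => 1, beta => 2 );

    # Freeze the whole structure to disk in Storable's binary format...
    store \%data, 'data.stor';

    # ...and thaw it back in one shot. The whole thing has to fit in
    # memory, which is where the scalability limit comes from.
    my $restored = retrieve 'data.stor';
    print $restored->{beta}, "\n";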
Can you back up your assertion that revving up the regexp engine causes a significant slowdown compared to non-regexp alternatives?
My understanding is that the regular expression engine recognizes when a pattern just searches for a constant string, and switches to the same Boyer-Moore optimization that index uses. Solutions like unpack can win because they get rid of loops written in Perl, not because they walk through the string significantly faster.
Yes, there is overhead to regular expressions, but it is truly marginal.
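Easy enough to measure with the core Benchmark module; here's a rough sketch (the haystack is made up, so try it on your own data):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $line = ('x' x 200) . 'needle' . ('x' x 200);

    # Compare a constant-string regexp match against index() --
    # run each for at least 2 CPU seconds and print relative rates.
    cmpthese(-2, {
        regexp => sub { my $hit = $line =~ /needle/ },
        index  => sub { my $hit = index($line, 'needle') >= 0 },
    });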
Your database (whether an RDBMS, a tied hash, or whatever) will be the most expensive thing you do, regardless.