in reply to Re: Re: Need to process a tab delimted file *FAST*
in thread Need to process a tab delimted file *FAST*
Second, you're splitting lines on various things. split uses a regexp-like thingy, and that means revving up the regexp engine, which, while well optimized, isn't as fast as non-regexp alternatives. The problem is that your current file format doesn't lend itself well to non-regexp (and non-split) alternatives.
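(For the simplest case, where each record is just one key/value pair, index and substr can sidestep the regexp engine entirely. A minimal sketch, not from the original code; the filename and hash name are placeholders:)

```perl
use strict;
use warnings;

# Hypothetical input: one "key\tvalue" pair per line.
open my $fh, '<', 'data.tsv' or die "Can't open data.tsv: $!";

my %pairs;
while ( my $line = <$fh> ) {
    chomp $line;
    my $tab = index $line, "\t";    # locate the delimiter without a regexp
    next if $tab < 0;               # skip lines with no tab at all
    $pairs{ substr $line, 0, $tab } = substr $line, $tab + 1;
}
close $fh;
```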
If you have control over the datasource, there are a few things you can do for better speed.
One suggestion: instead of using "\n"-delimited records and "\t"-delimited key/value pairs, go with a more "regular" format. One possibility would be fixed-width fields. With that sort of solution, at least you can unpack each record, which is going to be faster than splitting on a RE. If each "line" (or record) is of equal byte length, and each key/value within each record is of fixed width, you can use seek and tell to jump around in the file, and unpack to grab the keys/values from each record. It's pretty hard to beat that for speed, within Perl.
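A minimal sketch of that idea (the 10-byte key and 22-byte value widths are invented for illustration, as is the filename):

```perl
use strict;
use warnings;

# Assumed layout: 10-byte key + 22-byte value + "\n" = 33 bytes per record.
my $rec_len = 33;
my $want    = 41;    # zero-based index of the record to fetch

open my $fh, '<', 'data.fixed' or die "Can't open data.fixed: $!";
seek $fh, $want * $rec_len, 0;    # jump straight to the record's offset
read( $fh, my $record, $rec_len )
    or die "record $want is past the end of the file";

my ( $key, $value ) = unpack 'A10 A22', $record;   # 'A' trims trailing spaces
print "$key => $value\n";
close $fh;
```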
Another possibility is to abandon the flat file and go with a database. You mentioned that you wanted to maintain a single file for your data, though. Ok, no problem. Use DBD::SQLite. It is a pretty fast database implementation that stores the entire database in a single file. There is database overhead to consider, but scalability is good, and you don't need to be as careful about maintaining equal-byte-length records with fixed-width fields.
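Something along these lines (the table and column names here are hypothetical):

```perl
use strict;
use warnings;
use DBI;    # requires DBD::SQLite

# The single file 'data.db' holds the entire database.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=data.db', '', '',
    { RaiseError => 1 } );

$dbh->do('CREATE TABLE IF NOT EXISTS pairs (k TEXT PRIMARY KEY, v TEXT)');

# Store a key/value pair, then look it up again.
$dbh->do( 'INSERT OR REPLACE INTO pairs (k, v) VALUES (?, ?)',
    undef, 'some_key', 'some_value' );

my ($v) = $dbh->selectrow_array( 'SELECT v FROM pairs WHERE k = ?',
    undef, 'some_key' );
print "$v\n";

$dbh->disconnect;
```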
And yet another possibility is to use the Storable module to freeze and thaw your data structures. The module is written in XS (if I'm not mistaken) and is already optimized for speed. It's not as scalable a solution, but speed is pretty good.
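A minimal sketch using Storable's store/retrieve convenience wrappers (the filename is a placeholder):

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

my %data = ( apple => 1, banana => 2 );

# Freeze the whole hash to a single file; nstore writes in network
# byte order, so the file stays portable across machines.
nstore( \%data, 'data.stor' ) or die "nstore failed: $!";

# Thaw it back as a hash reference.
my $thawed = retrieve('data.stor');
print "$thawed->{banana}\n";    # prints 2
```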
Dave
Replies are listed 'Best First'.
- Re: Re: Re: Re: Need to process a tab delimted file *FAST* by tilly (Archbishop) on Mar 03, 2004 at 15:17 UTC
- by davido (Cardinal) on Mar 03, 2004 at 18:19 UTC