in reply to Re: Re: Re: Need to process a tab delimted file *FAST*
in thread Need to process a tab delimted file *FAST*

Can you back up your assertion that revving up the regexp engine causes a significant slowdown compared to non-regexp alternatives?

My understanding is that the regular expression engine will recognize when a regular expression just searches for a constant string and switch to the same Boyer-Moore optimization that index uses. Solutions like unpack can win because they get rid of loops written in Perl, not because they walk through the string significantly faster.

Yes, there is overhead to regular expressions, but it is truly marginal.
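
One way to check this claim on a given machine is a small benchmark along the following lines. This is only a sketch: the test line, the needle, and the sub names find_index and find_regex are made up for illustration, and the numbers will vary with the Perl version and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: a long tab-delimited line with the needle near the end.
my $line   = join( "\t", ('field') x 200 ) . "\tNEEDLE\ttail";
my $needle = 'NEEDLE';

# Find the needle with index(); returns the offset, or -1 if absent.
sub find_index { return index( $line, $needle ) }

# Find the needle with a constant-string regex; $-[0] is the match start offset.
sub find_regex { return $line =~ /NEEDLE/ ? $-[0] : -1 }

# Both approaches must agree before their speed is worth comparing.
die "index and regex disagree" unless find_index() == find_regex();

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese( -1, { INDEX => \&find_index, REGEX => \&find_regex } );
```

Checking that both subs return the same offset first guards against benchmarking two things that do different work.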

Re: Re: Re: Re: Re: Need to process a tab delimted file *FAST*
by davido (Cardinal) on Mar 03, 2004 at 18:19 UTC
    You know what, tilly is right. I get into trouble every time I assume that unpack will do its job faster than a regexp.

    Here's a test snippet proving tilly's point and disproving my earlier assertion:

    use strict;
    use warnings;
    use Benchmark;

    my @stuff;
    while ( my $line = <DATA> ) {
        chomp $line;
        push @stuff, $line;
    }

    sub do_split {
        my @parsed;
        push( @parsed, split( /\|/, $_ ) ) for @stuff;
        return scalar @parsed;
    }

    sub do_unpack {
        my @parsed;
        push( @parsed, unpack( 'a5xa5xa5xa5xa5xa5xa5xa5xa5xa5x', $_ ) ) for @stuff;
        return scalar @parsed;
    }

    my $count = 100000;
    timethese ( $count, { SPLIT => \&do_split, UNPACK => \&do_unpack } );

    __DATA__
    AAAAA|BBBBB|CCCCC|DDDDD|EEEEE|11111|22222|33333|44444|55555|
    FFFFF|GGGGG|HHHHH|IIIII|JJJJJ|11111|22222|33333|44444|55555|
    KKKKK|LLLLL|MMMMM|NNNNN|OOOOO|11111|22222|33333|44444|55555|
    PPPPP|QQQQQ|RRRRR|SSSSS|TTTTT|11111|22222|33333|44444|55555|
    UUUUU|VVVVV|WWWWW|XXXXX|YYYYY|11111|22222|33333|44444|55555|
    ZZZZZ|aaaaa|bbbbb|ccccc|ddddd|11111|22222|33333|44444|55555|
    AAAAA|BBBBB|CCCCC|DDDDD|EEEEE|11111|22222|33333|44444|55555|
    FFFFF|GGGGG|HHHHH|IIIII|JJJJJ|11111|22222|33333|44444|55555|
    KKKKK|LLLLL|MMMMM|NNNNN|OOOOO|11111|22222|33333|44444|55555|
    PPPPP|QQQQQ|RRRRR|SSSSS|TTTTT|11111|22222|33333|44444|55555|
    UUUUU|VVVVV|WWWWW|XXXXX|YYYYY|11111|22222|33333|44444|55555|
    ZZZZZ|aaaaa|bbbbb|ccccc|ddddd|11111|22222|33333|44444|55555|

    The output:

    Benchmark: timing 100000 iterations of SPLIT, UNPACK...
         SPLIT: 25 wallclock secs (23.92 usr + 0.01 sys = 23.93 CPU) @ 4178.16/s (n=100000)
        UNPACK: 26 wallclock secs (25.62 usr + 0.01 sys = 25.63 CPU) @ 3902.13/s (n=100000)
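
Before reading much into timings like these, it may be worth confirming that both approaches parse a record into the same fields. The following is a rough sketch of such a check, assuming one sample record in the same fixed-width, pipe-delimited layout; the variable names are illustrative only.

```perl
use strict;
use warnings;

# One sample record in the same layout as the benchmark data.
my $line = 'AAAAA|BBBBB|CCCCC|DDDDD|EEEEE|11111|22222|33333|44444|55555|';

# split drops the trailing empty field produced by the final '|'.
my @by_split = split /\|/, $line;

# 'a5x' reads five bytes and skips one delimiter byte, repeated ten times.
my @by_unpack = unpack( 'a5x' x 10, $line );

# Compare the field counts first, then the fields one by one.
my $same = @by_split == @by_unpack;
for my $i ( 0 .. $#by_split ) {
    $same &&= $by_split[$i] eq $by_unpack[$i];
}

print $same ? "identical fields\n" : "fields differ\n";
```

Note that the unpack template only works because every field is exactly five bytes wide; split makes no such assumption, so the two are interchangeable only for data shaped like this sample.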


    Dave