in reply to What is the most efficient way to split a long string (see body for details/constraints)?

split is pretty darn fast. But when in doubt, benchmark it:

#!/usr/bin/env perl
use warnings;
use strict;
use List::Util qw/max/;
use Benchmark qw/cmpthese/;
use constant WITHTEST => 0;

my $cols = 32;
my $row  = join "\t", map { sprintf("%02d",$_) x 16 } 0..($cols-1);
my $data = ( $row . "\n" ) x 100;
open my $fh, '<', \$data or die $!;

my @wanted = (2,3,12..18,25..28,31);
#my @wanted = (2,3,10..15);
my $wanted_max = max @wanted;
my @wanted2 = (0) x $cols;
@wanted2[@wanted] = (1) x @wanted;
my ($wanted_re) = map { qr/\A$_\n?\z/ }
    join '\t', map { $_ ? '([^\t\n]++)' : '[^\t\n]++' } @wanted2;
my $expect = join "\t", map { sprintf("%02d",$_) x 16 } @wanted;

cmpthese(-2, {
    split => sub {
        seek $fh, 0, 0 or die;
        while (<$fh>) {
            chomp;
            my @sel = (split /\t/, $_, $cols)[@wanted];
            if (WITHTEST) {
                die "@sel\n$expect\n" unless join("\t",@sel) eq $expect
            }
        }
    },
    scan => sub {
        seek $fh, 0, 0 or die;
        while (<$fh>) {
            chomp;
            my ($pos,$i,$prevpos,@sel) = (0,0);
            while ( $pos>=0 && $i<=$wanted_max ) {
                $prevpos = $pos;
                $pos = index($_, "\t", $pos+1);
                push @sel, substr($_, $prevpos+1,
                        ($pos<0 ? length : $pos)-$prevpos-1 )
                    if $wanted2[$i++];
            }
            if (WITHTEST) {
                die "@sel\n$expect\n" unless join("\t",@sel) eq $expect
            }
        }
    },
    regex => sub {
        seek $fh, 0, 0 or die;
        while (<$fh>) {
            my @sel = /$wanted_re/ or die $_;
            if (WITHTEST) {
                die "@sel\n$expect\n" unless join("\t",@sel) eq $expect
            }
        }
    },
    fh => sub {
        seek $fh, 0, 0 or die;
        while ( my $line = <$fh> ) {
            chomp($line);
            open my $fh2, '<', \$line or die $!;
            local $/ = "\t";
            my @sel;
            for my $i (0..$wanted_max) {
                my $d = <$fh2>;
                next unless $wanted2[$i];
                chomp $d;
                push @sel, $d;
            }
            close $fh2;
            if (WITHTEST) {
                die "@sel\n$expect\n" unless join("\t",@sel) eq $expect
            }
        }
    },
});

__END__
         Rate regex    fh  scan split
regex  1456/s    --  -11%  -13%  -68%
fh     1643/s   13%    --   -1%  -64%
scan   1665/s   14%    1%    --  -64%
split  4586/s  215%  179%  175%    --

Remember that split is implemented internally in C, while the scan above is implemented in Perl. You could probably beat split by implementing something like this in C, but whether that's worth the effort depends on how much more speed you need.

Update: For the sake of completeness: I'm not surprised the regex solution is slower; regexes are very powerful, but fixed-string operations often outperform them. I added the fh solution because it was the fastest approach in the thread I linked to above.
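For illustration only, here is a rough, untested sketch of what pushing the scan down into C via Inline::C might look like. The function name c_split_tabs is made up, and to really compete with split a C version should probably pick out only the wanted columns in the same pass instead of returning every field:

#!/usr/bin/env perl
use warnings;
use strict;

use Inline C => <<'END_C';
#include <string.h>

/* Hypothetical sketch: split a tab-separated line into its fields at
 * C level and return them as a list via the Inline stack macros. */
void c_split_tabs(char *line) {
    Inline_Stack_Vars;
    char *start = line;
    char *tab;
    Inline_Stack_Reset;
    while ((tab = strchr(start, '\t')) != NULL) {
        Inline_Stack_Push(sv_2mortal(newSVpvn(start, tab - start)));
        start = tab + 1;
    }
    Inline_Stack_Push(sv_2mortal(newSVpv(start, 0)));  /* last field */
    Inline_Stack_Done;
}
END_C

my @fields = c_split_tabs("aa\tbb\tcc\tdd");
my @sel    = @fields[1, 3];   # slice out the wanted columns, as before
print "@sel\n";               # prints "bb dd"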


Re^2: What is the most efficient way to split a long string (see body for details/constraints)?
by mikegold10 (Acolyte) on Jun 21, 2019 at 18:55 UTC

    I wonder why the file handle approach (fh) is slower here, when it was faster in this test:

    Re: Is foreach split Optimized?

    Benchmark from post linked above:

    Test Code

    filehandle => sub {
        my @lines;
        open my $str_fh, "<", \$str or die "cannot open fh $!";
        while (<$str_fh>) {
            chomp;
            s/o/i/g;
            push @lines, $_;
        }
    },

    Results (Perl 5.26, OS not specified in the original post)

    perlbrew exec bench_script.pl
    # Other versions of Perl, omitted for brevity - see original link
    # for the "gories..."
    ...
    perl-5.26.0
    ==========
                   Rate      index      regex      split filehandle
    index        3.00/s         --       -25%       -49%       -53%
    regex        3.98/s        33%         --       -33%       -37%
    split        5.91/s        97%        49%         --        -6%
    filehandle   6.31/s       111%        59%         7%         --

    Test using Perl 5.30 / 10 secs per test

    Here is what I get using Perl 5.30, with 10 seconds per test for better accuracy (Linux Kubuntu-VM 5.1.10-050110-generic #201906151034 SMP Sat Jun 15 10:36:59 UTC 2019 x86_64 GNU/Linux):

    Note #1: Keep in mind that this was run on a Linux VM on a Windows 10 host.

    Note #2: As you can see from the rates, this is a really, Really, REALLY fast host; "really fast" here meaning a "World Record Holder" of sorts.

    perl-5.30.0
    ===========
                   Rate      regex      index      split filehandle
    regex        9.98/s         --       -11%       -32%       -33%
    index        11.2/s        12%         --       -23%       -25%
    split        14.6/s        46%        30%         --        -2%
    filehandle   14.9/s        49%        33%         2%         --

    Note #3: Decided to run it on the Windows 10 host itself, but unfortunately I only have Perl 5.28.1 installed there, so it's not an apples-to-apples comparison with the above:

    perl-5.28.1 (Windows 10 Pro)
    ============================
                   Rate      regex      index filehandle      split
    regex        8.09/s         --       -14%       -17%       -33%
    index        9.46/s        17%         --        -3%       -22%
    filehandle   9.73/s        20%         3%         --       -20%
    split        12.2/s        50%        29%        25%         --

    Note #4: Surprisingly, the Linux version in a VM ran faster than the native Windows version. I attribute this either to a difference between the Perl versions or to a bad build on the Windows side.

    Note #5: Found the likely culprit for why it's slower on the Windows side (the Perl was built with gcc instead of Visual C++):

    perl -V
    =======
    ==> cc='gcc'
    ccflags =' -s -O2 -DWIN32 -DWIN64 -DCONSERVATIVE -D__USE_MINGW_ANSI_STDIO
        -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
        -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields'
    optimize='-s -O2'
    cppflags='-DWIN32'
    ccversion=''
    gccversion='7.1.0'
    ...
    Built under MSWin32
    Compiled at Dec 2 2018 14:30:03
    @INC:
      C:/Strawberry/perl/site/lib
      C:/Strawberry/perl/vendor/lib
      C:/Strawberry/perl/lib
      I wonder why the file handle approach (fh) is slower here, when it was faster in this test: Re: Is foreach split Optimized?

      If I had to wager a guess, it might be because in this thread the benchmark sets up a new filehandle on every line of input, while in the other thread it uses only a single filehandle.
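      Roughly, the difference looks like this (toy data just to illustrate the two styles, not the benchmark code itself):

      #!/usr/bin/env perl
      use strict;
      use warnings;

      my $data = "aa\tbb\tcc\nxx\tyy\tzz\n";   # toy stand-in for the real input

      # Style of the fh benchmark in this thread: a fresh in-memory
      # filehandle is opened (and closed) for every single line of input.
      open my $data_fh, '<', \$data or die $!;
      while (my $line = <$data_fh>) {
          chomp $line;
          open my $line_fh, '<', \$line or die $!;
          local $/ = "\t";
          my @fields = map { chomp; $_ } <$line_fh>;
          close $line_fh;
      }

      # Style of the other thread's benchmark: one in-memory filehandle
      # over the whole string, so the open cost is paid exactly once.
      open my $str_fh, '<', \$data or die $!;
      while (my $line = <$str_fh>) {
          chomp $line;
          # ... per-line work on $line ...
      }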

      By the way, regarding speeding things up by compiling them, this might be a case where a script written for Will_the_Chill's RPerl could give a performance benefit.

Re^2: What is the most efficient way to split a long string (see body for details/constraints)?
by Anonymous Monk on Jun 21, 2019 at 18:38 UTC
    Thanks for the detailed reply with benchmarks! I am considering pre-processing with GNU cut and then doing the Perl-y stuff on the result.
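    Something along these lines is what I have in mind (sketch only; the file name is made up, and the field list is the benchmark's 0-based @wanted shifted to cut's 1-based numbering):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Let cut narrow the input down to the wanted columns first, then do
    # the per-line Perl work on the already-reduced lines. Note that cut
    # counts fields from 1, while the benchmark's @wanted list is 0-based.
    my @cut_cmd = ('cut', '-f', '3,4,13-19,26-29,32', 'data.tsv');
    open my $cut_fh, '-|', @cut_cmd or die "cannot run cut: $!";
    while (my $line = <$cut_fh>) {
        chomp $line;
        my @sel = split /\t/, $line;   # only the selected columns remain
        # ... the Perl-y stuff goes here ...
    }
    close $cut_fh or die "cut exited with status $?";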

      I tried to edit my comment above, but I can't see any way to do this once a comment is posted (here on PerlMonks). Anyway, I really appreciate the amount of time and effort you put into creating (and running) these benchmarks. They really opened my eyes to the pros and cons of the approaches presented.

      The how-to/cookbook format of the various implementations will definitely help guide me and others in deciding which of these approaches is the best fit for the task at hand.

        Figured out why I couldn't edit my comment - I wasn't logged in! Duh... (hiding head in shame)