What is the most efficient way to split a long string (see body for details/constraints)?

by mikegold10 (Acolyte)
on Jun 19, 2019 at 14:43 UTC ( [id://11101571]=perlquestion )

mikegold10 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

The details of my query follow:

  • I have a very large TSV (tab-separated values) file, where "large" means > 30 GB.
  • I want to extract only the lines that don't end with an empty last field. Since this is a TSV file, those are the lines that don't end with \t\n, which is a trivial test and not the subject of this question. That alone will remove some 75% of the lines, right off the bat.
  • Then I want to extract a small subset of fields from the remaining lines. The fields are not contiguous, but they are few relative to the total (say, fourteen out of thirty-two): for example, fields 2, 3, 12-18, 25-28, and 31.
  • The lines I am extracting from are very long; most are around 1,000 characters, because they contain a large number of tab-delimited fields.
  • One obvious option is to use the following simple code, which I've formatted and commented to show my reasoning:

    use warnings;
    use strict;
    # I am using the latest stable version of Perl for this exercise
    use 5.30.0;

    while (<>) {
        # Skip lines ending with an empty field
        next if substr($_, -2) eq "\t\n";

        # Remove "\n"
        chomp;

        # Split matching lines into fields on "\t", creating @fields
        my @fields = split /\t/, $_;

        # Copy only the desired fields from @fields to create a new
        # line in TSV format
        # This can be done in one simple step in Perl, using
        # array slices and the join() function
        my $new_line = join "\t", @fields[ 2, 3, 12 .. 18, 25 .. 28, 31 ];

        # ...
    }

    But using split, although easy, leads to extra parsing (beyond the last field I need) and produces a complete array of fields, which I also don't need. I think it would be more efficient not to create the array, but to parse each line looking for tabs, counting field indexes as I go, building the output line along the way, and stopping at the last field I need.

    Am I correct in my assessment, or is just doing a simple split, followed by joined slices of the fields of interest, the best way to go here from a performance perspective?

    Replies are listed 'Best First'.
    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by haukex (Archbishop) on Jun 19, 2019 at 18:36 UTC

      split is pretty darn fast. But if in doubt, a benchmark confirms this:

      #!/usr/bin/env perl
      use warnings;
      use strict;
      use List::Util qw/max/;
      use Benchmark qw/cmpthese/;

      use constant WITHTEST => 0;

      my $cols = 32;
      my $row  = join "\t", map { sprintf("%02d",$_) x 16 } 0..($cols-1);
      my $data = ( $row . "\n" ) x 100;
      open my $fh, '<', \$data or die $!;

      my @wanted = (2,3,12..18,25..28,31);
      #my @wanted = (2,3,10..15);
      my $wanted_max = max @wanted;
      my @wanted2 = (0) x $cols;
      @wanted2[@wanted] = (1) x @wanted;
      my ($wanted_re) = map { qr/\A$_\n?\z/ } join '\t',
          map { $_ ? '([^\t\n]++)' : '[^\t\n]++' } @wanted2;
      my $expect = join "\t", map { sprintf("%02d",$_) x 16 } @wanted;

      cmpthese(-2, {
          split => sub {
              seek $fh, 0, 0 or die;
              while (<$fh>) {
                  chomp;
                  my @sel = (split /\t/, $_, $cols)[@wanted];
                  if (WITHTEST) { die "@sel\n$expect\n"
                      unless join("\t",@sel) eq $expect }
              }
          },
          scan => sub {
              seek $fh, 0, 0 or die;
              while (<$fh>) {
                  chomp;
                  my ($pos,$i,$prevpos,@sel) = (0,0);
                  while ( $pos>=0 && $i<=$wanted_max ) {
                      $prevpos = $pos;
                      $pos = index($_, "\t", $pos+1);
                      push @sel, substr($_, $prevpos+1,
                              ($pos<0 ? length : $pos)-$prevpos-1 )
                          if $wanted2[$i++];
                  }
                  if (WITHTEST) { die "@sel\n$expect\n"
                      unless join("\t",@sel) eq $expect }
              }
          },
          regex => sub {
              seek $fh, 0, 0 or die;
              while (<$fh>) {
                  my @sel = /$wanted_re/ or die $_;
                  if (WITHTEST) { die "@sel\n$expect\n"
                      unless join("\t",@sel) eq $expect }
              }
          },
          fh => sub {
              seek $fh, 0, 0 or die;
              while ( my $line = <$fh> ) {
                  chomp($line);
                  open my $fh2, '<', \$line or die $!;
                  local $/ = "\t";
                  my @sel;
                  for my $i (0..$wanted_max) {
                      my $d = <$fh2>;
                      next unless $wanted2[$i];
                      chomp $d;
                      push @sel, $d;
                  }
                  close $fh2;
                  if (WITHTEST) { die "@sel\n$expect\n"
                      unless join("\t",@sel) eq $expect }
              }
          },
      });

      __END__
               Rate regex    fh  scan split
      regex  1456/s    --  -11%  -13%  -68%
      fh     1643/s   13%    --   -1%  -64%
      scan   1665/s   14%    1%    --  -64%
      split  4586/s  215%  179%  175%    --

      Remember that split is implemented internally in C, while the above scan is implemented in Perl. You could probably beat split by implementing something like this in C, but then again, whether that's worth the effort depends on how much more speed you need. Update: For the sake of completeness: I'm not surprised the regex solution is slower; regexes are very powerful, but working with fixed strings often outperforms them. I added the fh solution because it was the fastest in the thread I linked to above.
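      For the curious, that "implement it in C" idea could look something like the following Inline::C sketch. This is an illustration added in editing, not code from the thread: the function name extract_fields and the hard-coded field list are invented, and it skips the empty-last-field test for brevity. (The <<~ indented heredoc needs Perl 5.26+, which the OP has.)

      use strict;
      use warnings;

      # Compile the C scanner at startup (requires the Inline module;
      # Inline compiles once and caches the result).
      use Inline C => <<~'END_C';
          void extract_fields(char *line) {
              /* wanted 0-based field indexes, sorted ascending (hard-coded) */
              static const int wanted[] = { 2, 3, 12, 13, 14, 15, 16, 17, 18,
                                            25, 26, 27, 28, 31 };
              const int nwanted = sizeof(wanted) / sizeof(wanted[0]);
              char *start = line;
              char *p     = line;
              int field = 0, w = 0;
              Inline_Stack_Vars;
              Inline_Stack_Reset;
              while (w < nwanted) {
                  if (*p == '\t' || *p == '\n' || *p == '\0') {
                      if (field == wanted[w]) {
                          /* push a copy of the field onto the return stack */
                          Inline_Stack_Push(sv_2mortal(newSVpvn(start, p - start)));
                          w++;
                      }
                      if (*p != '\t') break;   /* end of line reached */
                      start = p + 1;
                      field++;
                  }
                  p++;
              }
              Inline_Stack_Done;
          }
          END_C

      # Quick demo on a made-up 40-field line:
      my $line = join("\t", map { "f$_" } 0 .. 39) . "\n";
      print join("|", extract_fields($line)), "\n";
      # prints f2|f3|f12|f13|f14|f15|f16|f17|f18|f25|f26|f27|f28|f31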

        I wonder why the file handle approach (fh) is slower here, when it was faster in this test:

        Re: Is foreach split Optimized?

        Benchmark from post linked above:

        Test Code

        filehandle => sub {
            my @lines;
            open my $str_fh, "<", \$str or die "cannot open fh $!";
            while (<$str_fh>) {
                chomp;
                s/o/i/g;
                push @lines, $_;
            }
        },

        Results (Perl 5.26; OS unknown)

        perlbrew exec bench_script.pl
        # Other versions of Perl, omitted for brevity - see original link
        # for the "gories..."
        ...
        perl-5.26.0
        ==========
                       Rate      index      regex      split filehandle
        index        3.00/s         --       -25%       -49%       -53%
        regex        3.98/s        33%         --       -33%       -37%
        split        5.91/s        97%        49%         --        -6%
        filehandle   6.31/s       111%        59%         7%         --

        Test using Perl 5.30 / 10 secs per test

        Here is what I get using Perl 5.30, with 10 seconds per test for more accurate results (Linux Kubuntu VM, 5.1.10-050110-generic #201906151034 SMP Sat Jun 15 10:36:59 UTC 2019 x86_64 GNU/Linux):

        Note #1: Keep in mind that this was run on a Linux VM on a Windows 10 host

        Note #2: As you can see from the rates, it is a really, Really, REALLY fast host; "World Record Holder" fast, of sorts

        perl-5.30.0
        ===========
                       Rate      regex      index      split filehandle
        regex        9.98/s         --       -11%       -32%       -33%
        index        11.2/s        12%         --       -23%       -25%
        split        14.6/s        46%        30%         --        -2%
        filehandle   14.9/s        49%        33%         2%         --

        Note #3: I decided to run it on the Windows 10 host itself, but unfortunately I only have Perl 5.28.1 installed there, so it's not an apples-to-apples comparison with the above:

        perl-5.28.1 (Windows 10 Pro)
        ============================
                       Rate      regex      index filehandle      split
        regex        8.09/s         --       -14%       -17%       -33%
        index        9.46/s        17%         --        -3%       -22%
        filehandle   9.73/s        20%         3%         --       -20%
        split        12.2/s        50%        29%        25%         --

        Note #4: Surprisingly, the Linux version in a VM ran faster than the native Windows version. I attribute this either to a difference between the Perl versions or to a bad build on the Windows side.

        Note #5: Found the culprit for why it's slower on the Windows side (it was built with gcc instead of Visual C++):

        perl -V
        =======
        ==> cc='gcc'
        ccflags =' -s -O2 -DWIN32 -DWIN64 -DCONSERVATIVE -D__USE_MINGW_ANSI_STDIO
          -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
          -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields'
        optimize='-s -O2'
        cppflags='-DWIN32'
        ccversion=''
        gccversion='7.1.0'
        ...
        Built under MSWin32
        Compiled at Dec 2 2018 14:30:03
        @INC:
          C:/Strawberry/perl/site/lib
          C:/Strawberry/perl/vendor/lib
          C:/Strawberry/perl/lib
          I wonder why the file handle approach (fh) is slower here, when it was faster in this test: Re: Is foreach split Optimized?

          If I had to hazard a guess, it might be because in this thread the benchmark sets up a new filehandle for every line of input, while in the other thread it's only a single filehandle.

          By the way, with regard to speeding things up by compiling them, this might be a case where a script written for Will_the_Chill's RPerl could give performance benefits.

        Thanks for the detailed reply with benchmarks! I am considering pre-processing with GNU cut and then doing the Perl work on the result.
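        A minimal sketch of that cut idea (not from the thread; the file name data.tsv is a stand-in): let GNU cut pull out the wanted columns and pipe the result into Perl. Note that cut counts fields from 1, so the 0-based list 2,3,12..18,25..28,31 becomes 3,4,13-19,26-29,32, and the empty-last-field filter would have to happen before the cut, or be based on one of the surviving fields.

        use strict;
        use warnings;

        # List-form pipe open (needs a platform with fork, e.g. Linux).
        open my $cut, '-|', 'cut', '-f3,4,13-19,26-29,32', 'data.tsv'
            or die "cannot run cut: $!";
        while (my $line = <$cut>) {
            # $line now holds only the wanted fields, still tab separated
            print $line;
        }
        close $cut or die "cut failed: $!";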

          I tried to edit my comment above, but I can't see any way to do this once a comment is posted (here on PerlMonks). Anyway, I really appreciate the amount of time and effort you put into creating (and executing) these benchmarks. They really opened my eyes to the pros and cons of the approaches presented.

          The how-to/cookbook format of the various implementations will definitely help guide me and others in deciding when each of these approaches is the best fit for the task at hand.

    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by Corion (Patriarch) on Jun 19, 2019 at 15:27 UTC

      If your lines have substantially more fields than 31, you can use the LIMIT parameter to split:

      my @fields = split /\t/, $_, 32;

      This prevents split from doing unnecessary work, but you should benchmark whether it actually makes a difference.
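      One caveat worth illustrating with a quick example (my addition, not from the reply above): the final element of a LIMIT-ed split soaks up the unsplit remainder of the line, so to keep the last wanted field clean, set LIMIT one past it:

      use strict;
      use warnings;

      my @three = split /\t/, "a\tb\tc\td\te", 3;
      print "$three[2]\n";   # element 2 is "c\td\te" -- it carries the tail
      my @four = split /\t/, "a\tb\tc\td\te", 4;
      print "$four[2]\n";    # element 2 is now just "c"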

        OK, so using LIMIT answers my question about early termination (of the parse of each line). But I don't feel that this is the most important aspect performance-wise, since I do want a field near the end of the line.
    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by hippo (Bishop) on Jun 19, 2019 at 15:27 UTC
      the best way to go here from a performance perspective?

      Best performance means different things to different people. You might want to minimize any, some or all of: inaccuracies, RAM usage, CPU usage, wallclock time, etc.

      If we assume wallclock time then the limiting factor is likely to be just reading 30GB of data in the first place. Anything you do with split or substr will probably be dwarfed by that. Start by profiling and see where the bottlenecks lie.
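      One quick sanity check along those lines (a sketch, with data.tsv standing in for the real file): time a pass that does nothing but read lines. If that alone takes nearly as long as the full job, the choice between split and substr barely matters.

      use strict;
      use warnings;
      use Time::HiRes qw(time);

      my $t0 = time;
      open my $fh, '<', 'data.tsv' or die $!;
      my $lines = 0;
      $lines++ while <$fh>;
      close $fh;
      printf "read-only pass: %d lines in %.2f s\n", $lines, time - $t0;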

        Best performance in terms of wall-time...
    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by marioroy (Prior) on Jun 20, 2019 at 04:31 UTC

      Hi mikegold10,

      Tonight I was browsing this site and came across your post. It's been a while since I last posted here.

      MCE's chunking engine is beneficial for your case. Moreover, MCE::Relay makes it possible to emit output serially and in order, if needed. Notice how the parallel code looks very much like the serial code; the extra bit is that workers loop over a chunk of lines at a time.

      use warnings;
      use strict;
      use 5.30.0;

      use MCE::Loop;

      MCE::Loop::init(
          max_workers => 4,
          init_relay  => 1,   # enables the MCE::Relay feature
      );

      mce_loop_f {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          my $output = '';

          for my $line ( @{ $chunk_ref } ) {
              # Skip lines ending with an empty field
              next if substr($line, -2) eq "\t\n";

              # Remove "\n"
              chomp $line;

              # Split matching lines into fields on "\t", creating @fields
              my @fields = split /\t/, $line;

              # Copy only the desired fields from @fields to create a new
              # line in TSV format
              # This can be done in one simple step in Perl, using
              # array slices and the join() function
              my $new_line = join "\t", @fields[ 2, 3, 12..18, 25..28, 31 ];

              # Append to buffer with newline char
              $output .= $new_line . "\n";
          }

          # The MCE relay takes a code block and runs it serially
          # and in order, one worker at a time. Ordering is driven
          # by the chunk_id value behind the scenes. Thus, MCE::relay
          # must be called once per chunk.
          MCE::relay { print $output; STDOUT->flush; };

      } \*STDIN;

      # This signals the workers to exit.
      # If omitted, called automatically when the script terminates.
      MCE::Loop::finish;
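      Assuming the above were saved as, say, extract_mce.pl (a made-up name), it would be run as perl extract_mce.pl < big.tsv > subset.tsv, since mce_loop_f is reading from \*STDIN here.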

      Regards, Mario

    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by shadowsong (Pilgrim) on Jun 19, 2019 at 15:35 UTC

      Hi mikegold10

      Maybe you can still achieve what you want with very minor changes to your existing code.

      But, using split, although easy, leads to extra parsing (beyond the last field I need) and produces a complete array of fields which I also don't need.

      Really? Try using split in this form:

      split /PATTERN/,EXPR,LIMIT

      If LIMIT is specified and positive, it represents the maximum number of fields into which the EXPR may be split; in other words, LIMIT is one greater than the maximum number of times EXPR may be split. Thus, the LIMIT value 1 means that EXPR may be split a maximum of zero times, producing a maximum of one field (namely, the entire value of EXPR).
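      Applying that rule to the field list in the question (a small example of my own, not from the perldoc): the last wanted index is 31, so a LIMIT of 33 parses fields 0..31 cleanly and lumps everything after into element 32, which the slice never touches:

      use strict;
      use warnings;

      my $line = join "\t", map { "f$_" } 0 .. 39;   # 40 made-up fields
      my @sel  = ( split /\t/, $line, 33 )[ 2, 3, 12 .. 18, 25 .. 28, 31 ];
      print join("\t", @sel), "\n";   # f2 f3 f12 .. f18 f25 .. f28 f31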

      See split - perldoc.perl.org

      Cheers,
      Shadowsong

    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by jwkrahn (Abbot) on Jun 19, 2019 at 18:14 UTC

      You can do what you want without creating an array by using a list slice:

      # Copy only the desired fields to create a new
      # line in TSV format
      # This can be done in one simple step in Perl, using
      # list slices and the join() function
      my $new_line = join "\t", ( split /\t/, $_ )[ 2, 3, 12 .. 18, 25 .. 28, 31 ];

      # ...
        If this in fact prevents a temporary array from being created (i.e., it slices the list in place), it is an interesting approach I will consider...
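        A quick way to test that hunch is to benchmark the two forms directly (a sketch of my own, along the lines of haukex's benchmark above):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $line = join "\t", map { "f$_" } 0 .. 39;

        cmpthese(-2, {
            named_array => sub {
                my @fields = split /\t/, $line;
                my $new = join "\t", @fields[ 2, 3, 12 .. 18, 25 .. 28, 31 ];
            },
            list_slice => sub {
                my $new = join "\t",
                    ( split /\t/, $line )[ 2, 3, 12 .. 18, 25 .. 28, 31 ];
            },
        });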
    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by GrandFather (Saint) on Jun 19, 2019 at 21:26 UTC
      1. How long does it take now?
      2. What's the longest time it can be allowed to take?
      3. How many times a day do you need to do it?
      4. Why is it a TSV file rather than a database?
      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by holli (Abbot) on Jun 20, 2019 at 05:26 UTC
      Edit: Bummer, this doesn't really work for you since that method can only be used on a whole file rather than separate lines.

      I won't say it's the fastest way, but this is definitely the most flexible and convenient one. It also sets you up for future changes in the field selection; you can even put the fragment identifier into a config file or something.
      use Modern::Perl;
      use Text::CSV_XS;
      use Data::Dumper;

      my $csv = Text::CSV_XS->new({ sep_char => "," });

      # normally this would be a real file handle
      my $line = "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z";
      open my $fh, "<", \$line;

      my $row = $csv->fragment($fh, "col=1;3;10-12;23-*");
      print Dumper($row);
      Output
      $VAR1 = [
                [
                  'a',
                  'c',
                  'j',
                  'k',
                  'l',
                  'x',
                  'y',
                  'z'
                ]
              ];


      holli

      You can lead your users to water, but alas, you cannot drown them.
        I appreciate your answer from a convenience perspective, for when speed is not an issue. I was not aware of the ease with which Text::CSV_XS can handle field extraction. And with the _XS part, maybe it is performant as well?!
    Re: What is the most efficient way to split a long string (see body for details/constraints)?
    by choroba (Cardinal) on Jun 19, 2019 at 15:21 UTC
      Crossposted to StackOverflow. Crossposting is OK, but it's considered polite to mention it, to prevent unnecessary duplicated work by people who don't follow both sites.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
