thezip has asked for the wisdom of the Perl Monks concerning the following question:

G'day,

What is the fastest way to extract the data in a specified "column" from a delimited string?

I have many largish files (85 MB, 45,000 lines) and split isn't especially effective for these.

The following example works (ie. prints "fff").

use warnings; use strict; my $column = 6; my $delim = " "; my $s = qq(a bb ccc dddd eeee fff gggggg hh i jjjjjjjj); my($result) = $s =~ /(([^$delim]*)$delim?){$column}/; print $result, "\n";

Any suggestions?

Where do you want *them* to go today?

Replies are listed 'Best First'.
Re: What is the fastest way to extract data from a delimited string?
by BrowserUk (Patriarch) on Jan 10, 2003 at 03:37 UTC

    The only improvement I can see to your regex is to use non-capturing brackets for the outer set which speeds the process up very slightly. However, as [] indicates above, using split and limiting the captures is quicker (in my benchmark considerably so).

    The only way of doing this faster (that I thought to try) is using index and substr, which to my suprise turns out to be quicker still. The latter method is labelled as "scanem" in the results below.

    Benchmark: timing 10 iterations of , regexem, scanem, splitem ... regexem: 2 wallclock secs ( 1.86 usr + 0.00 sys = 1.86 CPU) @ 5 +.37/s (n=10), scanem: 1 wallclock secs ( 0.80 usr + 0.00 sys = 0.80 CPU) @ 12 +.48/s (n=10), splitem: 1 wallclock secs ( 1.20 usr + 0.00 sys = 1.20 CPU) @ 8 +.32/s (n=10), Rate regexem splitem scanem regexem 5.37/s -- -35% -57% splitem 8.32/s 55% -- -33% scanem 12.5/s 133% 50% -- 2000, 2000, 2000 c:\test>

    Some quick math shows that your lines average around 2000 chars, and I opted to generate test data with random lengths of between 5 and 12 chars (for approx 200 fields per line) and use a space as the separator.

    The benchmark pushes every field, extracted one at a time from the test data, onto an array. These arrays are compared at the end to ensure identical results.

    Benchmark code


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: What is the fastest way to extract data from a delimited string?
by cchampion (Curate) on Jan 10, 2003 at 00:40 UTC

    Why is split not up to the task? It seems to me that it is faster than your regex.

    #!/usr/bin/perl -w use strict; use Benchmark; my $column = 6; my $delim = " "; my $s = qq(a bb ccc dddd eeee fff gggggg hh i jjjjjjjj); timethese (1000000,{ 'regex' => sub { my($result) = $s =~ /(([^$delim]*)$delim?){$column}/; }, 'split' => sub { my ($result) = (split /$delim/, $s)[$column] } }); __END__ Benchmark: timing 1000000 iterations of regex, split... regex: 10 wallclock secs ( 8.90 usr + -0.00 sys = 8.90 CPU) split: 9 wallclock secs ( 8.27 usr + 0.02 sys = 8.29 CPU) Benchmark: timing 1000000 iterations of regex, split... regex: 9 wallclock secs ( 8.69 usr + 0.09 sys = 8.78 CPU) split: 7 wallclock secs ( 7.87 usr + -0.00 sys = 7.87 CPU)

    The only difference is that split will start numbering your columns from 0, so to get the 6th column you should use $column = 5;

      Your results seem convincing, yet everything I have heard about split is that it not the most efficient solution.

      My instincts tell me that there is a regex which will outperform both split and the regex I have submitted for criticism.

      Vannah, I'd like to buy a *REGEX* please...

      Where do you want *them* to go today?
        You're out of luck then because split will *always* be much faster than any non-trivial regex. For large files, split is the only way to go. Your instincts are wrong, completely wrong.

        --
        Regards,
        Helgi Briem
        helgi AT decode DOT is

Re: What is the fastest way to extract data from a delimited string?
by sauoq (Abbot) on Jan 10, 2003 at 00:29 UTC

    Text::CSV_XS (Your delimiter doesn't have to be a comma despite the name.)

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: What is the fastest way to extract data from a delimited string?
by dpuu (Chaplain) on Jan 10, 2003 at 00:05 UTC
    One thing that might help, with split, is to use the third arg to limit the number of columns it extracts. This will be especially helpful if there are a lot of columns to the right of the one that you want. --Dave

      I have many files where $column will be near the end of the string, so this won't provide much benefit.

      I'm actually hoping that there is speedy regex solution for this...

      Where do you want *them* to go today?
Re: What is the fastest way to extract data from a delimited string?
by runrig (Abbot) on Jan 10, 2003 at 00:18 UTC
    If you only need up to the 6th column, you might try supplying the third argument to split so it will stop splitting after the 6th column. If you're on Unix or have unix tools, and you only need the one column, you might try using cut or awk:
    open(FH, 'cut -d" " -c6 file|') or die $!; # or open(FH, "awk '{print \$6}' file|") ... while(<FH>) { ... } close FH or die $!; # Check close on pipe opens # Also, you might try a regex which # uses qr and only saves what you want my $re = qr/(?:\S+\s+){5}(\S+)/; ... while (<FH>) { next unless $_ =~ $re; my $field = $1; ... }
    Updated.
Re: What is the fastest way to extract data from a delimited string?
by jmcnamara (Monsignor) on Jan 10, 2003 at 09:11 UTC

    You could write this as a simple one-liner as follows, see perlrun for an explanation of the command line options:     perl -lane 'print $F[5]' file

    Or to get the sixth column only if there is one:

    perl -lane 'print $F[5] if @F > 5' file perl -lane 'print $F[5] if defined $F[5]' file

    You can also specify a delimiter pattern using the -F flag. Here is an example for *simple* comma separated data:     perl -F, -lane 'print $F[5]' file

    There is also the Unix cut utility and the modules Text::CSV_XS and Text::xSV.

    And finally, since perl -lane is just a synonym fo awk: ;-)     awk '{print $6}' file

    --
    John.

Re: What is the fastest way to extract data from a delimited string?
by thezip (Vicar) on Jan 10, 2003 at 05:12 UTC

    Thank you all for your enlightening comments! I will absorb them into my personal knowledgebase. I must admit, though, that I'm a little surprised with the facts as they have been presented.

    Thank you BrowserUK and cchampion for reminding me that benchmarking always wins out over gut instincts.

    Where do you want *them* to go today?
Re: What is the fastest way to extract data from a delimited string?
by John M. Dlugosz (Monsignor) on Jan 10, 2003 at 16:06 UTC
    If split isn't doing it for you, there seems to be no faster way in pure Perl.

    You could write a filter in C. Using a memory-mapped file (or large buffer) it should be I/O bound and take no appreciable time to locate the delimiters. Copy them to standard output.

    Then, call that as a pipe-in from the Perl open-file command.

    —John

Re: What is the fastest way to extract data from a delimited string?
by JamesNC (Chaplain) on Jan 10, 2003 at 15:31 UTC
    ... yet another way
    use strict; my (@list) = map { /\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+(\w+)\s+.*/} <DATA>; print "@list"; __DATA__ 1 2 3 4 5 6 7 8 james joe jack janet jill joann joeanne bill bob alice joe jack janet jill alice joeanne bill bob james joe jack janet jill susan joeanne bill bob james joe jack janet jill sarah joeanne bill bob james joe jack janet jill ethel joeanne bill bob
    Only using map... very brief though :)
      Not very fast or maintenable, though. How about something like:
      sub read_stuff { my ($handle, $delim, $col) = @_; my $prep = '^\s*'; $prep .= "\w+${delim}+" for 1 .. $col - 1; $prep .= "(\w+)(?:${delim}|$)"; my $regex = qr/$prep/; my @list = map { /$regex/o } <$handle>; return @list; } my $handle = IO::File->new(filename => "some_file"); my @list = read_stuff($handle, ' ', 6);
      Remember something - "fast" is a relative term. You might squeeze an additional millisecond from your CPU, but require an extra 8 hours every time you want to make a change to your code. To me, that's not worth it. My 8 hours is worth more than one millisecond of CPU time. To me, "fast" has to do with acceptable CPU progress and lightning-quick maintenance time. YMMV.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: What is the fastest way to extract data from a delimited string?
by hardburn (Abbot) on Jan 10, 2003 at 17:05 UTC

    Is there any reason why you can't move this stuff into a real database instead of a flat file? If you can't install MySQL or PostgreSQL or *insert favoriate RDMS*, you can get the DBD::SQLite, which embeds an RDMS right into the module itself.

Re: What is the fastest way to extract data from a delimited string?
by Anonymous Monk on Jan 11, 2003 at 08:22 UTC
    substr and index