What is the fastest way to extract data from a delimited string?

thezip has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: What is the fastest way to extract data from a delimited string? by BrowserUk (Patriarch) on Jan 10, 2003 at 03:37 UTC
The only improvement I can see to your regex is to use non-capturing brackets for the outer set which speeds the process up very slightly. However, as [] indicates above, using split and limiting the captures is quicker (in my benchmark considerably so). The only way of doing this faster (that I thought to try) is using index and substr, which to my suprise turns out to be quicker still. The latter method is labelled as "scanem" in the results below. `Benchmark: timing 10 iterations of , regexem, scanem, splitem ... regexem: 2 wallclock secs ( 1.86 usr + 0.00 sys = 1.86 CPU) @ 5 +.37/s (n=10), scanem: 1 wallclock secs ( 0.80 usr + 0.00 sys = 0.80 CPU) @ 12 +.48/s (n=10), splitem: 1 wallclock secs ( 1.20 usr + 0.00 sys = 1.20 CPU) @ 8 +.32/s (n=10), Rate regexem splitem scanem regexem 5.37/s -- -35% -57% splitem 8.32/s 55% -- -33% scanem 12.5/s 133% 50% -- 2000, 2000, 2000 c:\test>` [download] Some quick math shows that your lines average around 2000 chars, and I opted to generate test data with random lengths of between 5 and 12 chars (for approx 200 fields per line) and use a space as the separator. The benchmark pushes every field, extracted one at a time from the test data, onto an array. These arrays are compared at the end to ensure identical results. Benchmark code Read more... (2 kB) Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.	[reply] [d/l] [select]
Re: What is the fastest way to extract data from a delimited string? by cchampion (Curate) on Jan 10, 2003 at 00:40 UTC
Why is split not up to the task? It seems to me that it is faster than your regex. #!/usr/bin/perl -w use strict; use Benchmark; my $column = 6; my $delim = " "; my $s = qq(a bb ccc dddd eeee fff gggggg hh i jjjjjjjj); timethese (1000000,{ 'regex' => sub { my($result) = $s =~ /(([^$delim]*)$delim?){$column}/; }, 'split' => sub { my ($result) = (split /$delim/, $s)[$column] } }); __END__ Benchmark: timing 1000000 iterations of regex, split... regex: 10 wallclock secs ( 8.90 usr + -0.00 sys = 8.90 CPU) split: 9 wallclock secs ( 8.27 usr + 0.02 sys = 8.29 CPU) Benchmark: timing 1000000 iterations of regex, split... regex: 9 wallclock secs ( 8.69 usr + 0.09 sys = 8.78 CPU) split: 7 wallclock secs ( 7.87 usr + -0.00 sys = 7.87 CPU) [download] The only difference is that split will start numbering your columns from 0, so to get the 6th column you should use $column = 5;	[reply] [d/l]
Re: Re: What is the fastest way to extract data from a delimited string? by thezip (Vicar) on Jan 10, 2003 at 01:10 UTC
Your results seem convincing, yet everything I have heard about split is that it not the most efficient solution. My instincts tell me that there is a regex which will outperform both split and the regex I have submitted for criticism. Vannah, I'd like to buy a REGEX please... Where do you want them* to go today?*	[reply]
Re: Re: Re: What is the fastest way to extract data from a delimited string? by helgi (Hermit) on Jan 10, 2003 at 12:55 UTC
You're out of luck then because split will always be much faster than any non-trivial regex. For large files, split is the only way to go. Your instincts are wrong, completely wrong. -- Regards, Helgi Briem helgi AT decode DOT is	[reply]
Re: What is the fastest way to extract data from a delimited string? by sauoq (Abbot) on Jan 10, 2003 at 00:29 UTC
Text::CSV_XS (Your delimiter doesn't have to be a comma despite the name.) -sauoq "My two cents aren't worth a dime.";	[reply]
Re: What is the fastest way to extract data from a delimited string? by dpuu (Chaplain) on Jan 10, 2003 at 00:05 UTC
One thing that might help, with split, is to use the third arg to limit the number of columns it extracts. This will be especially helpful if there are a lot of columns to the right of the one that you want. --Dave	[reply]
Re: Re: What is the fastest way to extract data from a delimited string? by thezip (Vicar) on Jan 10, 2003 at 00:18 UTC
I have many files where $column will be near the end of the string, so this won't provide much benefit. I'm actually hoping that there is speedy regex solution for this... Where do you want them* to go today?*	[reply]
Re: What is the fastest way to extract data from a delimited string? by runrig (Abbot) on Jan 10, 2003 at 00:18 UTC
If you only need up to the 6th column, you might try supplying the third argument to split so it will stop splitting after the 6th column. If you're on Unix or have unix tools, and you only need the one column, you might try using cut or awk: `open(FH, 'cut -d" " -c6 file\|') or die $!; # or open(FH, "awk '{print \$6}' file\|") ... while(<FH>) { ... } close FH or die $!; # Check close on pipe opens # Also, you might try a regex which # uses qr and only saves what you want my $re = qr/(?:\S+\s+){5}(\S+)/; ... while (<FH>) { next unless $_ =~ $re; my $field = $1; ... }` [download] Updated.	[reply] [d/l]
Re: What is the fastest way to extract data from a delimited string? by jmcnamara (Monsignor) on Jan 10, 2003 at 09:11 UTC
You could write this as a simple one-liner as follows, see perlrun for an explanation of the command line options: `perl -lane 'print $F[5]' file` Or to get the sixth column only if there is one: `perl -lane 'print $F[5] if @F > 5' file perl -lane 'print $F[5] if defined $F[5]' file` [download] You can also specify a delimiter pattern using the `-F` flag. Here is an example for simple comma separated data: `perl -F, -lane 'print $F[5]' file` There is also the Unix `cut` utility and the modules Text::CSV_XS and Text::xSV. And finally, since `perl -lane` is just a synonym fo `awk`: `;-)` `awk '{print $6}' file` -- John.	[reply] [d/l] [select]
Re: What is the fastest way to extract data from a delimited string? by thezip (Vicar) on Jan 10, 2003 at 05:12 UTC
Thank you all for your enlightening comments! I will absorb them into my personal knowledgebase. I must admit, though, that I'm a little surprised with the facts as they have been presented. Thank you BrowserUK and cchampion for reminding me that benchmarking always wins out over gut instincts. Where do you want them* to go today?*	[reply]
Re: What is the fastest way to extract data from a delimited string? by John M. Dlugosz (Monsignor) on Jan 10, 2003 at 16:06 UTC
If split isn't doing it for you, there seems to be no faster way in pure Perl. You could write a filter in C. Using a memory-mapped file (or large buffer) it should be I/O bound and take no appreciable time to locate the delimiters. Copy them to standard output. Then, call that as a pipe-in from the Perl open-file command. —John	[reply]
Re: What is the fastest way to extract data from a delimited string? by JamesNC (Chaplain) on Jan 10, 2003 at 15:31 UTC
... yet another way `use strict; my (@list) = map { /\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+(\w+)\s+.*/} <DATA>; print "@list"; __DATA__ 1 2 3 4 5 6 7 8 james joe jack janet jill joann joeanne bill bob alice joe jack janet jill alice joeanne bill bob james joe jack janet jill susan joeanne bill bob james joe jack janet jill sarah joeanne bill bob james joe jack janet jill ethel joeanne bill bob` [download] Only using map... very brief though :)	[reply] [d/l]
Re: Re: What is the fastest way to extract data from a delimited string? by dragonchild (Archbishop) on Jan 10, 2003 at 15:49 UTC
Not very fast or maintenable, though. How about something like: `sub read_stuff { my ($handle, $delim, $col) = @_; my $prep = '^\s'; $prep .= "\w+${delim}+" for 1 .. $col - 1; $prep .= "(\w+)(?:${delim}\|$)"; my $regex = qr/$prep/; my @list = map { /$regex/o } <$handle>; return @list; } my $handle = IO::File->new(filename => "some_file"); my @list = read_stuff($handle, ' ', 6);` [download] Remember something - "fast" is a relative term. You might squeeze an additional millisecond from your CPU, but require an extra 8 hours every time you want to make a change to your code. To me, that's not worth it. My 8 hours is worth more than one millisecond of CPU time. To me, "fast" has to do with acceptable CPU progress and lightning-quick maintenance time. YMMV. ------ We are the carpenters and bricklayers of the Information Age.* Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.	[reply] [d/l]
Re: What is the fastest way to extract data from a delimited string? by hardburn (Abbot) on Jan 10, 2003 at 17:05 UTC
Is there any reason why you can't move this stuff into a real database instead of a flat file? If you can't install MySQL or PostgreSQL or insert favoriate RDMS, you can get the DBD::SQLite, which embeds an RDMS right into the module itself.	[reply]
Re: What is the fastest way to extract data from a delimited string? by Anonymous Monk on Jan 11, 2003 at 08:22 UTC
substr and index	[reply]