Re: What is the fastest way to extract data from a delimited string?
by BrowserUk (Patriarch) on Jan 10, 2003 at 03:37 UTC
The only improvement I can see to your regex is to use non-capturing brackets for the outer set, which speeds the process up very slightly. However, as [] indicates above, using split and limiting the captures is quicker (considerably so in my benchmark).
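For reference, the non-capturing variant looks something like this (a sketch only; $s, $delim and $column are assumed to hold the line, the single-character separator and the 1-based field number):

my ($result) = $s =~ /(?:([^$delim]*)$delim?){$column}/;   # inner capture leaves only the wanted field in $1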
The only way of doing this faster (that I thought to try) is using index and substr, which to my surprise turned out to be quicker still. That method is labelled "scanem" in the results below.
Benchmark: timing 10 iterations of regexem, scanem, splitem...
regexem: 2 wallclock secs ( 1.86 usr + 0.00 sys = 1.86 CPU) @  5.37/s (n=10)
 scanem: 1 wallclock secs ( 0.80 usr + 0.00 sys = 0.80 CPU) @ 12.48/s (n=10)
splitem: 1 wallclock secs ( 1.20 usr + 0.00 sys = 1.20 CPU) @  8.32/s (n=10)

            Rate regexem splitem  scanem
regexem   5.37/s      --    -35%    -57%
splitem   8.32/s     55%      --    -33%
scanem    12.5/s    133%     50%      --

2000, 2000, 2000

c:\test>
Some quick math shows that your lines average around 2000 chars, so I opted to generate test data with fields of random lengths between 5 and 12 chars (roughly 200 fields per line), using a space as the separator.
The benchmark pushes every field, extracted one at a time from the test data, onto an array. These arrays are compared at the end to ensure identical results.
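The index/substr scan works roughly like this (a sketch of the idea only, not the benchmarked code; $line and $sep are assumed to hold one record and the single-character separator):

# walk the string once, slicing out each field between separators
my @fields;
my $pos = 0;
my $next;
while (($next = index($line, $sep, $pos)) >= 0) {
    push @fields, substr($line, $pos, $next - $pos);
    $pos = $next + 1;
}
push @fields, substr($line, $pos);   # whatever follows the last separator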
Benchmark code
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
Re: What is the fastest way to extract data from a delimited string?
by cchampion (Curate) on Jan 10, 2003 at 00:40 UTC
#!/usr/bin/perl -w
use strict;
use Benchmark;

my $column = 6;
my $delim  = " ";
my $s = qq(a bb ccc dddd eeee fff gggggg hh i jjjjjjjj);

timethese( 1000000, {
    'regex' => sub {
        my ($result) = $s =~ /(([^$delim]*)$delim?){$column}/;
    },
    'split' => sub {
        my ($result) = (split /$delim/, $s)[$column];
    },
});
__END__
Benchmark: timing 1000000 iterations of regex, split...
regex: 10 wallclock secs ( 8.90 usr + -0.00 sys = 8.90 CPU)
split: 9 wallclock secs ( 8.27 usr + 0.02 sys = 8.29 CPU)
Benchmark: timing 1000000 iterations of regex, split...
regex: 9 wallclock secs ( 8.69 usr + 0.09 sys = 8.78 CPU)
split: 7 wallclock secs ( 7.87 usr + -0.00 sys = 7.87 CPU)
The only difference is that split will start numbering your columns from 0, so to get the 6th column you should use $column = 5;
Your results seem convincing, yet everything I have heard about split is that it is not the most efficient solution.
My instincts tell me that there is a regex which will outperform both split and the regex I have submitted for criticism.
Vannah, I'd like to buy a *REGEX* please...
Where do you want *them* to go today?
Re: What is the fastest way to extract data from a delimited string?
by sauoq (Abbot) on Jan 10, 2003 at 00:29 UTC
-sauoq
"My two cents aren't worth a dime.";
Re: What is the fastest way to extract data from a delimited string?
by dpuu (Chaplain) on Jan 10, 2003 at 00:05 UTC
One thing that might help, with split, is to use the third arg to limit the number of columns it extracts. This will be especially helpful if there are a lot of columns to the right of the one that you want.
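For example (a sketch; $line is assumed to hold one record and the wanted field is the 0-based $column):

my $column = 5;   # 6th field, counting from 0
# a limit of $column + 2 makes split stop scanning just past the wanted field
my $field = (split / /, $line, $column + 2)[$column];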
--Dave
I have many files where $column will be near the end of the string, so this won't provide much benefit.
I'm actually hoping that there is speedy regex solution for this...
Where do you want *them* to go today?
Re: What is the fastest way to extract data from a delimited string?
by runrig (Abbot) on Jan 10, 2003 at 00:18 UTC
If you only need up to the 6th column, you might try supplying the third argument to split so it will stop splitting after the 6th column. If you're on Unix or have Unix tools, and you only need the one column, you might try using cut or awk:

open(FH, 'cut -d" " -f6 file|') or die $!;
# or open(FH, "awk '{print \$6}' file|") ...
while (<FH>) {
    ...
}
close FH or die $!;   # check close on pipe opens

# Also, you might try a regex which uses qr
# and only saves what you want
my $re = qr/(?:\S+\s+){5}(\S+)/;
...
while (<FH>) {
    next unless $_ =~ $re;
    my $field = $1;
    ...
}
Updated.
Re: What is the fastest way to extract data from a delimited string?
by jmcnamara (Monsignor) on Jan 10, 2003 at 09:11 UTC
Assuming whitespace-separated fields, either of these one-liners prints the 6th column:
perl -lane 'print $F[5] if @F > 5' file
perl -lane 'print $F[5] if defined $F[5]' file
You can also specify a delimiter pattern using the -F flag. Here is an example for *simple* comma separated data:
perl -F, -lane 'print $F[5]' file
There is also the Unix cut utility and the modules Text::CSV_XS and Text::xSV.
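For comma-separated data that may contain quoted or embedded commas, a minimal Text::CSV_XS sketch might look like this (file name assumed):

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 });
open my $fh, '<', 'file' or die $!;
while (my $row = $csv->getline($fh)) {
    print "$row->[5]\n";   # 6th column, counting from 0
}
close $fh;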
And finally, since perl -lane is just a synonym for awk: ;-)
awk '{print $6}' file
--
John.
Re: What is the fastest way to extract data from a delimited string?
by thezip (Vicar) on Jan 10, 2003 at 05:12 UTC
Thank you all for your enlightening comments! I will absorb them into my personal knowledgebase. I must admit, though, that I'm a little surprised with the facts as they have been presented.
Thank you BrowserUK and cchampion for reminding me that benchmarking always wins out over gut instincts.
Where do you want *them* to go today?
Re: What is the fastest way to extract data from a delimited string?
by John M. Dlugosz (Monsignor) on Jan 10, 2003 at 16:06 UTC
If split isn't doing it for you, there seems to be no faster way in pure Perl.
You could write a filter in C. Using a memory-mapped file (or a large buffer), it should be I/O bound and take no appreciable time to locate the delimiters and copy the wanted fields to standard output.
Then call that filter as a piped open from Perl.
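The Perl side of that pipeline might look something like this (a sketch only; 'fieldcut' and its arguments are invented stand-ins for the C filter):

# read one extracted field per line from the hypothetical C filter
open(my $fh, '-|', './fieldcut', '-d', ' ', '-f', '6', 'bigfile')
    or die "cannot start filter: $!";
while (my $field = <$fh>) {
    chomp $field;
    # ... use $field ...
}
close $fh or die "filter failed (status $?)";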
—John
Re: What is the fastest way to extract data from a delimited string?
by JamesNC (Chaplain) on Jan 10, 2003 at 15:31 UTC
use strict;
my (@list) = map { /\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+(\w+)\s+.*/} <DATA>;
print "@list";
__DATA__
1 2 3 4 5 6 7 8
james joe jack janet jill joann joeanne bill bob
alice joe jack janet jill alice joeanne bill bob
james joe jack janet jill susan joeanne bill bob
james joe jack janet jill sarah joeanne bill bob
james joe jack janet jill ethel joeanne bill bob
Only using map... very brief though :)
Not very fast or maintainable, though. How about something like:
use IO::File;

sub read_stuff
{
    my ($handle, $delim, $col) = @_;

    # build the pattern from single-quoted pieces so \w and $ reach the regex intact
    my $prep = '^\s*';
    $prep .= '\w+' . $delim . '+' for 1 .. $col - 1;
    $prep .= '(\w+)(?:' . $delim . '|$)';
    my $regex = qr/$prep/;

    my @list = map { /$regex/ } <$handle>;
    return @list;
}

my $handle = IO::File->new("some_file", "r") or die "Can't open some_file: $!";
my @list   = read_stuff($handle, ' ', 6);
Remember something - "fast" is a relative term. You might squeeze an additional millisecond from your CPU, but require an extra 8 hours every time you want to make a change to your code. To me, that's not worth it. My 8 hours is worth more than one millisecond of CPU time. To me, "fast" has to do with acceptable CPU progress and lightning-quick maintenance time. YMMV.
------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means: Worry only about what you need to implement.
Re: What is the fastest way to extract data from a delimited string?
by hardburn (Abbot) on Jan 10, 2003 at 17:05 UTC
Is there any reason why you can't move this stuff into a real database instead of a flat file? If you can't install MySQL or PostgreSQL or *insert favorite RDBMS*, you can use DBD::SQLite, which embeds an RDBMS right into the module itself.
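Once the data is loaded, pulling a single column becomes a one-line query. A minimal sketch, assuming a database file data.db with a table records and a column field6 (all names invented here):

use strict;
use warnings;
use DBI;

my $dbh  = DBI->connect("dbi:SQLite:dbname=data.db", "", "", { RaiseError => 1 });
my $rows = $dbh->selectcol_arrayref("SELECT field6 FROM records");
print "$_\n" for @$rows;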
Re: What is the fastest way to extract data from a delimited string?
by Anonymous Monk on Jan 11, 2003 at 08:22 UTC