Re: split and sysread()
by BrowserUk (Patriarch) on Apr 19, 2003 at 09:15 UTC
When you originally asked this question at speed up one-line "sort|uniq -c" perl code, you said that you only wanted the 10th field from an unspecified maximum number. In that case, using a regex to isolate that field alone, rather than splitting them all out and then discarding all but one, was an obvious way to save some cycles. Using the sliding buffer saved some more, for an overall speed-up of about x4 in my tests.
You now appear to want fields (0,3,4,9,17,18,31), which means that the benefits of using a regex over split are considerably lessened--though there is still some saving. Using this in conjunction with the sliding buffer--two variations on the theme, with sysread_1 giving consistently the best results--and a buffer size of 64k seems to achieve the best results on my machine, with the main benefit seemingly coming from bypassing stdio.
The overall saving on my machine comes out at around 50%. Whether this will get you close to your target of 2 minutes, you will have to see once you actually do something meaningful with the fields inside the loop. If not, I think you may need quicker hardware.
The file used in the tests below is 75MB: 500_000 records x 31 pipe-delimited fields of randomly generated data.
C:\test>215578 -BUFN=16 pipes.dat
1 trial of sysread (160.090s total)
1 trial of sysread2 (182.623s total)
1 trial of stdio (324.950s total)
sysread:20000 sysread2:20000 stdio:20000
Whilst I've tried various buffer sizes, the tests are hardly definitive and you may well get better results with a different size (bigger or smaller) on your machine. Good luck.
Code
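(The actual code is behind the download link above. As a stand-in for readers who can't reach it, here is a minimal sketch of the sliding-buffer technique described in this post -- not the original code; the 64k buffer size and the field list are taken from the discussion, while the filehandle and variable names are placeholders.)
use strict;
use warnings;

my $BUFSIZE = 64 * 1024;                  # 64k gave the best results here
open my $fh, '<', $ARGV[0] or die "open: $!";

my $buffer = '';
while (sysread($fh, $buffer, $BUFSIZE, length $buffer)) {
    # Process every complete line currently in the buffer.
    while ($buffer =~ m/^([^\n]*)\n/mg) {
        my @wanted = (split /\|/, $1)[0, 3, 4, 9, 17, 18, 31];
        # ... do something meaningful with @wanted ...
    }
    # Slide: keep any trailing partial line for the next read.
    # (Assumes the file ends with a newline.)
    my $nl = rindex $buffer, "\n";
    $buffer = $nl < 0 ? $buffer : substr($buffer, $nl + 1);
}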
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
"with the main benefit seemingly coming from bypassing stdio"
I just want to clarify what that means, because to some people this may sound as if Perl IO is slow and it's always better to use sysread(). In the examples that BrowserUK++ provided, the main benefit comes from the fact that in the case of sysread(), the code looks at the data being read only once (plus a little overhead for finding that last "\n"). In the case of normal Perl IO, i.e., <FH>, every character is looked at twice: first by Perl to figure out where each line ends, then by the code itself to split everything into separate fields. That's why you're seeing a ~50% increase in performance. You can also confirm this by checking the user and system times for normal Perl IO and sysread(): you'll see that system time is pretty much the same in both cases but user time will vary, as sketched below.
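A minimal sketch of that check using Perl's built-in times() -- the file name and the per-line handling here are assumptions:
use strict;
use warnings;

my @t0 = times;                          # (user, system, ...) CPU seconds
open my $fh, '<', 'pipes.dat' or die "open: $!";
while (my $line = <$fh>) {
    my @fields = split /\|/, $line;      # every byte is scanned twice
}
close $fh;
my @t1 = times;
printf "user: %.2fs  system: %.2fs\n", $t1[0] - $t0[0], $t1[1] - $t0[1];
Run the same measurement around a sysread() loop: the system time should stay roughly constant while the user time drops.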
--perlplexer
Re: split and sysread()
by pfaut (Priest) on Apr 18, 2003 at 23:48 UTC
Are your records terminated with newlines or are they fixed length? Your call to sysread looks very wrong. If the records are terminated by newlines, use <INFILE> to read a record. If they are fixed length, the third argument to sysread should be the record length and you shouldn't use the fourth argument.
This might make the first part work. Extending it to the second part is up to you.
while ($buffer = <INFILE>) {
    chomp $buffer;
    my ($server,$ip,$api,$calls) = (split /\|/, $buffer)[0,9,3,4];
    $totalsentrycalls++;
    $count_by_sentry_server{$server}++;
    $count_by_ip{$ip}++;
    $count_by_api{$api}++;
    $count_by_api_exec{$api}{$totalsentrycalls} = $calls;
    # ...
}
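For the fixed-length case mentioned above, a hedged sketch might look like this -- the record length of 128 is purely an assumption:
my $RECLEN = 128;   # assumed record length -- substitute your own
while (sysread(INFILE, my $record, $RECLEN) == $RECLEN) {
    # Stops at EOF or on a short (partial) final record.
    my ($server,$ip,$api,$calls) = (split /\|/, $record)[0,9,3,4];
    # ... same counting as above ...
}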
90% of every Perl application is already written. -- dragonchild
They are terminated by newlines. The issue with the above is that it still reads the file one line at a time, right? That's what I am trying to get away from.
while (<INPUT>) {
    my @fields = split /\|/, $_;
    # Do stuff with particular field
}
Cheers - L~R
Update: See this node by chromatic as he was setting me straight on the very same matter.
"The issue with the above is that it still reads the file one line at a time, right? That's what I am trying to get away from."
But in order to pick out the right fields, you have to know where a line starts and stops. There's no getting around that.
You can either use sysread() to pull stuff in by blocks, then juggle blocks to handle lines that span blocks, or you can use Perl's line-by-line IO. Your call.
Well, what you had would read the next 16K of data from the file and append it to $buffer each time through the loop. Was your intent to process the file 16K at a time? In that case, you would still have to remove the fourth argument to sysread as it was causing new data to be written to the end of $buffer.
Perl will allow you to read line-at-a-time or to slurp all of the file into an array with each element containing one line from the file. If you use sysread, breaking the data into lines is up to you. You would first have to break the data into lines by splitting on '\n' and then split each record into fields by splitting on '|'. It is highly likely that your 16K read won't end at a record boundary, so you would have to add code to merge the remaining data from one read with the beginning of the next (sketched below). I'm not sure whatever you came up with would be more efficient than Perl's line mode buffering.
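A minimal sketch of that merge-the-remainder bookkeeping -- the 16K block size is from the post above; everything else is an assumption:
my $partial = '';
while (sysread(INFILE, my $block, 16 * 1024)) {
    # The -1 limit keeps a trailing empty string when the block
    # ends exactly on a newline, so $partial is handled correctly.
    my @lines = split /\n/, $partial . $block, -1;
    $partial = pop @lines;               # possibly incomplete last line
    for my $line (@lines) {
        my @fields = split /\|/, $line;
        # ... count fields as in the loop above ...
    }
}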
90% of every Perl application is already written. -- dragonchild
Re: split and sysread()
by dws (Chancellor) on Apr 18, 2003 at 23:38 UTC
Take another look. The last line of the loop, $buffer = substr($buffer, rindex($buffer, "\n"));, in conjunction with length $buffer as the fourth parameter to sysread, has the effect of grabbing any partial line from the end of the buffer and moving it to the beginning, where the next buffer load is then appended.
In this way, every line is processed as a complete line whilst benefiting from reading large chunks from the file, without resorting to slurping the whole file into memory.
See my post below for two variations of the algorithm and a benchmark.
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
Re: split and sysread()
by Abigail-II (Bishop) on Apr 18, 2003 at 23:28 UTC
Well, $0 is the name of the program, and since your while loop has a regex as its condition, with the regex having just 2 sets of parentheses, at most $1 and $2 will be set.
The code doesn't make it clear to me at all that reading in fixed-length buffers is the right approach. Why not read in one line at a time, which you then split using /[|]/ as the regex?
Abigail
I'm sorry that I didn't make myself clear. The regex above only has $1 and $2 because I deleted the references up to $31. Reading one line at a time with Perl just takes too long (1.5 million lines - 14 to 20 minutes per file) and awk can do it in about 2 minutes. I'm trying to cut the 14 minutes down to ~2 minutes as much as possible.
Well, to speed up your regex as much as possible, you must make it so that there is as little possibility for backtracking as possible. Try something like:
# Build a pattern of 31 capture groups separated by literal pipes.
my $r = join "[|]" => ("([^|]*)") x 31;
# Match one full record per iteration against the slurped input in $_.
while (/^$r\n/mg) { ... }
This (untested) code sets $1 through $31.
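An untested usage sketch, assuming the whole file has been slurped first (the filehandle and file names are assumptions):
open INFILE, '<', 'pipes.dat' or die "open: $!";
my $data = do { local $/; <INFILE> };    # slurp the whole file
my $r = join "[|]" => ("([^|]*)") x 31;
while ($data =~ /^$r\n/mg) {
    my $tenth = $10;    # e.g. the 10th field the OP originally wanted
    # ... tally $tenth, etc. ...
}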
Abigail
Re: split and sysread()
by Aristotle (Chancellor) on Apr 19, 2003 at 15:03 UTC