Re: split and sysread()
by BrowserUk (Patriarch) on Apr 19, 2003 at 09:15 UTC
When you originally asked this question at speed up one-line "sort|uniq -c" perl code, you said that you only wanted the 10th field from an unspecified maximum number. In that case, using a regex to isolate that field alone, rather than splitting them all out and then discarding all but one, was an obvious way to save some cycles. Using the sliding buffer saved some more, for an overall speed-up of about x4 in my tests.
You now appear to want fields (0,3,4,9,17,18,31), which means that the benefits of using a regex over split are considerably lessened--though there is still some saving. Using this in conjunction with the sliding buffer--two variations on the theme, with sysread_1 giving consistently the best results--and a buffer size of 64k seems to achieve the best results on my machine, with the main benefit seemingly coming from bypassing stdio.
The overall saving on my machine comes out at around 50%. Whether this will get you close to your target of 2 minutes, you will have to see once you actually do something meaningful with the fields inside the loop. If not, I think you may need quicker hardware.
The file used in the tests below is 75MB: 500_000 records x 31 pipe-delimited fields of randomly generated data.
C:\test>215578 -BUFN=16 pipes.dat
1 trial of sysread (160.090s total)
1 trial of sysread2 (182.623s total)
1 trial of stdio (324.950s total)
sysread:20000 sysread2:20000 stdio:20000
Whilst I've tried various buffer sizes, the tests are hardly definitive and you may well get better results with a different size (bigger or smaller) on your machine. Good luck.
Code
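(The actual code is behind the download link above. As a stand-in for readers who can't reach it, here is a minimal sketch of the sliding-buffer technique described in this post -- not the original code; the 64k buffer size and the field list are taken from the discussion, while the filehandle and variable names are placeholders.)
use strict;
use warnings;

my $BUFSIZE = 64 * 1024;                  # 64k gave the best results here
open my $fh, '<', $ARGV[0] or die "open: $!";

my $buffer = '';
while (sysread($fh, $buffer, $BUFSIZE, length $buffer)) {
    # Process every complete line currently in the buffer.
    while ($buffer =~ m/^([^\n]*)\n/mg) {
        my @wanted = (split /\|/, $1)[0, 3, 4, 9, 17, 18, 31];
        # ... do something meaningful with @wanted ...
    }
    # Slide: keep any trailing partial line for the next read.
    # (Assumes the file ends with a newline.)
    my $nl = rindex $buffer, "\n";
    $buffer = $nl < 0 ? $buffer : substr($buffer, $nl + 1);
}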
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
"with the main benefit seemingly coming from bypassing stdio"
I just want to clarify what that means, because to some people this may sound as if Perl IO is slow and it's always better to use sysread(). In the examples that BrowserUK++ provided, the main benefit comes from the fact that in the case of sysread(), the code looks at the data being read only once (plus a little overhead for finding that last "\n"). In the case of normal Perl IO, i.e., <FH>, every character is looked at twice: first by Perl to figure out where each line ends, then by the code itself to split everything into separate fields. That's why you're seeing a ~50% increase in performance. You can also confirm this by checking the user and system times for normal Perl IO and sysread(): you'll see that system time is pretty much the same in both cases but user time will vary, as sketched below.
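A minimal sketch of that check using Perl's built-in times() -- the file name and the per-line handling here are assumptions:
use strict;
use warnings;

my @t0 = times;                          # (user, system, ...) CPU seconds
open my $fh, '<', 'pipes.dat' or die "open: $!";
while (my $line = <$fh>) {
    my @fields = split /\|/, $line;      # every byte is scanned twice
}
close $fh;
my @t1 = times;
printf "user: %.2fs  system: %.2fs\n", $t1[0] - $t0[0], $t1[1] - $t0[1];
Run the same measurement around a sysread() loop: the system time should stay roughly constant while the user time drops.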
--perlplexer
Re: split and sysread()
by pfaut (Priest) on Apr 18, 2003 at 23:48 UTC
Are your records terminated with newlines or are they fixed length? Your call to sysread looks very wrong. If the records are terminated by newlines, use <INFILE> to read a record. If they are fixed length, the third argument to sysread should be the record length and you shouldn't use the fourth argument.
This might make the first part work. Extending it to the second part is up to you.
while ($buffer = <INFILE>) {
    chomp $buffer;
    my ($server,$ip,$api,$calls) = (split /\|/, $buffer)[0,9,3,4];
    $totalsentrycalls++;
    $count_by_sentry_server{$server}++;
    $count_by_ip{$ip}++;
    $count_by_api{$api}++;
    $count_by_api_exec{$api}{$totalsentrycalls} = $calls;
    # ...
}
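For the fixed-length case mentioned above, a hedged sketch might look like this -- the record length of 128 is purely an assumption:
my $RECLEN = 128;   # assumed record length -- substitute your own
while (sysread(INFILE, my $record, $RECLEN) == $RECLEN) {
    # Stops at EOF or on a short (partial) final record.
    my ($server,$ip,$api,$calls) = (split /\|/, $record)[0,9,3,4];
    # ... same counting as above ...
}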
90% of every Perl application is already written. -- dragonchild
They are terminated by newlines. The issue with the above is that it still reads the file one line at a time, right? That's what I am trying to get away from.
while (<INPUT>) {
    my @fields = split /\|/, $_;
    # Do stuff with particular field
}
Cheers - L~R
Update: See this node by chromatic as he was setting me straight on the very same matter.
"The issue with the above is that it still reads the file one line at a time, right? That's what I am trying to get away from."
But in order to pick out the right fields, you have to know where a line starts and stops. There's no getting around that.
You can either use sysread() to pull stuff in by blocks, then juggle blocks to handle lines that span blocks, or you can use Perl's line-by-line IO. Your call.
Well, what you had would read the next 16K of data from the file and append it to $buffer each time through the loop. Was your intent to process the file 16K at a time? In that case, you would still have to remove the fourth argument to sysread as it was causing new data to be written to the end of $buffer.
Perl will allow you to read line-at-a-time or to slurp all of the file into an array with each element containing one line from the file. If you use sysread, breaking the data into lines is up to you. You would first have to break the data into lines by splitting on '\n' and then split each record into fields by splitting on '|'. It is highly likely that your 16K read won't end at a record boundary, so you would have to add code to merge the remaining data from one read with the beginning of the next (sketched below). I'm not sure whatever you came up with would be more efficient than Perl's line mode buffering.
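A minimal sketch of that merge-the-remainder bookkeeping -- the 16K block size is from the post above; everything else is an assumption:
my $partial = '';
while (sysread(INFILE, my $block, 16 * 1024)) {
    # The -1 limit keeps a trailing empty string when the block
    # ends exactly on a newline, so $partial is handled correctly.
    my @lines = split /\n/, $partial . $block, -1;
    $partial = pop @lines;               # possibly incomplete last line
    for my $line (@lines) {
        my @fields = split /\|/, $line;
        # ... count fields as in the loop above ...
    }
}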
90% of every Perl application is already written. -- dragonchild
Re: split and sysread()
by dws (Chancellor) on Apr 18, 2003 at 23:38 UTC
Take another look. The last line of the loop, $buffer = substr($buffer, rindex($buffer, "\n"));, in conjunction with length $buffer as the fourth parameter to sysread, has the effect of grabbing any partial line from the end of the buffer and moving it to the beginning, where the next buffer load is then appended.
In this way, every line is processed as a complete line whilst benefiting from reading large chunks from the file, without resorting to slurping the whole file into memory.
See my post below for two variations of the algorithm and a benchmark.
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
Re: split and sysread()
by Abigail-II (Bishop) on Apr 18, 2003 at 23:28 UTC
Well, $0 is the name of the program, and since your while loop has a regex as its condition, with the regex having just 2 sets of parentheses, at most $1 and $2 will be set.
The code doesn't make it clear to me at all that reading in fixed-length buffers is the right approach. Why not read in one line at a time, which you then split using /[|]/ as the regex?
Abigail
I'm sorry that I didn't make myself clear. The regex above only has $1 and $2 because I deleted the references up to $31. Reading one line at a time with Perl just takes too long (1.5 million lines - 14 to 20 minutes per file) and awk can do it in about 2 minutes. I'm trying to cut the 14 minutes down to ~2 minutes as much as possible.
Well, to speed up your regex as much as possible, you must make it so that there is as little possibility for backtracking as possible. Try something like:
# Build a pattern of 31 capture groups separated by literal pipes.
my $r = join "[|]" => ("([^|]*)") x 31;
# Match one full record per iteration against the slurped input in $_.
while (/^$r\n/mg) { ... }
This (untested) code sets $1 through $31.
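An untested usage sketch, assuming the whole file has been slurped first (the filehandle and file names are assumptions):
open INFILE, '<', 'pipes.dat' or die "open: $!";
my $data = do { local $/; <INFILE> };    # slurp the whole file
my $r = join "[|]" => ("([^|]*)") x 31;
while ($data =~ /^$r\n/mg) {
    my $tenth = $10;    # e.g. the 10th field the OP originally wanted
    # ... tally $tenth, etc. ...
}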
Abigail
Re: split and sysread()
by Aristotle (Chancellor) on Apr 19, 2003 at 15:03 UTC