Re: Reading a huge input line in parts (Handles multi-digit numbers!)
by BrowserUk (Patriarch) on May 04, 2015 at 13:54 UTC
sub genBufferedGetNum {
    my @buf = do { local $/ = \4096; split ' ', scalar <> };
    my $leftover = pop @buf;
    return sub {
        unless( @buf ) {
            unless( eof ) {
                @buf = do { local $/ = \4096; split ' ', $leftover . <> };
                $leftover = pop @buf;
            }
            else {
                die 'premature eof' if $leftover != 0;
                return $leftover;   # last number
            }
        }
        return shift @buf;
    };
}

my $getNum = genBufferedGetNum();
while( my $num = $getNum->() ) {
    ## do stuff
}
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
sub genBufferedGetNum {
    return sub {
        my @buf = do { local $/ = \10; split ' ', <> };
        return @buf;
    };
}

my $getNum = genBufferedGetNum();
while( my @part = $getNum->() ) {
    print @part, "\n";
}
I shortened the buffer size for testing purposes.
$: cat tb.dat
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
$:
$: cat tb.dat | perl tb.pl
01234
56789
01234
56789
01234
56789
0
Try your code with this input:
123 456 789 1
It produces: 123 456 78 9 1
and doesn't notice that the last number is supposed to be 0.
Re: Reading a huge input line in parts
by hdb (Monsignor) on May 04, 2015 at 14:08 UTC
What is wrong with the last number? Is it ignored? Then undo your setting of $/ and read again. Does it have a newline? Then use a regex to get rid of it. In any case, I would think you should chomp your input to get rid of the blanks.
Update: try adding s/\s//g before your call to do_something.
The last number is not read unless eof is encountered. I can't undo the setting of $/ and read again because I have no way of detecting the last number. If I undo it midway I would get the rest of the line, which could be enormous.
use strict;
use warnings;

sub do_something { print '{', $_[0], "}\n" }

{
    local $/ = ' ';
    while (<>) {
        do_something($_);
    }
}
do_something(<>);
Re: Reading a huge input line in parts
by flexvault (Monsignor) on May 04, 2015 at 20:08 UTC
Hello kroach,
I tried to compare two ways of doing this, and clearly letting Perl do the buffering wins out; but with the size of your line, you may want to look at the second subroutine, 'getnum_new', for how to do partial reads from the file. I think both will work for your requirement (memory allowing). Reading a line at a time was about 4-6 times faster.
use strict;
use warnings;
use Benchmark qw(:all);
our ( $eof, $buffer );
# Build a file for testing!
open ( my $data, ">", "./slurp.txt" ) || die "$!";
for my $lines ( 0..10 )
{ my $unit = '';
for my $nos ( 0..30)
{ $unit .= int( rand(3000) ) . " "; # simulate keys
}
$unit .= $lines; # make sure last doesn't have space.
print $data "$unit\n";
}
close $data;
my $sa = &getnum1;
my $sb = &getnum2;
# print "sa|$sa\n\nsb|$sb\n"; exit;
if ( $sa ne $sb ) { print "Didn't Work!\n"; exit(1); }
timethese ( -9 ,
{
case1 => sub { &getnum1 },
case2 => sub { &getnum2 },
},
);
sub getnum1
{ my $s1 = '';
open ( my $data, "<", "./slurp.txt" ) || die "$!";
while ( my $line = <$data> )
{ chomp( $line );
my @ar = split( /\ /, $line );
for ( 0..$#ar ) { $s1 .= "$ar[$_],"; }
}
close $data;
return $s1;
}
sub getnum2
{ my $s2 = ''; $eof = 0;
open ( my $inp, "<", "./slurp.txt" ) || die "$!";
while ( 1 )
{ $s2 .= getnum_new( \$inp ) . ',';
if ( $eof ) { chop $s2; last; }
}
close $inp;
return $s2;
}
sub getnum_new
{ my $file = shift; my $ret = ''; our $eof; our $buffer;
while( 1 )
{ if ( ! $buffer )
{ my $size = read ( $$file, $buffer, 1024 );
if ( $size == 0 ) { $eof = 1; return $ret; }
}
my $val = substr( $buffer,0,1,'');
if ( ( $val eq ' ' )||( $val eq "\n" ) ) { return $ret; }
$ret .= $val;
}
}
That's one long line :-)
Regards...Ed
"Well done is better than well said." - Benjamin Franklin
Re: Reading a huge input line in parts
by aaron_baugher (Curate) on May 04, 2015 at 13:39 UTC
You could probably gain quite a bit of speed by reading in chunks of the line instead of one character at a time. That way you can use the normal split function. Something like this, but with as large a buffer value as your system can handle well:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;

my $l;                  # chunk of a line
my $tiny_buffer = 8;    # tiny buffer for testing

while( read DATA, $l, $tiny_buffer ){
    for (split ' ', $l){
        if( $_ eq '0' ){
            say 'Reached the end';
            exit;
        }
        say "; $_ ;";   # do stuff with the digit
    }
}
__DATA__
1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 0
Aaron B.
Available for small or large Perl jobs and *nix system administration; see my home node.
I thought about using read; however, since the numbers are not of constant length, a single number could be split between two chunks. This would introduce additional complexity to detect and merge such split numbers.
I should've included such examples in the sample input from the start; I've updated the question.
It should not be too costly in terms of resources and performance to check whether you have a space at the beginning and at the end of each chunk of data before splitting it, and to reconstruct the boundary numbers accordingly, especially if your read chunks are relatively large.
In that case, I'd check the end of the buffer for digits, and if there are any, trim them off and save them to prepend to the next buffer that you read in. But you don't want to do that if it's the final 0 in the file, so I have some if statements in here. There's probably a more elegant way to do some of this, but I think this will handle it correctly:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;

my $l;                  # chunk of a line
my $tiny_buffer = 8;    # tiny buffer for testing
my $leftover = '';      # leftover, possibly partial number at end of buffer

while ( read DATA, $l, $tiny_buffer ) {
    $l = $leftover . $l;
    say " ;$l;";
    $leftover = '';
    if( $l =~ s/(\d+)$// ){
        if( $1 == 0 ){
            $l .= '0';
            $leftover = '';
        } else {
            $leftover = $1;
        }
    }
    for (split ' ', $l) {
        if ( $_ == 0 ) {
            say 'Reached a zero';
        } else {
            say "; $_ ;";   # process a number
        }
    }
}
__DATA__
1 2 3 4 5 6 7 8 99 1 2 3 4 5 6 7 8 9 0
1 22 3 4 5 6 7 8 99 1 2 3 4 5 6 77 8 9 0
Re: Reading a huge input line in parts
by hdb (Monsignor) on May 04, 2015 at 16:54 UTC
You say you cannot afford to slurp and split, can you afford to slurp? Then use a regex to extract the digits one by one.
my $all = <>;
do_something($1) while $all =~ /(\d+)/g;
Re: Reading a huge input line in parts
by CountZero (Bishop) on May 04, 2015 at 21:25 UTC
I get a different result when using a space as the delimiter. The zero at the end of the line gets recognized OK, but it is the first figure on the next line that gets skipped. So this small test program takes care of that problem:
use Modern::Perl qw/2014/;
{
    local $/ = ' ';
    while (<DATA>) {
        chomp;
        if (/^0\n*$/) {
            say "0 - End of line";
            next;
        }
        elsif (/^0\n(\d+)$/) {
            say "0 - End of line";
            say ">$1<";
            next;
        }
        else {
            say ">$_<";
        }
    }
}
__DATA__
1 34 282716 7 20 333333 91 0
23 68 82629172 112 8271718 102 1 0
7 211 2 123 0 99 666 0
Output:
>1<
>34<
>282716<
>7<
>20<
>333333<
>91<
0 - End of line
>23<
>68<
>82629172<
>112<
>8271718<
>102<
>1<
0 - End of line
>7<
>211<
>2<
>123<
0 - End of line
>99<
>666<
0 - End of line
As you can see, a single zero is recognized as an end-of-line marker, even when not physically at the end of a line.
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
Re: Reading a huge input line in parts
by CountZero (Bishop) on May 04, 2015 at 18:33 UTC
Just out of sheer curiosity: how long is very long?
The lines in question can be up to 2 700 000 000 000 000 characters.
Given quantities of that magnitude, and the relative simplicity of the task (breaking the stream into a sequence of numerics), I'd say it's worthwhile to write an application in C and compile it.
It would be a short and easy program to write, esp. as a stdin-stdout filter: it's just a while loop that reads a nice size char buffer (say, a few MB at a time), and steps through the buffer one character at a time, accumulating consecutive digit characters, and outputting the string of digits every time you encounter a non-digit character. It wouldn't be more than 20 lines of C code, if that, and you'll save a lot of run-time.
I suppose there must be more to your overall process than just splitting into digit strings; you could still do that extra part of your process in perl, but have the perl script read from the output of the C program. (But again, given the quantity of data, if the other stuff can be done in C without too much trouble, I'd do that.)
UPDATE: Okay, I admit I was wrong about how many lines of C it would take. This C program turned out to be 30 lines, not the 26 I first claimed (not counting the 4 blank lines added for legibility):
(2nd update: added four more lines at the end to handle the case where the last char in the stream happens to be a digit.)
Re: Reading a huge input line in parts
by pme (Monsignor) on May 04, 2015 at 13:53 UTC
This one may solve the problem, but the performance can be poor.
use strict;
use warnings;

sub do_something { print '{', $_[0], "}\n" }

local $/ = ' ';
while (<DATA>) {
    s/\n/ /;                # replace end-of-line with space
    my @a = split(' ');     # split line at spaces
    do_something($_) for @a;
}
__DATA__
1 2 3 4 5 0
6 7 8 9 10 0
The performance on that may not be as bad as you think. I tried benchmarking my read-by-chunks solution against a change-the-input-record-separator-to-space solution. The latter makes the code much simpler, since the only special thing you have to watch for is the newlines. But it was also a bit quicker:
$ perl 1125570a.pl
Rate read_buffer change_irs
read_buffer 1.15/s -- -33%
change_irs 1.72/s 50% --
$ cat 1125570a.pl
#!/usr/bin/env perl
use Modern::Perl;
use Benchmark qw(:all);
# setup long multiline strings with lines ending in 0
my $line1 = join ' ', (map { int(rand()*100) } 1..1000000), 0;
$line1 =~ s/ 0 / 0\n/g;
my $line2 = $line1;
cmpthese( 10, {
'read_buffer' => \&read_buffer,
'change_irs' => \&change_irs,
});
sub read_buffer {
my $l; # chunk of a line
my $tiny_buffer = 1000000; # buffer size of chunks
my $leftover = '';     # leftover, possibly partial number at end of buffer
open my $in, '<', \$line1;
while ( read $in, $l, $tiny_buffer ) {
$l = $leftover . $l;
# say " ;$l;";
$leftover = '';
if ( $l =~ s/(\d+)$//g ) {
if ( $1 == 0 ) {
$l .= '0';
$leftover = '';
} else {
$leftover = $1;
}
}
for (split ' ', $l) {
if ( $_ == 0 ) {
# say 'Reached a zero';
} else {
# say "; $_ ;"; # process a number
}
}
}
}
sub change_irs {
open my $in, '<', \$line2;
local $/ = ' ';
while ( <$in> ) {
# say " $_";
if ( $_ =~ /0\n(\d+)/ ) {
# say 'Reached a zero';
# say "; $1 ;"; # process a number
} elsif ( $_ == 0){
# say 'Reached a zero';
} else {
# say "; $_ ;"; # process a number
}
}
}
The larger the buffer you can use on the read_buffer solution, the faster it should be, I think, but I don't know if it would ever catch up to the $/=' ' solution. Considering how much clearer that one's code is, I think it wins.
EDIT: It also occurs to me that reading the file from disc might make a difference, if the RS=space solution causes more disc reads. I'd think OS buffering would prevent that, but I don't know for sure. You'd want to benchmark that with your actual situation.
This is no different from my first approach. Replacing the newline here happens only after the data is read, so it doesn't change anything: since $/ was changed, the newline is just like any other character. If there were a way to treat a newline in the input as a space, or to set $/ to "\s", that would help.
Re: Reading a huge input line in parts
by Anonymous Monk on May 05, 2015 at 03:54 UTC
use 5.014;
$/ = \8192;

while (<>) {    # like so
    state $buf .= $_ . ' ' x eof;
    $buf =~ s{ \s* (\S+) \s }{ process($1), "" }xge;
}

while (<>) {    # ..or so
    state $buf .= $_ . ' ' x eof;
    $buf = pop( my $tok = [ split ' ', $buf, -1 ] );  # pop on a reference: 5.14 experimental autoderef
    process(@$tok);
}

sub process { say for @_ }
The ' 'x eof may be omitted if \n endings are guaranteed.