Re: Reading a huge input line in parts (Handles multi-digit numbers!)
by BrowserUk (Patriarch) on May 04, 2015 at 13:54 UTC
sub genBufferedGetNum {
    my @buf = do { local $/ = \4096; split ' ', scalar <> };
    my $leftover = pop @buf;
    return sub {
        unless( @buf ) {
            unless( eof ) {
                @buf = do { local $/ = \4096; split ' ', $leftover . <> };
                $leftover = pop @buf;
            }
            else {
                die 'premature eof' if $leftover != 0;
                return $leftover;   # last number
            }
        }
        return shift @buf;
    };
}

my $getNum = genBufferedGetNum();
while( my $num = $getNum->() ) {
    ## do stuff
}
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
sub genBufferedGetNum {
    return sub {
        my @buf = do { local $/ = \10; split ' ', <> };
        return @buf;
    };
}

my $getNum = genBufferedGetNum();
while( my @part = $getNum->() ) {
    print @part, "\n";
}
I shortened the buffer size for testing purposes.
$: cat tb.dat
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
$:
$: cat tb.dat | perl tb.pl
01234
56789
01234
56789
01234
56789
0
Try your code with this input:
123 456 789 1
It produces: 123 456 78 9 1
and doesn't notice that the last number is supposed to be 0.
Re: Reading a huge input line in parts
by hdb (Monsignor) on May 04, 2015 at 14:08 UTC
What is wrong with the last number? Is it ignored? Then undo your setting of $/ and read again. Does it have a newline? Then use a regex to get rid of it. In any case, I would think you should chomp your input to get rid of the blanks.
Update: try adding s/\s//g before your call to do_something.
The last number is not read unless eof is encountered. I can't undo the setting of $/ and read again because I have no way of detecting the last number. If I undo it midway I would get the rest of the line, which could be enormous.
use strict;
use warnings;

sub do_something { print '{', $_[0], "}\n" }

{
    local $/ = ' ';
    while (<>) {
        do_something($_);
    }
}
do_something(<>);
Re: Reading a huge input line in parts
by flexvault (Monsignor) on May 04, 2015 at 20:08 UTC
Hello kroach,
I tried to compare two ways of doing this, and clearly letting Perl do the buffering wins out; but with the size of your line, you may want to look at the second subroutine, 'getnum_new', for how to do partial reads from the file. I think both will work for your requirement (memory allowing). Reading a line at a time was about 4-6 times faster.
use strict;
use warnings;
use Benchmark qw(:all);
our ( $eof, $buffer );
# Build a file for testing!
open ( my $data, ">", "./slurp.txt" ) || die "$!";
for my $lines ( 0..10 )
{ my $unit = '';
for my $nos ( 0..30)
{ $unit .= int( rand(3000) ) . " "; # simulate keys
}
$unit .= $lines; # make sure last doesn't have space.
print $data "$unit\n";
}
close $data;
my $sa = &getnum1;
my $sb = &getnum2;
# print "sa|$sa\n\nsb|$sb\n"; exit;
if ( $sa ne $sb ) { print "Didn't Work!\n"; exit(1); }
timethese ( -9 ,
{
case1 => sub { &getnum1 },
case2 => sub { &getnum2 },
},
);
sub getnum1
{ my $s1 = '';
open ( my $data, "<", "./slurp.txt" ) || die "$!";
while ( my $line = <$data> )
{ chomp( $line );
my @ar = split( /\ /, $line );
for ( 0..$#ar ) { $s1 .= "$ar[$_],"; }
}
close $data;
return $s1;
}
sub getnum2
{ my $s2 = ''; $eof = 0;
open ( my $inp, "<", "./slurp.txt" ) || die "$!";
while ( 1 )
{ $s2 .= getnum_new( \$inp ) . ',';
if ( $eof ) { chop $s2; last; }
}
close $inp;
return $s2;
}
sub getnum_new
{ my $file = shift; my $ret = ''; our $eof; our $buffer;
while( 1 )
{ if ( ! $buffer )
{ my $size = read ( $$file, $buffer, 1024 );
if ( $size == 0 ) { $eof = 1; return $ret; }
}
my $val = substr( $buffer,0,1,'');
if ( ( $val eq ' ' )||( $val eq "\n" ) ) { return $ret; }
$ret .= $val;
}
}
That's one long line :-)
Regards...Ed
"Well done is better than well said." - Benjamin Franklin
Re: Reading a huge input line in parts
by aaron_baugher (Curate) on May 04, 2015 at 13:39 UTC
You could probably gain quite a bit of speed by reading in chunks of the line instead of one character at a time. That way you can use the normal split function. Something like this, but with as large a buffer value as your system can handle well:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;

my $l;                  # chunk of a line
my $tiny_buffer = 8;    # tiny buffer for testing

while( read DATA, $l, $tiny_buffer ){
    for (split ' ', $l){
        if( $_ eq '0' ){
            say 'Reached the end';
            exit;
        }
        say "; $_ ;";   # do stuff with the digit
    }
}
__DATA__
1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 0
Aaron B.
Available for small or large Perl jobs and *nix system administration; see my home node.
I thought about using read; however, since the numbers are not of constant length, a single number could be split between two chunks. This would introduce additional complexity to detect and merge such split numbers.
I should've included such examples in the sample input from the start; I've updated the question.
It should not be too costly in terms of resources and performance to check whether you have a space at the beginning and at the end of each chunk of data before splitting it, and to reconstruct the boundary numbers accordingly, especially if your read chunks are relatively large.
In that case, I'd check the end of the buffer for digits, and if there are any, trim them off and save them to prepend to the next buffer that you read in. But you don't want to do that if it's the final 0 in the file, so I have some if statements in here. There's probably a more elegant way to do some of this, but I think this will handle it correctly:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;

my $l;                  # chunk of a line
my $tiny_buffer = 8;    # tiny buffer for testing
my $leftover = '';      # leftover, possibly partial number at end of buffer

while ( read DATA, $l, $tiny_buffer ) {
    $l = $leftover . $l;
    say " ;$l;";
    $leftover = '';
    if( $l =~ s/(\d+)$// ){
        if( $1 == 0 ){
            $l .= '0';
            $leftover = '';
        } else {
            $leftover = $1;
        }
    }
    for (split ' ', $l) {
        if ( $_ == 0 ) {
            say 'Reached a zero';
        } else {
            say "; $_ ;";   # process a number
        }
    }
}
__DATA__
1 2 3 4 5 6 7 8 99 1 2 3 4 5 6 7 8 9 0
1 22 3 4 5 6 7 8 99 1 2 3 4 5 6 77 8 9 0
Re: Reading a huge input line in parts
by hdb (Monsignor) on May 04, 2015 at 16:54 UTC
You say you cannot afford to slurp and split, can you afford to slurp? Then use a regex to extract the digits one by one.
my $all = <>;
do_something($1) while $all =~ /(\d+)/g;
Re: Reading a huge input line in parts
by CountZero (Bishop) on May 04, 2015 at 21:25 UTC
I get a different result when using a space as the delimiter. The zero at the end of the line gets recognized OK, but it is the first figure on the next line that gets skipped. So this small test program takes care of that problem:
use Modern::Perl qw/2014/;
{
    local $/ = ' ';
    while (<DATA>) {
        chomp;
        if (/^0\n*$/) {
            say "0 - End of line";
            next;
        }
        elsif (/^0\n(\d+)$/) {
            say "0 - End of line";
            say ">$1<";
            next;
        }
        else {
            say ">$_<";
        }
    }
}
__DATA__
1 34 282716 7 20 333333 91 0
23 68 82629172 112 8271718 102 1 0
7 211 2 123 0 99 666 0
Output:
>1<
>34<
>282716<
>7<
>20<
>333333<
>91<
0 - End of line
>23<
>68<
>82629172<
>112<
>8271718<
>102<
>1<
0 - End of line
>7<
>211<
>2<
>123<
0 - End of line
>99<
>666<
0 - End of line
As you can see, a single zero is recognized as an end-of-line marker, even when not physically at the end of a line.
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
Re: Reading a huge input line in parts
by CountZero (Bishop) on May 04, 2015 at 18:33 UTC
Just out of sheer curiosity: how long is very long?
The lines in question can be up to 2 700 000 000 000 000 characters.
Given quantities of that magnitude, and the relative simplicity of the task (breaking the stream into a sequence of numerics), I'd say it's worthwhile to write an application in C and compile it.
It would be a short and easy program to write, esp. as a stdin-stdout filter: it's just a while loop that reads a nice size char buffer (say, a few MB at a time), and steps through the buffer one character at a time, accumulating consecutive digit characters, and outputting the string of digits every time you encounter a non-digit character. It wouldn't be more than 20 lines of C code, if that, and you'll save a lot of run-time.
I suppose there must be more to your overall process than just splitting into digit strings; you could still do that extra part of your process in perl, but have the perl script read from the output of the C program. (But again, given the quantity of data, if the other stuff can be done in C without too much trouble, I'd do that.)
UPDATE: Okay, I admit I was wrong about how many lines of C it would take. This C program turned out to be 30 lines, not the 26 I first claimed (not counting the 4 blank lines added for legibility):
(2nd update: added four more lines at the end to handle the case where the last char in the stream happens to be a digit.)
Re: Reading a huge input line in parts
by pme (Monsignor) on May 04, 2015 at 13:53 UTC
This one may solve the problem, but the performance can be poor.
use strict;
use warnings;

sub do_something { print '{', $_[0], "}\n" }

local $/ = ' ';
while (<DATA>) {
    s/\n/ /;                # replace end-of-line with space
    my @a = split(' ');     # split line at spaces
    do_something($_) for @a;
}
__DATA__
1 2 3 4 5 0
6 7 8 9 10 0
The performance on that may not be as bad as you think. I tried benchmarking my read-by-chunks solution against a change-the-input-record-separator-to-space solution. The latter makes the code much simpler, since the only special thing you have to watch for is the newlines. But it was also a bit quicker:
$ perl 1125570a.pl
Rate read_buffer change_irs
read_buffer 1.15/s -- -33%
change_irs 1.72/s 50% --
$ cat 1125570a.pl
#!/usr/bin/env perl
use Modern::Perl;
use Benchmark qw(:all);
# setup long multiline strings with lines ending in 0
my $line1 = join ' ', (map { int(rand()*100) } 1..1000000), 0;
$line1 =~ s/ 0 / 0\n/g;
my $line2 = $line1;
cmpthese( 10, {
'read_buffer' => \&read_buffer,
'change_irs' => \&change_irs,
});
sub read_buffer {
my $l; # chunk of a line
my $tiny_buffer = 1000000; # buffer size of chunks
my $leftover = '';     # leftover, possibly partial number at end of buffer
open my $in, '<', \$line1;
while ( read $in, $l, $tiny_buffer ) {
$l = $leftover . $l;
# say " ;$l;";
$leftover = '';
if ( $l =~ s/(\d+)$//g ) {
if ( $1 == 0 ) {
$l .= '0';
$leftover = '';
} else {
$leftover = $1;
}
}
for (split ' ', $l) {
if ( $_ == 0 ) {
# say 'Reached a zero';
} else {
# say "; $_ ;"; # process a number
}
}
}
}
sub change_irs {
open my $in, '<', \$line2;
local $/ = ' ';
while ( <$in> ) {
# say " $_";
if ( $_ =~ /0\n(\d+)/ ) {
# say 'Reached a zero';
# say "; $1 ;"; # process a number
} elsif ( $_ == 0){
# say 'Reached a zero';
} else {
# say "; $_ ;"; # process a number
}
}
}
The larger the buffer you can use on the read_buffer solution, the faster it should be, I think, but I don't know if it would ever catch up to the $/=' ' solution. Considering how much clearer that one's code is, I think it wins.
EDIT: It also occurs to me that reading the file from disc might make a difference, if the RS=space solution causes more disc reads. I'd think OS buffering would prevent that, but I don't know for sure. You'd want to benchmark that with your actual situation.
This is no different from my first approach. Replacing the newline here happens only after the data is read, so it doesn't change anything: since $/ was changed, the newline is just like any other character. If there were a way to treat a newline in the input as a space, or to set $/ to "\s", that would help.
Re: Reading a huge input line in parts
by Anonymous Monk on May 05, 2015 at 03:54 UTC
use 5.014;
$/ = \8192;

while (<>) {    # like so
    state $buf .= $_ . ' ' x eof;
    $buf =~ s{ \s* (\S+) \s }{ process($1), "" }xge;
}

while (<>) {    # ..or so
    state $buf .= $_ . ' ' x eof;
    $buf = pop( my $tok = [ split ' ', $buf, -1 ] );  # pop on a reference: 5.14 experimental autoderef
    process(@$tok);
}

sub process { say for @_ }
The ' 'x eof may be omitted if \n endings are guaranteed.