muyprofesional has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I need to parse and process a very big file. I want to use buffered reads (sysread) for speed. My problem is recovering the lines after each read: the buffer stops in the middle of a line, which is to be expected, but:

#!/usr/bin/perl -w
open my( $fh ), '<', "/usr/local/ffpde/logs/pruebas3.log";
my $buffer;
while (sysread $fh, $buffer, 100) {
    my @lines = split(/"\n"/, $buffer);
    print @lines;
    sleep 1;
}

Jul 26 10:45:25 - Sergio, 33 | Informático
Jul 26 11:45:25 - Angel, 23 | Encofrador
Jul 26 12:45:25 - Sergio, 52 | Repartidor
Jul 26 12:55:25 - Sergio, 18 | Repartidor
Jul 26 13:25:25 - Angel, 42 | Panadero
Jul 26 13:35:25 - Dario, 34 | Informático
Jul 26 15:45:25 - Luis, 26 | Repartidor

This prints the array fine, and it seems to contain the correct split. But when I load each of these lines into a scalar:

#!/usr/bin/perl -w
open my( $fh ), '<', "/usr/local/ffpde/logs/pruebas3.log";
my $buffer;
while (sysread $fh, $buffer, 100) {
    my @lines = split(/"\n"/, $buffer);
    for $l (@lines) {
        print "$l\n";
        sleep 1;
    }
}

Jul 26 10:45:25 - Sergio, 33 | Informático
Jul 26 11:45:25 - Angel, 23 | Encofrador
Jul 26 12:45:25
- Sergio, 52 | Repartidor
Jul 26 12:55:25 - Sergio, 18 | Repartidor
Jul 26 13:25:25 - Angel, 42 | P
anadero
Jul 26 13:35:25 - Dario, 34 | Informático
Jul 26 15:45:25 - Luis, 26 | Repartidor
Jul 26 16
:25:25 - Mabel, 41 | Azafata
Jul 26 17:29:25 - Laura, 19 | Investigadora
Jul 26 10:45:25 - Sergio, 3
3 | Informático
Jul 26 11:45:25 - Angel, 23 | Encofrador
Jul 26 12:45:25 - Sergio, 52 | Repartidor
Jul 26 12:55:25 - Sergio, 18 | Repartidor
Jul 26 13:25:25 - Angel, 42 | Panadero
Jul 26 13:35:25 - D
ario, 34 | Informático

It splits the line where the buffer stopped. (oops)

The buffer size doesn't matter; I used 100 bytes only so the example shows 2-3 lines per read. It's the same with 4096 bytes.

So: is there any way to avoid these cut lines when reading with buffers? What is the best way to load an array with the correct line-by-line content of the file? I must use buffers for speed; a simple open is too slow for me.

Thanks in advance monks!

Replies are listed 'Best First'.
Re: Line by line buffered read
by JavaFan (Canon) on Aug 20, 2010 at 16:09 UTC
    Why are you reading using sysread? Why can't you just read line by line? If it's the sleeps you need, just count the characters you've read, and if it exceeds 4096 (or some other number), reset your counter and sleep. I don't understand the "I must use buffers for speed".
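A minimal sketch of that suggestion, assuming the pause should happen once roughly 4096 bytes have been read. The helper name read_with_pauses and its two callbacks are hypothetical and exist only for illustration; the reading itself is an ordinary line-by-line loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read $fh line by line, pass every line to
# $on_line, and call $on_chunk each time at least $limit bytes
# have been read since the previous call.
sub read_with_pauses {
    my ($fh, $limit, $on_line, $on_chunk) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $on_line->($line);
        $count += length $line;
        if ($count >= $limit) {
            $count = 0;        # reset the counter, then pause
            $on_chunk->();
        }
    }
    return;
}
```

Calling read_with_pauses($fh, 4096, sub { print $_[0] }, sub { sleep 1 }) on an open read handle would print the file and sleep one second after about every 4096 bytes, and no line is ever cut.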
Re: Line by line buffered read
by BrimBorium (Friar) on Aug 20, 2010 at 16:34 UTC

    I usually use something like:

    use strict;
    use warnings;
    open (IN,"<file");
    my $line;
    while($line=<IN>){
        do_something_with_line();
    }
    close(IN);

    Probably nobody has advised you to read "How do I post a question effectively?", so I will, especially the part about "Use strict and warnings".

      That's kind of what I was thinking of, but I would prefer to modify the open statement to include a die statement, such as below:

      open (IN,"<",$file) || die "Unable to open file '$file': $!\n";

      Also, I would think that the sleep statements in the OP are actually "slowing" it down by making it run longer. Based on the code provided, it doesn't look like the sleep statements are needed. Of course, since I have never used sysread, I may be completely wrong about this.

        Thanks for the "or die". The sleep is only there to make the output of each print easier to see; with a big file you couldn't follow it otherwise.
      Thanks. I need buffered read for speed.

        You're not making any sense. Line by line reading (while (<>)) is buffered. sysread, on the other hand, provides no buffering.

        If you want to provide your own buffering instead of using Perl's, you could do

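        my $buf = '';
        for (;;) {
            # Append the next block to whatever partial line is left in $buf.
            my $rv = sysread($fh, $buf, BLOCK_SIZE, length($buf));
            die("sysread: $!") if !defined($rv);
            last if !$rv;   # 0 means end of file
            # Peel off and process every complete line.
            process_line($1) while $buf =~ s/^([^\n]*\n)//;
        }
        # Whatever remains is a final line with no trailing newline.
        process_line($buf) if length($buf);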

        Update: Fixed problem mentioned by ibm1620 in comment.
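For anyone who wants to try do-it-yourself buffering end to end, here is a self-contained sketch of the same carry-over idea. It is a variation, not the code above: it uses rindex and substr to hand over all complete lines of a block at once instead of peeling them off one by one with a regex. The name each_line_sysread and the block size of 4096 are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant BLOCK_SIZE => 4096;   # assumed block size; tune as needed

# Hypothetical helper: read $fh in blocks with sysread and call
# $process_line->($line) once for every line, newline included.
sub each_line_sysread {
    my ($fh, $process_line) = @_;
    my $buf = '';
    for (;;) {
        # Append the next block after any partial line left in $buf.
        my $rv = sysread($fh, $buf, BLOCK_SIZE, length $buf);
        die "sysread: $!" if !defined $rv;
        last if !$rv;                      # 0 means end of file

        # Cut off everything up to the last newline; keep the rest.
        my $end = rindex($buf, "\n");
        next if $end < 0;                  # no complete line yet
        my $complete = substr($buf, 0, $end + 1, '');

        # split /^/ breaks the text into lines and keeps the newlines.
        $process_line->($_) for split /^/, $complete;
    }
    # A final line without a trailing newline is still a line.
    $process_line->($buf) if length $buf;
    return;
}
```

With a handle opened for reading, each_line_sysread($fh, sub { print $_[0] }) should reproduce the file exactly. Whether this beats the plain readline loop is something to measure rather than assume.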

Re: Line by line buffered read
by roboticus (Chancellor) on Aug 20, 2010 at 23:32 UTC

    muyprofesional:

    You're optimizing too soon!

    First make your code *work*, then if it's not fast enough, profile it to find out what's too slow. Then, and *only* then, make it work fast. If you did this, you would've simply used normal line-by-line entry. Then you wouldn't have started down this trail, since the normal line-by-line file reading is already buffered and fast.

    Until you know what's "slow", making something faster is a waste of time. For example if you have a program that's too slow, and the file reading is taking 5% of your time, then improving the file reading will get you a 5% speed increase *at best*! You'd profit more by speeding up whatever is consuming the other 95% of your time...

    ...roboticus

Re: Line by line buffered read
by MajingaZ (Beadle) on Aug 20, 2010 at 18:28 UTC
    But it appears you aren't actually doing anything with the file? Why not just copy it?
    Or if you just want to work with the text and don't need to parse the lines of data, there is no need to split the $buffer. You're splitting on \n then basically putting them back in with the print?
    Solutions depend on what you are actually doing with the $buffer
    You could for example do a join "\n",@lines though I think you'll have to check to see if $buffer ended in /\n\z/
    Or you could just do a $buffer.=<$fh> keeping in mind that if you combine this with the previous that you'll want to output a terminating \n
    Or was this just to try and focus on the command you think is the problem? I have killed a couple of servers when using while(<>) because of all the calls to the server, and got yelled at by some IT people because I was making so many calls to the server for lines of data.
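One way to act on the "check whether $buffer ended in \n" idea is sketched below, assuming each chunk is split on /\n/ (the pattern /"\n"/ in the question also matches the literal double quotes, so it never splits a log like this one). A split limit of -1 keeps trailing empty fields, so the last field is the unfinished line, or an empty string when the chunk ended exactly on a newline. The helper name split_chunk is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given the partial line carried over from the
# previous chunk and a new chunk, return the new carry-over followed
# by the complete lines (newlines removed).
sub split_chunk {
    my ($carry, $chunk) = @_;
    # Limit -1 keeps trailing empty fields, so the last element is ''
    # exactly when the text ended in "\n".
    my @parts = split /\n/, $carry . $chunk, -1;
    # Splitting an empty string yields an empty list.
    my $rest = @parts ? pop @parts : '';
    return ($rest, @parts);
}

# Usage: perl this_script.pl /path/to/file
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!";
    my $carry = '';
    my $chunk;
    while (sysread $fh, $chunk, 4096) {
        my @lines;
        ($carry, @lines) = split_chunk($carry, $chunk);
        print "$_\n" for @lines;
    }
    # A last line with no trailing newline is still pending here.
    print "$carry\n" if length $carry;
    close $fh;
}
```

The usage loop treats a sysread error the same as end of file to keep the sketch short; real code would check for an undefined return value.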
Re: Line by line buffered read
by rowdog (Curate) on Aug 22, 2010 at 20:56 UTC

    As Ikegami pointed out, sysread is unbuffered io and I see no reason to use it here. It's much easier to use the buffered io functions and here's a simple revision of your example code that does just that.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $fname = '/var/log/Xorg.0.log';
    open my $fh, '<', $fname or die $!;
    while ( my $line = <$fh> ) {
        print $line;
        sleep 1;
    }