Re: Reading files, skipping very long lines...
by Limbic~Region (Chancellor) on Sep 29, 2005 at 17:09 UTC
Excalibor,
The normal solution to this would be
while ( <FILE> ) {
    next if length() > 1024 * 1024;
    # ...
}
This will not work for you because you have some lines that can't be read into memory before you can determine how long they are. The following steps should outline the process for you.
1. Set $/ to a reference to your maximum desired length ($/ = \$max) so each read returns at most that many bytes
2. Read a chunk from the file: my $buffer = <FH>
3. If the resulting buffer contains a newline:
   - Extract everything up to and including the newline, leaving the remainder in the buffer
   - Repeat until there are no newlines remaining in the buffer
   - Go to step 2, prepending what is left of the old buffer to the new chunk
4. If there is no newline, it means the line is too long:
   - Empty the current buffer and read another chunk
   - Repeat until a newline is detected
   - Once detected, drop everything in the buffer up to and including the first newline
   - Go to step 2, prepending what is left of the old buffer to the new chunk
5. Wash, rinse, repeat until the file has been exhausted
I would mock up an implementation for you but I don't have time ATM.
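Update: here is a rough, untested sketch of those steps. $max, FH and process() are only placeholders for your real limit, filehandle and per-line work.
my $max = 1024 * 1024;        # longest line you are willing to keep
local $/ = \$max;             # <FH> now returns at most $max bytes per read

my $buffer   = '';
my $skipping = 0;             # true while discarding an over-long line

while ( defined( my $chunk = <FH> ) ) {
    $buffer .= $chunk;

    # Pull every complete line out of the buffer
    while ( ( my $pos = index( $buffer, "\n" ) ) >= 0 ) {
        my $line = substr( $buffer, 0, $pos + 1, '' );
        if ( $skipping ) {
            $skipping = 0;    # tail end of a long line - throw it away
        }
        else {
            process( $line ); # placeholder for your per-line work
        }
    }

    # Still no newline and we are at the limit: this line is too long
    if ( length( $buffer ) >= $max ) {
        $skipping = 1;
        $buffer   = '';
    }
}

# A final line with no trailing newline
process( $buffer ) if length( $buffer ) and not $skipping;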
Thanks for the advice...
I actually tried to do this (using read() instead of $/, though I suspect it amounts to basically the same thing). It works, but the problem then is time. I am processing the file in real time, and it was taking ages (literally!) to read that 380+ MB long line...
Better explained: a process inserts lines into a file, and I am processing it. Somehow, it inserts a 380+Mb long line, and I want to skip it, and wait for the next... Maybe going really low level and playing with IPC would do the trick... I gotta go now, but will think on it tomorrow...
Conclusion: the method works, but it's too slow... I need a way to skip the line completely, and wait for a new line to be inserted into the file. (I wanna croak my $brain)
Thanks for your help, fellow monks!
-- our $Perl6 is Fantastic;
G'day Excalibor,
All the suggestions so far have been fantastic, and it sounds like all you really need now is a very-fast 'discard line' subroutine.
Be aware that regardless of how efficient your code may be, you'll be limited by the speed of the I/O operations provided by your operating system. If you've got to read 380Mb from disk, that's going to take some time regardless of how you process it.
If possible, set your program running and take a look at what your system is doing. If you're on a unix-flavoured system, then top and time can help a lot. If you're hitting 100% CPU usage, and a lot of that is in userland time, then a tighter reading loop may help. If you're not seeing 100% CPU usage, or you're seeing a very high amount of system time, then you're probably I/O bound. You'll need faster disks, hardware, and/or filesystems for your program's performance to improve.
Assuming that you are CPU bound, you can potentially write your 'discard line' subroutine in C, which allows it to be very fast and compact. Here's an example using Inline::C
use Inline 'C';

# Example: skip a line of input from STDIN:
skip_line();

# Look! The next line is read fine by Perl.
print scalar <STDIN>;

__END__
__C__
/* Read (and discard) characters until we find a newline.
 * NOTE: stops quietly if it hits EOF before finding a
 * newline. Caveat lector. */
void skip_line() {
    int c;
    while( (c = getchar()) != '\n' && c != EOF ) { }
}
I haven't benchmarked that, but it should be both very memory efficient and fast. Be aware of what happens if skip_line() hits EOF before a newline: as written it simply stops and returns without telling you, so unless you're very sure of your input file you'll want to improve upon the sample code provided here.
If you do benchmark, keep in mind that caching (by the CPU, and especially the operating system's filesystem cache) may make a significant difference to your end results.
All the very best,
Maybe you could try replacing the output file with a named pipe (man mkfifo). The program outputs to the pipe, and you have a filter program read from the pipe.
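For example, a minimal, untested sketch (the fifo path is made up; the producing program would need to be pointed at it instead of the regular file):
use strict;
use warnings;
use POSIX qw(mkfifo);

my $fifo = '/tmp/filtered_output';      # made-up path

unless ( -p $fifo ) {
    mkfifo( $fifo, 0600 ) or die "mkfifo $fifo: $!";
}

# Point the producing program at $fifo instead of the regular file.
# This reader then sees each line as it is written and nothing piles
# up on disk; you would still guard against over-long lines with one
# of the chunked-read approaches in this thread.
open my $fh, '<', $fifo or die "open $fifo: $!";
while ( my $line = <$fh> ) {
    print $line;
}
close $fh;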
Or, depending on the predictability/frequency of the output, you could record the file size once, and later, if the size hasn't grown by more than $MAX_DIFFERENCE, you know you don't have to worry about it.
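Roughly like this (untested; $logfile and $MAX_DIFFERENCE are made-up names):
use strict;
use warnings;

my $logfile        = '/var/log/output.log';   # made-up path
my $MAX_DIFFERENCE = 1024 * 1024;

my $last_size = -s $logfile;    # remember the size now

# ... later, before processing the file again ...
my $size = -s $logfile;
if ( $size - $last_size > $MAX_DIFFERENCE ) {
    # something suspiciously large was appended; handle it specially
}
else {
    # nothing huge arrived, so it is safe to read the new lines normally
}
$last_size = $size;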
Or patch the program to not insert long lines.
Re: Reading files, skipping very long lines...
by ikegami (Patriarch) on Sep 29, 2005 at 17:28 UTC
The following reads $block_size bytes at a time (efficient), and keeps at most $max_line_size bytes in memory at a time. It assumes $fh is an already-open filehandle.
my $block_size    = 1000;
my $max_line_size = 10000;

my $buf    = '';
my $offset = 0;    # bytes of $buf already scanned for a newline
my $read;
my $line;
my $pos;

READ_LINE:
for (;;) {

    EXTRACT_LINE:
    for (;;) {
        # Is there already a complete line in the buffer?
        $pos = index($buf, $/);
        if ($pos >= 0) {
            $line   = substr($buf, 0, $pos+1, '');
            $offset = length $buf;
            last EXTRACT_LINE;
        }

        FILL_BUF:
        for (;;) {
            # Never hold more than $max_line_size bytes.
            my $to_read;
            if ($offset + $block_size > $max_line_size) {
                $to_read = $max_line_size - $offset;
            } else {
                $to_read = $block_size;
            }

            if (not $to_read) {
                # The line is too long: throw blocks away until we
                # find the newline that ends it.
                SKIP_LONG_LINE:
                for (;;) {
                    $read = read($fh, $buf='', $block_size, $offset=0);
                    die("Unable to read: $!") if not defined $read;
                    if (not $read) {
                        $line   = undef;
                        $offset = 0;
                        last READ_LINE;
                    }
                    $pos = index($buf, $/);
                    if ($pos >= 0) {
                        # Keep whatever follows the long line.
                        substr($buf, 0, $pos+1, '');
                        $offset = $read - ($pos+1);
                        last SKIP_LONG_LINE;
                    }
                }
                next EXTRACT_LINE;
            }

            $read = read($fh, $buf, $to_read, $offset);
            die("Unable to read: $!") if not defined $read;
            if (not $read) {
                # EOF: emit whatever is buffered as the final line.
                if (not $offset) {
                    $line   = undef;
                    $offset = 0;
                    last READ_LINE;
                } else {
                    $line   = $buf;
                    $buf    = '';
                    $offset = 0;
                    last EXTRACT_LINE;
                }
            }

            $pos = index($buf, $/, $offset);
            if ($pos >= 0) {
                $line   = substr($buf, 0, $pos+1, '');
                $offset = length $buf;
                last EXTRACT_LINE;
            }

            $offset += $read;
        }
    }

    # ... do something with $line ...
}
Untested.
Re: Reading files, skipping very long lines...
by Roy Johnson (Monsignor) on Sep 29, 2005 at 17:06 UTC
You'll need to do your own buffering and look for newlines yourself. Read $MAX chars; if there's no newline in it (/\n/), keep reading and throwing away $MAX chars at a time until you find a newline.
If there's a newline (and you're not in the middle of throwing away a superline), print everything up to the newline, and read $MAX-(number of chars after the newline) so that you have a total of $MAX characters to look at. Repeat from the top.
Caution: Contents may have been coded under pressure.
If there's a newline (and you're not in the middle of throwing away a superline), print everything up to the newline,
... and repeat: the buffer you read could contain several newlines at once, e.g. a string like "foo\nbar\nstuff\nblah blah blah blah blah", so you need to keep extracting lines until none remain (see Limbic~Region's detailed post below).
Re: Reading files, skipping very long lines...
by davidrw (Prior) on Sep 29, 2005 at 17:07 UTC
What was your attempt with read() ? It takes a length to read, so you can read chunks at a time .. but i think you have to handle the line breaks yourself ...
Update: I started an attempt with read(), but hit a snag .. i think i need to restart my attempt and read 1 char at a time..
Update2: Not overly impressive coding, but i think this works (i created a long ~7.8M line with for f in `seq 1 1000000` ; do echo -n "blahblahblah" >> /tmp/longline ; done in bash and stuck it in the DATA section and it seemed to work):
use strict;
use warnings;
use Data::Dumper;

use constant MAX_LINE_LENGTH => 25;

my $file = '/etc/hosts';

my @lines;
my $line = '';
while( !eof DATA ){
    my $c;
    read DATA, $c, 1;                 # one character at a time

    # Already over the limit: discard characters until end of line
    if( length($line) > MAX_LINE_LENGTH ){
        $line = '' if $c eq "\n";
        next;
    }

    # End of a line that fit: keep it
    if( $c eq "\n" ){
        push @lines, $line;
        $line = '';
        next;
    }

    $line .= $c;
}
print Dumper \@lines;
__DATA__
this is a line aqwewqe
short
shrt
short2
short3
this is a line asdas
this is a another very ling line lkjkdsa to skip qweqweqwewqewqewqeqwe
this is a line asdasd
this is a line lkjqwe
this is a very ling line lkjkdsa to skip qweqweqwewqewqewqeqwe
this is a line ad as
Re: Reading files, skipping very long lines...
by sauoq (Abbot) on Sep 29, 2005 at 17:18 UTC
Is there a way for me to skip it without having to seek for the end on line?
No, you can't skip it unless you already know the position of the next newline (in which case you could seek past your long line.) You can read it in chunks and just toss the chunks away until you find one with a newline in it. Done right, that should take care of your memory issue.
-sauoq
"My two cents aren't worth a dime.";
Re: Reading files, skipping very long lines...
by Happy-the-monk (Canon) on Sep 29, 2005 at 17:04 UTC
I am uncertain whether I see what you mean. Just as an idea, would the following do what you want?
cat file | perl -le '$max=79; while(<>){print unless length $_ > $max}'
If this would do what you want, apart from the memory issue, then write the while loop properly with open first...
if that still fails, consider using Tie::File.
Cheers, Sören
Happy-the-monk,
I do not believe either one of these approaches will work (if I understand the problem correctly). Some lines are too long to read into a single variable so it is not possible to use length to determine if a line is too long. Using Tie::File would help since it only indexes where the newlines in the file begin, but you still need to read the whole line to determine if it is too long (length $file[42] > 1024 * 1024).
I can see one way it may work though. If there is a way to get at the byte offsets of the newlines, you would only have to subtract the two offsets to determine if the line was too long.
Update: The following is an untested proof-of-concept.
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my $obj = tie( my @file, 'Tie::File', 'file.big' )
    or die "Unable to tie 'file.big': $!";

my $big = 1024 * 1024;

for ( 0 .. $#file - 1 ) {
    my $beg = $obj->offset($_);
    my $end = $obj->offset($_ + 1);
    next if $end - $beg > $big;
    # process $file[$_];
}

# Handle last line as special case
my $beg = $obj->offset($#file);
my $end = -s 'file.big';
if ( $end - $beg <= $big ) {
    # process $file[-1];
}

# Cleanup
undef $obj;
untie @file;