in reply to Re^3: Out of Memory - Line of TXT too large
in thread Out of Memory - Line of TXT too large

First, thanks for the responses. I really appreciate the help.

So I've been fiddling with it, and both read and readline bring in 50 bytes at a time as described (I can get that part working). However, all of the data I need is within the first 100 characters (I was using 50 before just to get the hang of it).

Right now, using read and readline, I am grabbing the first 100 characters, then the next 100 characters, which grabs text either from the same line or from the next line. Would there be a way to grab the first 100 characters of a line, then move on to the next line, ignoring everything after the first 100 characters of that line? Basically, truly only looking at the first 100 characters of each line?

For example, here are the first 100 characters of the text file I am looking at (for the line I want):

{|2013-11-26_11.50.21|LOGIN|1384639152462.653|ATOMIC|2013-11-26_11.50.21|#|default|(nameRedacted)|B0+9D+B

All the data I want is within the first 100 characters. I just need to grab the date and name of the person for every line containing LOGIN.

Other lines of the text file contain non-login information that I don't need, so I need to search for lines that contain LOGIN within the first 100 characters (parse them and grab the data) and ignore any line that doesn't.
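
Once a line's first 100 characters are in hand, pulling the date and name out of a LOGIN record could look something like this. A sketch only: the field positions (1 for the date, 8 for the name) are guessed from the sample record above and would need checking against the real file.

```perl
use strict;
use warnings;

# Sketch: extract the date and name from the first 100 characters of
# a LOGIN record. Field positions 1 (date) and 8 (name) are guesses
# based on the sample line above; adjust them for the real file.
sub parse_login {
    my ($head) = @_;                  # first 100 characters of a line
    return unless $head =~ /LOGIN/;
    my @fields = split /\|/, $head;
    return ($fields[1], $fields[8]);  # (date, name)
}

my ($date, $name) = parse_login(
    '{|2013-11-26_11.50.21|LOGIN|1384639152462.653|ATOMIC|'
  . '2013-11-26_11.50.21|#|default|(nameRedacted)|B0+9D+B'
);
print "$date $name\n";   # 2013-11-26_11.50.21 (nameRedacted)
```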


Re^5: Out of Memory - Line of TXT too large
by roboticus (Chancellor) on Jan 02, 2014 at 14:04 UTC

    MajinMalak:

    There's nothing magical about 50. Choose a larger number so you can zip through the file faster. Say 100,000 for instance.

    Note: One difficulty to be aware of is that a buffer boundary may fall right in the middle of where your regexp would match, causing you to miss the data. One fix: when you read the next buffer, keep the last X characters of the previous buffer and prepend them to the new buffer before trying the match. Something like:

    $/ = \100000;        # read fixed 100,000-byte records, not lines
    my $prev = '';
    while (<$FH>) {
        my $line = $prev . $_;
        if ($line =~ /a funky regex/) {
            # do some work here
        }
        # Adjust 100 upwards if the regex can match something
        # longer than 100 chars
        $prev = substr($line, -100);
    }

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re^5: Out of Memory - Line of TXT too large
by Corion (Patriarch) on Jan 02, 2014 at 14:00 UTC

    No - you don't know where a line ends until you have read everything up to the newline character.

    You can speed up the process by remembering the offsets in the file where each new line begins (tell), or by making an educated guess about where the line could end, seeking there, and reading from that point.
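
A minimal sketch of the tell/seek idea, using a throwaway temp file with short lines purely for illustration. (For the huge-line case this thread is about, the offset index would have to be built while reading fixed-size chunks rather than whole lines.)

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sketch: record the byte offset of each line start with tell(),
# then seek() straight back to any line later and read only its
# first 100 bytes.
my ($fh, $tmpname) = tempfile();
print {$fh} "first line\n", "second line\n", "third line\n";
seek $fh, 0, 0;

my @offsets;                         # byte offset of each line start
while (1) {
    push @offsets, tell $fh;
    defined readline($fh) or do { pop @offsets; last };
}

# Jump straight to the third line and read at most 100 bytes of it.
seek $fh, $offsets[2], 0;
read $fh, my $head, 100;
print $head;                         # "third line\n"
```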

Re^5: Out of Memory - Line of TXT too large
by Anonymous Monk on Jan 02, 2014 at 15:13 UTC

    Keep track of your integer byte position within the file as you read it in large, arbitrarily-sized chunks. Scan each chunk for a newline character (if line endings are a two-byte sequence such as CRLF, look for the first byte of the pair). If you find one, calculate the offset within the file where it falls, then seek to that point. Rinse and repeat.
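
A sketch of that chunked approach, assuming plain "\n" line endings and a trailing newline: read fixed-size chunks, track the byte position, and whenever a newline finishes a line, seek back and grab at most the first 100 bytes of it, so even a gigantic line is never held in memory whole. The chunk size of 64 is deliberately tiny to exercise the logic; a real run would use something much larger.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a demo file: a short line, one 500-char monster line, and a
# LOGIN-ish line. Assumes plain "\n" endings and a trailing newline.
my ($fh, $tmpname) = tempfile();
print {$fh} "short\n", ('x' x 500) . "\n", "LOGIN line\n";
seek $fh, 0, 0;

my $pos        = 0;   # byte position of the start of the buffer
my $line_start = 0;   # byte position where the current line began
my @heads;            # first <=100 bytes of every line

while (my $n = read($fh, my $buf, 64)) {
    my $off = 0;
    while ((my $nl = index($buf, "\n", $off)) >= 0) {
        my $line_len = $pos + $nl - $line_start;
        # Seek back and capture at most the first 100 bytes of the
        # line that just ended, then resume chunked reading where
        # this chunk left off.
        seek $fh, $line_start, 0;
        read $fh, my $head, ($line_len < 100 ? $line_len : 100);
        push @heads, $head;
        seek $fh, $pos + $n, 0;
        $line_start = $pos + $nl + 1;
        $off = $nl + 1;
    }
    $pos += $n;
}
print scalar(@heads), "\n";   # 3 lines found
```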