Here is a modified version (with my test parameters - please reset them to match your current ones).

This version adds a second pass over the log file, read one line at a time. That trades extra I/O for less CPU, but it should still be pretty fast.

For each match it prints the line number, plus a bunch of other diagnostic (and strictly unnecessary) info about where the match occurred.

#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using //gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#

use strict;
use constant BLOCKSIZE => 20; ##(8 * 1024);

my @findoffset;
my $file =  "ascii-code.htm";
search( $file, #"bighuge.log",
        sub { print $_[0], " at offset $_[1]\n"; push @findoffset, $_[1]; },
       # "<img[^>]*>");
       "javasc");
       
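# Re-read the file as lines, mapping each match offset to a line.
# $offset is the byte offset just past the end of the current line.
my ($line, $offset, $idx) = (0, 0, 0);
open(my $F, "<", $file) or die "$file: $!";
binmode($F);    # count bytes the same way search() does
while (<$F>){
   last if $idx > $#findoffset;   # nothing (left) to look for
   $line++;
   my $len = length($_);
   # a match at offset X is on this line if X < end-of-line offset
   next unless (($offset+=$len) > $findoffset[$idx]);
   print "$line,$offset,$findoffset[$idx],$len:\t$_";
   $idx++;
}
close ($F);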

#------------------------------------------
sub search {
    my ($file, $callback, @fragments) = @_;

    my $byteoffset = 0;
    
    open(my $F, "<", $file) or die "$file: $!";
    binmode($F);

    # prime the window with two blocks (if possible)
    my $nbytes = read($F, my $window, 2 * BLOCKSIZE);

    my $re = "(" . join("|", @fragments) . ")";

    while ( $nbytes > 0 ) {

        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any). $byteoffset is the file
        # offset of the window's first byte, so adding
        # $-[1] gives the match's offset within the file.
        while ( $window =~ m/$re/oigcs ) {
            $callback->($1, $byteoffset + $-[1]);
        }
        my $pos = pos($window) || 0;    # undef if nothing matched yet

        # grab the next block
        $nbytes = read($F, my $block, BLOCKSIZE);
        last unless $nbytes;    # EOF (0) or read error (undef)

        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $byteoffset += BLOCKSIZE;
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }

    close($F);
}

Update 1: Note - there may be subtle issues (I hate to say bugs) under boundary conditions where multiple matches occur on the same line. Special-case code needs to be added to handle these, if this condition is expected.
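
One way the same-line case could be handled is sketched below. It is only a sketch: lines_for_offsets is a made-up helper name, and it assumes the offsets are exact byte offsets of each match, sorted ascending, and that the file is read with the same byte semantics used when the offsets were collected. The inner loop consumes every offset that falls inside the current line before moving on.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): map sorted byte
# offsets to the lines containing them.
# Returns a list of [ $offset, $line_number, $line_text ].
sub lines_for_offsets {
    my ($fh, @offsets) = @_;
    my @found;
    my ($line, $end, $idx) = (0, 0, 0);
    while ( $idx < @offsets and defined( my $text = <$fh> ) ) {
        $line++;
        $end += length $text;    # offset just past this line
        # consume every offset that falls inside this line
        while ( $idx < @offsets and $offsets[$idx] < $end ) {
            push @found, [ $offsets[$idx], $line, $text ];
            $idx++;
        }
    }
    return @found;
}

# Demo on an in-memory "file"; offsets 4 and 11 are both on line 1.
my $data = "foo javasc javasc\nbar\nbaz javasc\n";
open( my $fh, "<", \$data ) or die "in-memory open: $!";
my @hits = lines_for_offsets( $fh, 4, 11, 26 );
close($fh);
print "$_->[1],$_->[0]:\t$_->[2]" for @hits;
```

Each offset gets its own output row, so two matches on one line print that line twice; de-duplicating on the line number would be an easy change if that is not wanted.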

Update 2: Thinking about this some more leads me to believe this is not the right way to go about it. It would be a lot more efficient to track newlines on the first read, and buffer/capture/print the lines containing the text right at the spot.

In other words, in addition to passing the matched text ($1), the search sub should call back with the line of text, in context. That may require more sliding-window buffering, in case the "line" is split across buffers.
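
A minimal sketch of that single-pass idea follows, untested against a real 2+ GB log. Note that it swaps the sliding window for a simpler carry-over buffer: complete lines are peeled off each block, and the unfinished tail is kept for the next read. That means a match can never span a newline, and memory use grows with the longest line. The name search_lines and the three-argument callback (matched text, line number, line text) are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny so the demo exercises lines split across blocks;
# something like (8 * 1024) would suit real files.
use constant BLOCKSIZE => 8;

# Hypothetical single-pass variant: read fixed-size blocks, count
# newlines as we go, and call back with the whole matching line.
sub search_lines {
    my ($fh, $callback, @fragments) = @_;
    my $alt  = join "|", @fragments;
    my $re   = qr/($alt)/i;
    my $buf  = '';    # unfinished tail carried over from earlier blocks
    my $line = 0;

    while ( read( $fh, my $block, BLOCKSIZE ) ) {
        $buf .= $block;
        # peel off every complete line currently in the buffer
        while ( ( my $nl = index( $buf, "\n" ) ) >= 0 ) {
            my $text = substr( $buf, 0, $nl + 1, '' );
            $line++;
            while ( $text =~ /$re/g ) {
                $callback->( "$1", $line, $text );
            }
        }
    }
    # final line without a trailing newline
    if ( length $buf ) {
        $line++;
        while ( $buf =~ /$re/g ) {
            $callback->( "$1", $line, $buf );
        }
    }
}

# Demo on an in-memory "file".
my $data = "alpha\nbeta JavaScript here\ngamma\njavasc twice javasc";
open( my $fh, "<", \$data ) or die "in-memory open: $!";
my @hits;
search_lines( $fh, sub { push @hits, [@_] }, "javasc" );
close($fh);

for my $h (@hits) {
    ( my $text = $h->[2] ) =~ s/\n\z//;
    print "line $h->[1]: '$h->[0]' in [$text]\n";
}
```

Because the line number and text arrive with the match, no second read of the file is needed.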

"How many times do I have to tell you again and again .. not to be repetitive?"

In reply to Re: Matching lines in 2+ GB logfiles. by NetWallah
in thread Matching lines in 2+ GB logfiles. by dbmathis
