An algorithm that might work is this. Build an array from the lines of the small file, having removed the timestamps.

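    open my $small, '<', 'smallfile' or die "Couldn't open smallfile; $!\n";
    # Strip the leading hh:mm timestamp. The block has to return $_ itself,
    # because s/// returns the substitution count rather than the string.
    my @find = map { s/^\d{2}:\d{2}//; $_ } <$small>;
    close $small or warn "Couldn't close smallfile; $!\n";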

Then it's just a case of running through the large file one line at a time, stripping the timestamp and checking the line against the next line in the array. If the line matches, you increment the match count; if it fails, you reset it to zero and continue.

    open my $big, '<', 'bigfile' or die "Couldn't open bigfile; $!\n";
    my ( $match, $matched ) = ( 0, 0 );
    while ( <$big> ) {
        s/^\d{2}:\d{2}//;
        # On a mismatch, reset and move on. This has to be a block:
        # "$match=0 and next" never reaches the next, because the
        # assignment yields 0, which is false.
        unless ( $_ eq $find[ $match++ ] ) { $match = 0; next; }
        next unless $match == @find;
        # If you got this far, you've matched all the lines in the smallfile
        # as contiguous lines in the bigfile. So do something...
        # If you need to know where the sequence of matching lines started
        # (in the bigfile), $. - @find + 1 will tell you.
        $matched = 1;
        $match   = 0;    # start over, ready for another occurrence
    }
    close $big or warn "Couldn't close bigfile; $!\n";
    print "Didn't find the contents of smallfile in bigfile\n" unless $matched;
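One caveat with resetting on a mismatch: the line that failed is never re-tested against the first line of the smallfile, so a sequence that begins part-way through a failed attempt can be missed. A minimal sketch of one way around that, which swaps the counter for a rolling window of the last N lines, follows. The names `strip_stamp` and `find_sequence` are hypothetical, and it works on arrays of lines rather than filehandles purely to keep it self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading hh:mm timestamp from a copy of the line.
sub strip_stamp {
    my ($line) = @_;
    $line =~ s/^\d{2}:\d{2}//;
    return $line;
}

# Hypothetical helper: return the 1-based line number in @$big at which the
# lines of @$small first appear contiguously, or 0 if they never do.
sub find_sequence {
    my ( $small, $big ) = @_;
    my @find = map { strip_stamp($_) } @$small;
    return 0 unless @find;

    my @window;          # the last scalar(@find) lines seen so far
    my $line_no = 0;
    for my $raw (@$big) {
        $line_no++;
        push @window, strip_stamp($raw);
        shift @window if @window > @find;
        next unless @window == @find;

        # Compare the window with the wanted lines, element by element.
        my $same = 1;
        for my $k ( 0 .. $#find ) {
            if ( $window[$k] ne $find[$k] ) { $same = 0; last; }
        }
        return $line_no - scalar(@find) + 1 if $same;
    }
    return 0;
}
```

This costs up to N comparisons per line of the bigfile, which should be negligible for a smallfile of 10 to 20 lines.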

Two possible problems arise because you were vague about the requirements.

  1. You mention 10 .. 20 lines in the small file but only show 2 lines in your sample. If this was to save space, and you always want to match every line in the smallfile before you decide you have a match, great, this will work OK. If you need to match any one of a series of sequences held in the smallfile, you'd need to decide how you will determine how many lines make a sequence.

    An array of arrays could be one way forward if this is the case. You would then also need another scalar to index your way through the AoA.

  2. You mention, but don't elucidate upon, the idea of approximate matching. Without further information on how approximate and in what way, this is difficult to address. But, for example, you might build a regex from the words contained in each of the lines in the smallfile, possibly excluding common and/or small words, something like this.

(Assume you already populated the @find array as above.)

    my @excluded = qw( a did encountered process );    # tailor as appropriate
    my %skip = map { lc($_) => 1 } @excluded;           # for quick lookups
    for my $line (@find) {
        local $" = '.*?';
        # break the line into an array of words minus exclusions
        my @words = grep { !$skip{ lc $_ } } $line =~ m/\b\w+\b/g;
        # replace each line with a fuzzy-matching compiled regex
        # (no /o here: each line needs its own pattern)
        $line = qr"@words"i;
    }

Then the comparison inside the loop changes from

    $_ eq $find[ $match++ ]

to

    $_ =~ $find[ $match++ ]

Using this process, your two line sample would become regexes

    (?i-xsm:abc.*?problem)
    (?i-xsm:abc.*?restart)

and would case-independently match any line containing the process name and the second word, in that order, regardless of intervening words.

By tailoring the list of excluded words to your trace file, this should be a fairly powerful fuzzy-match mechanism.
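As a self-contained illustration of the same idea, here is a minimal sketch that wraps the regex construction in a function. The name `fuzzy_regex` is hypothetical, and it uses an explicit `join` rather than `$"`, plus `quotemeta` as a precaution, so treat it as one possible arrangement rather than the only one.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one line into a case-insensitive regex that
# matches its significant words in order, with anything in between.
# Returns nothing if every word was excluded, since an empty pattern
# would match every line.
sub fuzzy_regex {
    my ( $line, @excluded ) = @_;
    my %skip  = map { lc($_) => 1 } @excluded;
    my @words = grep { !$skip{ lc $_ } } $line =~ /\b\w+\b/g;
    return unless @words;
    my $pattern = join '.*?', map { quotemeta } @words;
    return qr/$pattern/i;
}

# Usage, assuming the timestamp has already been stripped:
my $re = fuzzy_regex( " abc process encountered a problem\n",
    qw( a did encountered process ) );
print "fuzzy match\n" if "ABC hit a nasty Problem" =~ $re;
```

Note that newer perls stringify such a regex as `(?^i:abc.*?problem)` rather than `(?i-xsm:abc.*?problem)`; the matching behaviour is the same.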

Another idea I had would be to construct a regex from the lines something like

    my %skip = map { lc($_) => 1 } @excluded;
    for my $line (@find) {
        # break the line into an array of words minus exclusions
        my @words = grep { !$skip{ lc $_ } } $line =~ m/\b\w+\b/g;
        # word breaks to avoid partial-word matches
        local $" = '\b|\b';
        # replace each line with a multi-matching compiled regex
        # (no /o here: each line needs its own pattern)
        $line = qr"\b@words\b"i;
    }

then use /g to match as many words as possible and obtain a count of the number.

    my $re = $find[ $match++ ];
    my $n  = () = $_ =~ /$re/g;    # count how many words matched
    unless ( $n > $some_predetermined_number ) { $match = 0; next; }

With this method you would probably need to work out a 'minimum words to be matched number' on a line by line basis in the smallfile. They would probably be best appended to each line and parsed at the same time the timestamp is stripped.
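A minimal sketch of that per-line threshold follows. It assumes a made-up smallfile format in which the minimum count is appended after a `|`, as in `12:30 abc process encountered a problem|2`. The separator and the names `parse_rule` and `line_matches` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical smallfile format: "hh:mm text|N", where N is the minimum
# number of word matches required. N defaults to 1 if it is absent.
sub parse_rule {
    my ( $line, @excluded ) = @_;
    chomp $line;
    $line =~ s/^\d{2}:\d{2}//;                       # strip the timestamp
    my $min = $line =~ s/\|(\d+)\s*$// ? $1 : 1;     # strip and keep the count
    my %skip  = map { lc($_) => 1 } @excluded;
    my @words = grep { !$skip{ lc $_ } } $line =~ /\b\w+\b/g;
    return unless @words;                            # nothing left to match on
    my $alt = join '|', map { quotemeta } @words;
    return { re => qr/\b(?:$alt)\b/i, min => $min };
}

# True if at least $rule->{min} word matches are found in $line.
sub line_matches {
    my ( $rule, $line ) = @_;
    my $re = $rule->{re};
    my $n  = () = $line =~ /$re/g;                   # count the matches
    return $n >= $rule->{min};
}
```

The count is of matches, not of distinct words, so a word that appears twice in the bigfile line counts twice. If that matters, collect the matches into a hash and count its keys instead.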


Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!

In reply to Re: Comparing succeeding lines in two files. by BrowserUk
in thread Comparing succeeding lines in two files. by Anonymous Monk
