Re: Matching in huge files

by Anonymous Monk
on Dec 08, 2004 at 16:58 UTC (#413259)

in reply to Matching in huge files

dws - this is great, and really fast. However, I am having problems when the window boundary happens to split the string being searched for in two, so that it can never be found. How can this be fixed? For example, searching for 'findme':

    window1 = "blah blah blah fin"
    window2 = "dme blah blah blah"

Replies are listed 'Best First'.
Re^2: Matching in huge files
by dws (Chancellor) on Dec 09, 2004 at 17:11 UTC

    I am having problems when the window happens to split the string you are searching for in two, so can never be found - how can this be fixed?

    The algorithm uses a sliding window, and matches strings that fall within that (sliding) window. If you're trying to match a string that doesn't fit in the window, make the window larger. Or if you think you've found a problem, post a test case that demonstrates the failure.

      Yep, it is very fast, but why is that better than this:
          open(F, "<", $file) or die "$file: $!";
          binmode(F);
          undef $/;     # switch off end-of-line separating
          # read file in large chunks
          while (<F>) {
              while ( m/$re/oigsm ) {
                  print "$1\n";
              }
          }
          $/ = '\n';    # switch back to line mode
          close(F);



        but why is that better than this: ...

        My fragment doesn't assume that the huge file will fit in memory, and it matches across read boundaries. Your approach sets up for a single-read slurp.

        In addition to the answer that dws has already given, the original approach is better because it doesn't assume that the input record separator ($/) was previously "\n", and it certainly doesn't set it back to a literal \n (the two characters backslash and n, not a newline character, because of the single quotes). The usual idiom for changing $/ is to make the change inside a block and localise it there with local, so that the previous value is restored automatically when the block exits.

        Also, maybe it's just my unfamiliarity with binmode, but I think that undef-ing $/ puts the read into slurp mode, which means that the outer while loop only ever runs once.
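
        The points in this reply can be checked with a minimal sketch. It is not code from the thread: the in-memory sample data and the helper name count_reads_slurp are made up for illustration. It localises $/ inside a bare block, counts how many times the read loop runs in slurp mode, and compares the lengths of '\n' and "\n".

```perl
use strict;
use warnings;

# Hypothetical in-memory "file" so the sketch runs without touching disk.
my $data = "line one\nline two\nline three\n";

# Slurp inside a bare block: local restores the previous $/ on block exit.
sub count_reads_slurp {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "in-memory open: $!";
    my $reads = 0;
    {
        local $/;          # undef within this block only: slurp mode
        while (<$fh>) {
            $reads++;      # the whole file arrives as a single record
        }
    }
    close($fh);
    return $reads;
}

my $before      = $/;                          # whatever it was on entry
my $slurp_reads = count_reads_slurp($data);
my $restored    = defined $/ && $/ eq $before;

# Single quotes do not interpolate: '\n' is a backslash followed by 'n'.
my $single_len = length('\n');
my $double_len = length("\n");

print "reads in slurp mode: $slurp_reads\n";
print "separator restored:  ", ($restored ? "yes" : "no"), "\n";
print "single-quoted length: $single_len, double-quoted length: $double_len\n";
```

        Because the change to $/ is scoped to the block, the caller never needs to know or guess what the separator was beforehand.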

      Making the window larger only reduces the chance that the string you are looking for is cut in two. There is no real way to prevent it entirely unless you are looking for a fixed-size string. In that case you can always keep a piece of that fixed size from the end of the old window and append the new window to it. If the string straddled the boundary, this rejoins it. Make sure to move your position back accordingly.
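
      The carry-over idea for a fixed-size string could be sketched roughly as follows. This is not dws's fragment (which is not reproduced in this thread); the sub name find_fixed and the $chunk_size parameter are hypothetical, and the sketch only handles a literal string of known length, not an arbitrary regex. Keeping length - 1 bytes is enough, since a complete match can never fit inside a tail that short, which also means no match is reported twice.

```perl
use strict;
use warnings;

# Search a filehandle for a fixed literal string, reading in chunks and
# carrying the last (length - 1) bytes of each window into the next one.
# Returns the byte offsets of every match.
sub find_fixed {
    my ($fh, $needle, $chunk_size) = @_;
    return () if $needle eq '';          # nothing sensible to search for

    my $keep   = length($needle) - 1;    # longest piece that can straddle a boundary
    my $tail   = '';                     # bytes carried over from the previous window
    my $offset = 0;                      # file offset of the first byte of $tail
    my @hits;

    while (read($fh, my $chunk, $chunk_size)) {
        my $window = $tail . $chunk;

        # Report every occurrence in this window, overlapping ones included.
        my $pos = 0;
        while (($pos = index($window, $needle, $pos)) >= 0) {
            push @hits, $offset + $pos;
            $pos++;
        }

        # Keep only the last $keep bytes and move the offset along to match.
        my $drop = length($window) - $keep;
        $drop = 0 if $drop < 0;
        $tail    = substr($window, $drop);
        $offset += $drop;
    }
    return @hits;
}

# The example from the question: an 18-byte chunk splits 'findme' as "fin" / "dme".
my $text = "blah blah blah findme blah blah blah";
open(my $fh, '<', \$text) or die "in-memory open: $!";
my @hits = find_fixed($fh, 'findme', 18);
close($fh);

print "found at offset(s): @hits\n";
```

      Tracking $offset separately is what "move your position back accordingly" amounts to here: reported positions stay relative to the start of the file rather than to the current window.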
