Re: use regular expressions across multiple lines from a very large input file

by LanX (Saint)
on Dec 05, 2010 at 18:29 UTC [id://875504]


in reply to use regular expressions across multiple lines from a very large input file

Hi

I will only sketch an algorithm and leave the programming to you.

I think you should read and process text chunks of size n, e.g. 1024 or 4096 bytes. ²

Whenever you process one chunk, you need to append the first m bytes of the next chunk, with m = 200 + l, where l is the number of characters of your keyword string minus 1 (that is 21 for "these are my keywords").

This way your regex will match all occurrences where at least the first character of the keyword string is still in the chunk.

Of course you need to normalize the chunks and keywords by replacing s/\s+/ /g ¹

If your regex is too complicated to be normalized, you can still do it by joining two reasonably big (!)³ successive chunks, but then you either need to remember the match positions to exclude duplicate hits or restrict the regex so that matches may only start within the first chunk (e.g. by checking pos).
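
To make this concrete, here is a minimal sketch of the chunk-plus-overlap loop. The file name, chunk size and keyword phrase are just placeholders, and the words of the phrase are matched with \s+ in between, so footnote 3's caveat about unbounded quantifiers applies:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file     = 'huge.txt';                       # placeholder name
    my $keywords = 'these are my keywords';
    my $n        = 4096;                             # chunk size, see footnote 2
    my $m        = 200 + length($keywords) - 1;      # overlap taken from the next chunk

    # match the keyword phrase with \s+ between the words
    my $re = join '\s+', map { quotemeta } split ' ', $keywords;

    open my $fh, '<:raw', $file or die "open $file: $!";

    my $count = 0;
    my $chunk = '';
    read $fh, $chunk, $n;

    while ( length $chunk ) {
        my $next = '';
        read $fh, $next, $n;

        # current chunk plus the first m bytes of the following chunk
        my $window = $chunk . substr( $next, 0, $m );

        while ( $window =~ /$re/g ) {
            # count only matches starting inside the current chunk;
            # matches starting in the overlap are found again next round
            ++$count if $-[0] < length $chunk;
        }

        $chunk = $next;
    }

    print "$count matches\n";

The overlap makes sure a phrase straddling a chunk boundary is still seen in full, and the $-[0] check keeps hits that start in the overlap from being counted twice, since they show up again in the next round.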

Cheers Rolf

1) Now you could even use index instead of a regex (a small sketch follows these footnotes).

2) Here efficiency depends on the block size of your filesystem; see seek for how to read chunks.

3) A chunk must be bigger than the size of the longest possible match. Note that quantifiers like \s+ allow potentially unbounded matches. Are they really wanted??? Either set a reasonable limit like \s{0,20} or you have to normalize your chunks by replacing s/\s+/ /g.
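
As an illustration of footnote 1: once the window is normalized with s/\s+/ /g, the keyword phrase becomes a fixed substring and a plain index loop can replace the regex. A small sketch, reusing the placeholder names ($window, $chunk, $keywords, $count) from the loop above; because normalization shifts offsets, the "starts inside this chunk" test is made against the length of the normalized chunk:

    # normalize the window and, for the boundary test, the chunk alone
    ( my $norm       = $window ) =~ s/\s+/ /g;
    ( my $norm_chunk = $chunk  ) =~ s/\s+/ /g;

    my $pos = 0;
    while ( ( my $hit = index $norm, $keywords, $pos ) >= 0 ) {
        # count only hits that start inside the (normalized) current chunk
        ++$count if $hit < length $norm_chunk;
        $pos = $hit + 1;
    }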

Replies are listed 'Best First'.
Re^2: use regular expressions across multiple lines from a very large input file
by CountZero (Bishop) on Dec 06, 2010 at 06:56 UTC
    In order to speed up the search, I dare to suggest choosing a large value of n, say a value slightly less than the amount that causes the "Out of Memory" error.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      > In order to speed up the search, I dare to suggest choosing a large value of n,

      Don't assume that the bigger the read, the faster it will run; it just doesn't work out that way.

      On my systems, 64kb reads work out marginally best (YMMV):

      C:\test>junk -B=4 < 1gb.dat
      Found 6559 matches in 10.778 seconds using 4 kb reads
      C:\test>junk -B=64 < 1gb.dat
      Found 6559 matches in 10.567 seconds using 64 kb reads
      C:\test>junk -B=256 < 1gb.dat
      Found 6559 matches in 10.574 seconds using 256 kb reads
      C:\test>junk -B=1024 < 1gb.dat
      Found 6559 matches in 10.938 seconds using 1024 kb reads
      C:\test>junk -B=4096 < 1gb.dat
      Found 6559 matches in 10.995 seconds using 4096 kb reads
      C:\test>junk -B=65536 < 1gb.dat
      Found 6559 matches in 12.533 seconds using 65536 kb reads

      Code:

      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];

      our $B //= 64;
      $/ = \( $B * 1024 );
      binmode STDIN, ':raw:perlio';

      my $start = time;
      my $count = 0;

      while( <STDIN> ) {
          ++$count while m[123]g;
      }

      printf "Found %d matches in %.3f seconds using %d kb reads\n",
          $count, time() - $start, $B;

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Yes, these things are always quite tricky to predict.

        So many things may influence it: buffer-size, cache-effects, other processes stealing the disk-controller from your script or pushing your script out to disk, ...

        Trial & Error is probably the only way to go.

        CountZero

      > In order to speed up the search, I dare to suggest choosing a large value of n, say a value slightly less than the amount that causes the "Out of Memory" error.

      I think you mean half that size.

      Cheers Rolf

        Why half the size?

        CountZero

Re^2: use regular expressions across multiple lines from a very large input file
by Anonymous Monk on Dec 05, 2010 at 19:18 UTC
      Yes, more or less.

      As far as I see, this example doesn't handle the maximal possible length of a match, which must be smaller than one block.

      Cheers Rolf

      Great. Thanks for the pointers.
Re^2: use regular expressions across multiple lines from a very large input file
by rizzy (Sexton) on Dec 07, 2010 at 04:02 UTC
    Thanks for the suggestion, Rolf.
