use regular expressions across multiple lines from a very large input file

rizzy has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing millions of text files most of which are relatively small, but some of which cause an "out of memory!" error when using slurp, due to their size. I have been using slurp because I want to save about 200 characters before and after a keyword phrase and the text and the phrase itself may include newlines. It wasn't clear to me how to do this using line-by-line processing. Here's an example:

input.txt file (with newlines noted):

Here is my text file\n
I want to save a bunch of\n
charcaters before the keywords\n
for example the keywords might be\n
the phrase: these are my keywords\n
I want to save a bunch of characters\n
after the keywords too so I have\n
context\n
\n
The keywords may appear multiple\n
times in any given file and may\n
span across lines like so: these are\n
my keywords.  This is one reason\n
I was using slurp instead of reading\n
in line by line

I have been slurping the file to a string and using regular expressions to find a fixed number of characters (in this example 30) before and after like so:

#!C:/Perl/bin/perl -w
use File::Slurp;

my $filetext= read_file("input.txt");

while($filetext=~ m{(.{30}(these\s+are\s+my\s+keywords).{30})}gis)
{
print "$1\n";
}

This will spit out something like this:

keywords might be
the phrase: these are my keywords
I want to save a bunch of cha
ay
span across lines like so: these are
my keywords.  This is one reason
I was us

Is there a more efficient way to do this (i.e., save 200 characters before and after a keyphrase) than reading the entire file into a string? It seems that reading line by line will not let me easily pull characters from before and after newlines. A workaround I've been considering is to check the file size and skip the large files, which I would process separately, but I imagine there is a better way.

Re: use regular expressions across multiple lines from a very large input file by BrowserUk (Patriarch) on Dec 05, 2010 at 18:19 UTC
You need a sliding buffer. A Super Search for that term will turn up various implementations. Here's a simple one implemented using an array of lines:

#! perl -slw
use strict;

my @lines;
my %seen;
while( <DATA> ) {
    push @lines, $_;
    my $buf = join '', @lines;
    if( $buf =~ /(.{30}these\s+are\s+my\s+keywords.{30})/sm ) {
        print "'$1'" unless $seen{ $1 };
        ++$seen{ $1 };
    }
    shift @lines if @lines > 5;
}
__END__
Here is my text file
I want to save a bunch of
charcaters before the keywords
for example the keywords might be
the phrase: these are my keywords
I want to save a bunch of characters
after the keywords too so I have
context

The keywords may appear multiple
times in any given file and may
span across lines like so: these are
my keywords.  This is one reason
I was using slurp instead of reading
in line by line

Which produces:

C:\test>junk
'eywords might be
the phrase: these are my keywords
I want to save a bunch of ch'
'y
span across lines like so: these are
my keywords.  This is one reason
I was u'

You would probably want to make the context at either end optional so you don't miss matches at the start or end of the file where there may not be enough context to match.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
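The closing remark about optional context could be tried out with something like the following. This is only a minimal sketch: `context_hits` is a hypothetical helper name that does not appear in the thread, and it works on a string already in memory (for example the joined line buffer), with the 30-character context used in the examples passed in as a parameter.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return each keyword hit in
# $text with up to $ctx characters of context on either side.
sub context_hits {
    my ( $text, $ctx ) = @_;
    my @hits;

    # .{0,$ctx} instead of .{$ctx} lets a hit near the start or end of the
    # text match with whatever context happens to be available.
    while ( $text =~ /(.{0,$ctx}these\s+are\s+my\s+keywords.{0,$ctx})/gis ) {
        push @hits, $1;
    }
    return @hits;
}

# A hit at the very top of the text, which .{30} on both sides would miss.
print "'$_'\n" for context_hits( "these are\nmy keywords, right at the top", 30 );
```

As with the fixed-width version, two hits that sit closer together than the context width can end up inside one another's context and be reported as one.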
Re^2: use regular expressions across multiple lines from a very large input file by rizzy (Sexton) on Dec 07, 2010 at 03:57 UTC
Thanks. I'll look into sliding buffers. This may solve another problem I'm having with memory leaks each time I slurp.
Re: use regular expressions across multiple lines from a very large input file by LanX (Saint) on Dec 05, 2010 at 18:29 UTC
Hi

I will only sketch an algorithm and leave the programming to you.

I think you should read and process text chunks of size n, e.g. 1024 or 4096 bytes.²

Whenever you process one chunk, you need to append the first m bytes of the next chunk, with m = 200 + l, where l is the number of characters in your keyword string minus 1, that is 20 for "these are my keywords" (21 characters). This way your regex will match all occurrences where at least the first character of the keyword string is still in the chunk.

Of course you need to normalize the chunks and keywords by replacing `s/\s+/ /g`.¹

If your regex is too complicated to be normalized, you can still do it by joining two reasonably big (!)³ successive chunks, but you need either to memorize the match position to exclude duplicated hits or to change the regex to only allow matches starting within the first chunk (e.g. by checking pos).

Cheers Rolf

1) now you could even use index instead of a regex

2) here efficiency depends on the block size of your filesystem. See seek for how to read chunks.

3) a chunk must be bigger than the longest possible match. Quantifiers like `\s+` allow potentially infinitely long matches. Are they really wanted? Either set a reasonable limit like `\s{1,20}` or normalize your chunks by replacing `s/\s+/ /g`.
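One way the chunked scan sketched above might be programmed is shown below. It is a variation rather than a literal transcription: instead of appending a fixed number of bytes from the next chunk, it keeps one growing buffer and only accepts a match once the whole match plus its trailing context has been read, which also avoids duplicated hits. The name `scan_in_chunks`, its parameters, and the bounded `\s{1,20}` pattern are assumptions for illustration, and the text is treated as single-byte.

```perl
use strict;
use warnings;

# Hypothetical helper (names and parameters are assumptions, not from the post).
#   $fh      - handle opened for reading, treated as single-byte text
#   $re      - compiled keyword regex whose matches never exceed $max_len
#   $max_len - upper bound on the length of one keyword match
#   $ctx     - characters of context wanted on each side
#   $chunk   - bytes to request per read
sub scan_in_chunks {
    my ( $fh, $re, $max_len, $ctx, $chunk ) = @_;
    my @hits;
    my $buf  = '';
    my $from = 0;    # offset in $buf where unsearched text begins

    while (1) {
        my $got = read( $fh, $buf, $chunk, length($buf) );
        die "read failed: $!" unless defined $got;
        my $eof = $got == 0;

        # Before EOF, only accept matches that start early enough for the
        # whole match plus its trailing context to be in the buffer already.
        my $limit = $eof ? length($buf) : length($buf) - $max_len - $ctx;

        if ( $limit > $from ) {
            pos($buf) = $from;
            while ( $buf =~ /$re/g ) {
                my ( $s, $e ) = ( $-[0], $+[0] );
                last if $s >= $limit;    # leave it for the next round
                my $start = $s > $ctx ? $s - $ctx : 0;
                push @hits, substr( $buf, $start, $e - $start + $ctx );
                $from = $e;
            }
            $from = $limit if !$eof && $from < $limit;

            # Drop everything except $ctx characters of leading context.
            my $cut = $from - $ctx;
            if ( $cut > 0 ) {
                substr( $buf, 0, $cut, '' );
                $from -= $cut;
            }
        }
        last if $eof;
    }
    return @hits;
}

# Demo on an in-memory handle; 78 = 5 + 20 + 3 + 20 + 2 + 20 + 8 is the
# longest string the bounded pattern below can match.
my $re   = qr/these\s{1,20}are\s{1,20}my\s{1,20}keywords/i;
my $text = ( 'filler text ' x 20 ) . "these are\nmy keywords" . ( ' more filler' x 20 );
open my $fh, '<', \$text or die "open failed: $!";
print "'$_'\n" for scan_in_chunks( $fh, $re, 78, 30, 64 );
```

Memory use stays near the chunk size plus `$max_len` and twice `$ctx`, so the chunk size can be tuned for speed without affecting which hits are reported.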
Re^2: use regular expressions across multiple lines from a very large input file by Anonymous Monk on Dec 05, 2010 at 19:18 UTC
Matching in huge files aka sliding window technique
Re^3: use regular expressions across multiple lines from a very large input file by LanX (Saint) on Dec 05, 2010 at 21:34 UTC
Yes, more or less. As far as I can see, this example doesn't handle the maximum possible length of a match, which must be smaller than one block.

Cheers Rolf
Re^3: use regular expressions across multiple lines from a very large input file by rizzy (Sexton) on Dec 07, 2010 at 04:02 UTC
Great. Thanks for the pointers.
Re^2: use regular expressions across multiple lines from a very large input file by CountZero (Bishop) on Dec 06, 2010 at 06:56 UTC
In order to speed up the search, I dare to suggest choosing a large value of n, say a value slightly less than the amount that causes the "Out of Memory" error.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^3: use regular expressions across multiple lines from a very large input file by BrowserUk (Patriarch) on Dec 06, 2010 at 19:42 UTC
> In order to speed up the search, I dare to suggest to choose a large value of n,

Don't assume that the bigger the read, the faster it will run; it just doesn't work out that way. On my systems, 64kb reads work out marginally best (YMMV):

C:\test>junk -B=4 < 1gb.dat
Found 6559 matches in 10.778 seconds using 4 kb reads

C:\test>junk -B=64 < 1gb.dat
Found 6559 matches in 10.567 seconds using 64 kb reads

C:\test>junk -B=256 < 1gb.dat
Found 6559 matches in 10.574 seconds using 256 kb reads

C:\test>junk -B=1024 < 1gb.dat
Found 6559 matches in 10.938 seconds using 1024 kb reads

C:\test>junk -B=4096 < 1gb.dat
Found 6559 matches in 10.995 seconds using 4096 kb reads

C:\test>junk -B=65536 < 1gb.dat
Found 6559 matches in 12.533 seconds using 65536 kb reads

Code:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

our $B //= 64;
$/ = \( $B *1024 );

binmode STDIN, ':raw:perlio';

my $start = time;
my $count = 0;
while( <STDIN> ) {
    ++$count while m[123]g;
}
printf "Found %d matches in %.3f seconds using %d kb reads\n", $count, time()-$start, $B;
Re^4: use regular expressions across multiple lines from a very large input file by CountZero (Bishop) on Dec 06, 2010 at 23:33 UTC
Re^3: use regular expressions across multiple lines from a very large input file by LanX (Saint) on Dec 06, 2010 at 09:48 UTC
> In order to speed up the search, I dare to suggest to choose a large value of n, say a value slightly less than the amount that causes the "Out of Memory" error.

I think you mean half that size.

Cheers Rolf
Re^4: use regular expressions across multiple lines from a very large input file by CountZero (Bishop) on Dec 06, 2010 at 18:56 UTC
Re^5: use regular expressions across multiple lines from a very large input file by LanX (Saint) on Dec 06, 2010 at 21:12 UTC
Re^2: use regular expressions across multiple lines from a very large input file by rizzy (Sexton) on Dec 07, 2010 at 04:02 UTC
Thanks for the suggestion, Rolf.
Re: use regular expressions across multiple lines from a very large input file by ambrus (Abbot) on Dec 06, 2010 at 11:18 UTC
Did you try reading in paragraph mode (`$/ = ""`)? That should work provided that you don't have very long paragraphs and that your search phrase can't be split across paragraphs.
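For reference, paragraph mode could be used roughly as follows. This is a minimal sketch under the assumptions stated in the reply (paragraphs of modest size, and a phrase that never straddles a blank line); `paragraph_hits` is a hypothetical helper name, and the context it returns is cut off at paragraph boundaries.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): scan a handle one paragraph
# at a time and return each hit with up to $ctx characters of context
# taken from within the same paragraph.
sub paragraph_hits {
    my ( $fh, $ctx ) = @_;
    local $/ = '';    # paragraph mode: a record ends at one or more blank lines
    my @hits;
    while ( my $para = <$fh> ) {
        while ( $para =~ /these\s+are\s+my\s+keywords/gi ) {
            my $start = $-[0] > $ctx ? $-[0] - $ctx : 0;
            push @hits, substr( $para, $start, $+[0] - $start + $ctx );
        }
    }
    return @hits;
}

# Demo on an in-memory handle holding two paragraphs.
my $sample = "intro line\nthe phrase: these are my keywords\nafter text\n\n"
           . "second paragraph, like so: these are\nmy keywords.  The end\n";
open my $fh, '<', \$sample or die "open failed: $!";
print "'$_'\n" for paragraph_hits( $fh, 30 );
```

Only one paragraph is held in memory at a time, so this helps only when the oversized files are actually broken up by blank lines.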
Re^2: use regular expressions across multiple lines from a very large input file by rizzy (Sexton) on Dec 07, 2010 at 04:03 UTC
I initially thought paragraphs might be the way to go, but these things are all formatted differently and some include html.
Re: use regular expressions across multiple lines from a very large input file by locked_user sundialsvc4 (Abbot) on Dec 06, 2010 at 13:33 UTC
Also don’t neglect what existing command-line tools and scripting might be able to do for you. (Even Windows, with their PowerShell, is finally glomming on to this...) For example: `grep -r regex filespec` ... already does a very large part of what you are trying to do. If you could use it simply to grab the matching phrases and “enough of the surrounding real-estate,” you could then filter what `grep` has sent you, to whittle it down into the final answer, using Perl or otherwise.
Re^2: use regular expressions across multiple lines from a very large input file by rizzy (Sexton) on Dec 07, 2010 at 04:01 UTC
Thanks. The problem is I have thousands of tarred/zipped folders of files which I need to unzip one at a time, parse, and then delete. I haven't been able to convince the Unix admin to allow me to store all of these on the server, so I'm using my machine, which is running Windows.