I am parsing millions of text files. Most are relatively small, but some are large enough that slurping them causes an "Out of memory!" error. I have been slurping because I want to save about 200 characters before and after a keyword phrase, and both the surrounding text and the phrase itself may contain newlines. It wasn't clear to me how to do this with line-by-line processing. Here's an example:

input.txt file (with newlines noted):

Here is my text file\n
I want to save a bunch of\n
characters before the keywords\n
for example the keywords might be\n
the phrase: these are my keywords\n
I want to save a bunch of characters\n
after the keywords too so I have\n
context\n
\n
The keywords may appear multiple\n
times in any given file and may\n
span across lines like so: these are\n
my keywords.  This is one reason\n
I was using slurp instead of reading\n
in line by line

I have been slurping the file into a string and using a regular expression to capture a fixed number of characters (30 in this example) before and after the phrase, like so:

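#!C:/Perl/bin/perl
use strict;
use warnings;
use File::Slurp;

# Read the whole file into one string (this is the step that runs out of memory)
my $filetext = read_file("input.txt");

# /s lets . match newlines, /i ignores case, /g finds every occurrence
while ($filetext =~ m{(.{30}(these\s+are\s+my\s+keywords).{30})}gis)
{
    print "$1\n";
}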

This will spit out something like this:

keywords might be
the phrase: these are my keywords
I want to save a bunch of cha
ay
span across lines like so: these are
my keywords.  This is one reason
I was us

Is there a more efficient way to do this (i.e., save 200 characters before and after a keyphrase) than reading the entire file into memory? It seems that reading line by line will not let me easily pull characters from before and after newlines. One workaround I've been considering is to check each file's size and skip the large files, processing them separately, but I imagine there is a better way.
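One direction that might avoid slurping is to read the file in fixed-size chunks and keep a small rolling buffer, so that a phrase or its context can still straddle a chunk boundary. The following is only a rough sketch of that idea, not a tested solution for the real data. The helper name find_contexts and its parameters $chunk_size and $max_kw are made up for illustration. It assumes the matched phrase is never longer than $max_kw characters (so whitespace runs inside it are bounded) and that $chunk_size is at least as large as $context. Unlike the regex above, it returns fewer than $context characters of context when a match sits near the start or end of the file, and it reports matches whose contexts overlap.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): returns every keyword hit with up
# to $context characters on each side while holding only a small buffer.
# Assumes no match is longer than $max_kw and that $chunk_size >= $context.
sub find_contexts {
    my ($path, $context, $chunk_size, $max_kw) = @_;
    my $re = qr/these\s+are\s+my\s+keywords/i;

    open my $fh, '<', $path or die "Cannot open $path: $!";
    my ($buf, $resume, $eof) = ('', 0, 0);
    my @hits;

    until ($eof) {
        # Append the next chunk to whatever tail was kept from last time.
        my $n = read($fh, $buf, $chunk_size, length $buf);
        die "Read failed on $path: $!" unless defined $n;
        $eof = 1 if $n == 0;

        my $deferred = 0;
        pos($buf) = $resume;                  # continue where the last pass stopped
        while ($buf =~ /$re/g) {
            my ($start, $end) = ($-[0], $+[0]);
            if (!$eof && $end + $context > length $buf) {
                # Trailing context is not in the buffer yet: retry after the next read.
                ($resume, $deferred) = ($start, 1);
                last;
            }
            my $from = $start > $context ? $start - $context : 0;
            # substr simply stops at the end of the buffer if the context runs past it.
            push @hits, substr($buf, $from, $end - $from + $context);
            $resume = $end;
        }

        unless ($deferred) {
            # A phrase cut off by the chunk boundary must begin within the last
            # $max_kw characters, so everything before that is fully searched.
            my $tail = length($buf) - $max_kw;
            $resume = $tail if $tail > $resume;
        }

        # Discard everything except what may still be needed as leading context.
        my $drop = $resume - $context;
        if ($drop > 0) {
            substr($buf, 0, $drop, '');
            $resume -= $drop;
        }
    }
    close $fh;
    return @hits;
}

# Possible usage, with 200 characters of context and 64 KB chunks:
#   print "$_\n" for find_contexts("input.txt", 200, 65536, 1024);
```

With this approach memory use should stay near $chunk_size plus $context plus $max_kw characters per file regardless of file size, at the cost of having to pick a sensible upper bound for the phrase length.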

In reply to use regular expressions across multiple lines from a very large input file by rizzy
