Re: Regexp matching on a multiline file: dealing with line breaks

Hello BlueStarry, and welcome to the Monastery!

If the entire file will fit in memory, a variation on kennethk’s solution is to simply delete the newlines before searching:

#! perl
use strict;
use warnings;

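my $target = 'kitten';
my $string =  do { local $/; <DATA>; };      # slurp the whole DATA section
   $string =~ s/\n//g;                       # join all lines into one long string
my $count  = () = $string =~ /\Q$target/g;   # match in list context, then count the matches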

print "The target string '$target' occurs $count times in the file\n";

__DATA__
sushikitten
ilovethekit
tensushithe
kittenisthe

Output:

14:28 >perl 1474_SoPW.pl
The target string 'kitten' occurs 3 times in the file

14:28 >

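However, as your input file is 5 GB, this approach is probably impractical. In that case you’re going to have to bite the bullet and implement a solution with “strange buffers”, such as a sliding window technique: read the file in chunks and carry the last few characters of each chunk over to the next, so that a match spanning a chunk boundary isn’t lost. Maybe have a look at Data::Iterator::SlidingWindow.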
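For what it’s worth, here is a minimal hand-rolled sketch of the sliding window idea (it does not use Data::Iterator::SlidingWindow). The sub name count_across_lines, its parameters, and the 1 MiB default chunk size are my own inventions for illustration, and I’m assuming a fixed literal target and newline-only line endings. It counts non-overlapping matches, as the /g match above does.

```perl
use strict;
use warnings;

# Count non-overlapping occurrences of $target in the text read from $fh,
# ignoring newlines, while holding only one chunk (plus a short tail) in memory.
# count_across_lines and its parameters are hypothetical names for this sketch.
sub count_across_lines {
    my ($fh, $target, $chunk_size) = @_;
    die "Empty target\n" unless defined($target) && length($target);
    $chunk_size ||= 1 << 20;            # assumed default: 1 MiB per read

    my $keep  = length($target) - 1;    # longest tail that could begin a match
    my $carry = '';
    my $count = 0;

    while (read($fh, my $chunk, $chunk_size)) {
        $chunk =~ tr/\n//d;             # drop newlines, like s/\n//g on the slurped string
        my $buffer   = $carry . $chunk;
        my $last_end = 0;

        while ($buffer =~ /\Q$target\E/g) {
            $count++;
            $last_end = pos($buffer);   # offset just past this match
        }

        # Carry over the last $keep characters, but never characters that
        # were already consumed by a match.
        my $from = length($buffer) - $keep;
        $from = $last_end if $last_end > $from;
        $carry = substr($buffer, $from);
    }
    return $count;
}

# Demo on the sample data, with a tiny chunk size so that matches are
# forced to straddle chunk boundaries.
my $data = "sushikitten\nilovethekit\ntensushithe\nkittenisthe\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my $count = count_across_lines($fh, 'kitten', 4);
close $fh;

print "The target string 'kitten' occurs $count times\n";
```

For a real 5 GB file you would open the file itself instead of the in-memory string and leave the chunk size at something much larger than 4. Memory use then stays at roughly one chunk plus length($target) - 1 characters.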

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: Regexp matching on a multiline file: dealing with line breaks
by Anonymous Monk on Dec 06, 2015 at 13:17 UTC

my $string = do { local $/; <DATA>; };

Re^3: Regexp matching on a multiline file: dealing with line breaks

by Athanasius (Cardinal) on Dec 06, 2015 at 13:43 UTC

We want to read the whole file (in this case, the contents of the __DATA__ section at the end of the script) into the scalar variable $string. By default, evaluating <DATA> in scalar context (the readline operator applied to the DATA filehandle) reads just the next line from the filehandle.

So to read the whole file at once, we need to tell Perl that a “line” is the whole file. In Perl, the special variable $/ (also called $INPUT_RECORD_SEPARATOR and $RS under use English) specifies what terminates a “line.” Its default value is a newline, and undef is a special value which means “read the whole file at once.” See perlvar#Variables-related-to-filehandles.
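A small sketch of the difference, reading from an in-memory filehandle opened on a made-up three-line string (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $text = "first\nsecond\nthird\n";

# With the default $/ ("\n"), a scalar-context read returns one line.
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my $one_line = <$fh>;
close $fh;

# With $/ set to undef, the same kind of read returns everything up to
# end-of-file. The bare block confines the change to these few lines.
my $everything;
{
    local $/ = undef;
    open $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    $everything = <$fh>;
    close $fh;
}

printf "one line: %d characters, whole file: %d characters\n",
    length($one_line), length($everything);
```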

Since $/ is a global variable, changing its value can have far-reaching consequences across a large program. It’s therefore good practice to localize any changes to just that part of the code where they’re required. Hence the idiom of localizing the variable with local (which saves the old value, sets $/ to undef, and restores the old value when the enclosing block exits) and limiting the scope of that change by enclosing it in a block. We could say:

my $string;

{
    local $/;
    $string = <DATA>;
}

but wrapping it up in a do block is neater and more concise.
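If you want to convince yourself that the change really is confined to the block, a quick check along these lines should do it. Again the in-memory string and the variable names are just assumptions for the demonstration:

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my $before = $/;                # normally "\n"
my $undef_inside;

my $slurped = do {
    local $/;                   # $/ is undef until the block exits
    $undef_inside = !defined $/;
    <$fh>;                      # last expression: the value of the do block
};

my $after = $/;                 # restored automatically on block exit
close $fh;

print "inside the block \$/ was undef\n"   if $undef_inside;
print "outside the block \$/ is restored\n" if $before eq $after;
```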

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,
