BlueStarry has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks and everyone!

For the sake of information this question has been posted on stackexchange too.

I'm dealing with regexps on a text file. The problem is that I need to match strings that are split across 2 lines. Something like this:

Regexp: /kitten/ (or something more complicated; the problem is the same)
Text file (this is only an example; my original file is a huge 5 GB text file with long lines):

sushikitten   --> match
ilovethekit   --> I've lost one, no match (BAD)
tensushithe
kittenisthe   ---> ok again

A solution that doesn't take chunks of the file and load them into a string, or use strange buffers and things like that, would be great, because sometimes the matching string in my case can be as long as *multiple* lines of the file.

Re: Regexp matching on a multiline file: dealing with line breaks
by Athanasius (Archbishop) on Dec 06, 2015 at 04:48 UTC

    Hello BlueStarry, and welcome to the Monastery!

    If the entire file will fit in memory, a variation on kennethk’s solution is to simply delete the newlines before searching:

    #! perl
    use strict;
    use warnings;

    my $target = 'kitten';
    my $string = do { local $/; <DATA>; };
    $string =~ s/\n//g;
    my $count = () = $string =~ /\Q$target/g;
    print "The target string '$target' occurs $count times in the file\n";

    __DATA__
    sushikitten
    ilovethekit
    tensushithe
    kittenisthe

    Output:

    14:28 >perl 1474_SoPW.pl
    The target string 'kitten' occurs 3 times in the file

    14:28 >

    However, as your input file is 5 GB, this approach is probably impractical. In which case you’re going to have to bite the bullet and implement a solution with “strange buffers” — such as a sliding window technique. Maybe have a look at Data::Iterator::SlidingWindow.
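As a rough illustration of the sliding window idea, here is a minimal sketch that does not use Data::Iterator::SlidingWindow. It assumes the target is a literal string (not a general regex) and that individual lines fit in memory. The sub name count_across_lines is made up for this example. Only the last length($target) - 1 characters of what has already been scanned are carried over, so a match broken across one or several line ends is still seen, and no match is counted twice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of the literal $target in the
# text read from $fh, as if the newlines were not there.
sub count_across_lines {
    my ($fh, $target) = @_;
    return 0 unless length $target;      # nothing to search for

    my $keep  = length($target) - 1;     # characters carried between lines
    my $tail  = '';                      # end of the text scanned so far
    my $count = 0;

    while (my $line = <$fh>) {
        chomp $line;
        my $window = $tail . $line;
        $count += () = $window =~ /\Q$target\E/g;

        # Keep only the last $keep characters: too short to hold a whole
        # match, long enough to hold the start of a broken one.
        $tail = length($window) > $keep
              ? substr($window, length($window) - $keep)
              : $window;
    }
    return $count;
}

# Demo on the sample data from the question, via an in-memory filehandle.
my $sample = "sushikitten\nilovethekit\ntensushithe\nkittenisthe\n";
open my $in, '<', \$sample or die "Cannot open in-memory handle: $!";
print count_across_lines($in, 'kitten'), "\n";   # prints 3
close $in;
```

For a real file, open it by name instead of using the in-memory handle. One caveat: for a target that can overlap itself (such as 'aa'), the total may differ from what a single /g pass over the joined text would report, because the window is restarted on every line.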

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Hi can you please elaborate more on this statement: my $string =  do { local $/; <DATA>; };

        We want to read the whole file (in this case, the contents of the __DATA__ section at the end of the script) into the scalar variable $string. Using the diamond operator, a call to <DATA> reads the next line from the filehandle.

        So to read the whole file at once, we need to tell Perl that a “line” is the whole file. In Perl, the special variable $/ (also called $INPUT_RECORD_SEPARATOR and $RS) specifies what terminates a “line,” and undef is a special value which means “read the whole file at once.” See perlvar#Variables-related-to-filehandles.

        Since $/ is a global variable, changing its value can have far-reaching consequences across a large program. It’s therefore good practice to localize any changes to just that part of the code where they’re required. Hence the idiom of declaring the variable with the local declaration and limiting the scope of that declaration by enclosing it in a block. We could say:

        my $string; { local $/; $string = <DATA>; }

        but wrapping it up in a do block is neater and more concise.
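To see the localization at work, here is a small sketch. The sub name slurp_rest and the in-memory filehandle are only there for illustration and are not part of the script above. It reads one ordinary line, slurps the remainder, and then checks that $/ is back to its default once the block has been left.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return everything that is left to read on $fh.
sub slurp_rest {
    my ($fh) = @_;
    my $content = do { local $/; <$fh> };   # $/ is undef only inside the block
    return $content;
}

my $text = "one\ntwo\nthree\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my $first = <$fh>;             # normal line read: "one\n"
my $rest  = slurp_rest($fh);   # everything else: "two\nthree\n"
close $fh;

print "first line: $first";
print "rest has ", length($rest), " characters\n";
print "separator restored\n" if defined $/ && $/ eq "\n";
```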

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Regexp matching on a multiline file: dealing with line breaks
by kennethk (Abbot) on Dec 05, 2015 at 22:29 UTC
    Thank you for letting us know you cross posted; please include the link next time.

    In this case, I will assume you are loading the entire file into memory; if you are not, you will need to work with "strange buffers and things like that". It looks like your format has no whitespace dependence, so one easy way to do it is to allow arbitrary whitespace in your regex:

    /k\s*i\s*t\s*t\s*e\s*n/;
    which could be written more maintainably as:
    my $string = 'kitten';
    my $regex = join '\s*', split //, $string;
    /$regex/;
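A self-contained sketch of the same idea, applied to the sample data held in a single string. The sub name spaced_regex is invented for this example, and the quotemeta call is an addition that assumes the target is a literal string, so that characters such as '.' or '*' are not treated as regex syntax. Note that \s* also allows spaces and tabs between the characters, not just newlines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches $literal with optional
# whitespace (including newlines) between any two of its characters.
sub spaced_regex {
    my ($literal) = @_;
    my $pattern = join '\s*', map { quotemeta($_) } split //, $literal;
    return qr/$pattern/;
}

my $text  = "sushikitten\nilovethekit\ntensushithe\nkittenisthe\n";
my $regex = spaced_regex('kitten');

# Count all matches, including the one broken as "kit\nten".
my $count = () = $text =~ /$regex/g;
print "The target occurs $count times\n";   # prints 3 for this sample
```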

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Regexp matching on a multiline file: dealing with line breaks
by Laurent_R (Canon) on Dec 06, 2015 at 09:45 UTC
    Well, your file is fairly large and may or may not fit in memory. If it does fit, then a regex allowing newlines and using the g modifier will do, as shown by other monks.

    If it does not fit in memory, then you probably have to bite the bullet and use "strange buffers and things like that". Don't be too afraid of that, though: these "strange things", such as a sliding window, do not need to be very complicated and can be implemented in just 3 or 4 lines of code.

      Many thanks to everyone.

      I'll go with sliding windows, but first I've got an idea of my own, though I don't know if it's correct. My original file is divided into many "paragraphs", each of them starting with a special line like this:

      >Header
      What if I load these chunks into memory (each in a single string?), since I'm sure they'll fit, and work with them one at a time, ignoring the \n?
        Yes, by all means, if you can identify sections or chunks where you can be sure that there cannot be an overlapping match on the chunk boundary, then you don't even need a sliding window: just load and process one chunk after another just the same way you've been told before for the whole file, it is even simpler than a sliding window.

        As Laurent_R says, this is an excellent strategy. Have a look at the entry for $INPUT_RECORD_SEPARATOR (usually spelled just $/) in perlvar. For example:

        #! perl
        use strict;
        use warnings;

        my $target = 'kitten';
        my $count  = 0;

        {
            # Localize the change so $/ is restored after this block
            local $/ = ">Header\n";

            while (my $string = <DATA>)
            {
                $string =~ s/\n//g;
                print "string is '$string'\n";
                $count += () = $string =~ /\Q$target/g;
            }
        }

        print "The target string '$target' occurs $count times in the file\n";

        __DATA__
        >Header
        sushikitten
        ilovethekit
        tensushithe
        kittenisthe
        >Header
        sushikittAn
        ilovethekit
        tensushithe
        kittBnisthe

        Output:

        23:11 >perl 1474_SoPW.pl
        string is '>Header'
        string is 'sushikittenilovethekittensushithekittenisthe>Header'
        string is 'sushikittAnilovethekittensushithekittBnisthe'
        The target string 'kitten' occurs 4 times in the file

        23:11 >

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,