in reply to Regexp matching on a multiline file: dealing with line breaks

Well, your file is fairly large and may or may not fit in memory. If it does fit, then a regex allowing newlines and using the g modifier will do, as shown by other monks.
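For the in-memory case, here is a minimal sketch of what "a regex allowing newlines" could look like. The helper name count_across_newlines and the sample text are made up for illustration; the idea is simply to permit an optional \n between any two characters of the target.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of $target in $text, allowing
# an optional newline between any two characters of the target.
sub count_across_newlines {
    my ($text, $target) = @_;

    # 'kitten' becomes k\n?i\n?t\n?t\n?e\n?n (each character is quoted first)
    my $pattern = join '\n?', map { quotemeta } split //, $target;

    # List-context match with /g, counted via the "= () =" idiom
    my $count = () = $text =~ /$pattern/g;
    return $count;
}

# Sample text, assumed to be the whole file slurped into one string
my $text = "sushikitten\nilovethekit\ntensushithe\nkittenisthe\n";
print count_across_newlines($text, 'kitten'), "\n";    # prints 3
```

Removing the newlines from the string first and then matching the plain target would work just as well; the pattern above only avoids modifying the data.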

If it does not fit in memory, then you will probably have to bite the bullet and use strange buffers and things like that. Don't be too afraid of that: these "strange things", such as a sliding window, do not need to be very complicated and can be implemented in just 3 or 4 lines of code.
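To give an idea, here is one possible sliding window, as a sketch rather than a definitive implementation. The helper name count_with_window is hypothetical. It assumes the target is non-empty and cannot overlap with itself, and it keeps only the last length($target) - 1 characters between lines, which is the longest piece of a match that can straddle a line break.

```perl
use strict;
use warnings;

# Hypothetical helper: count $target in the stream $fh, ignoring newlines,
# while keeping only a short tail of already-seen text in memory.
sub count_with_window {
    my ($fh, $target) = @_;
    my $keep  = length($target) - 1;    # longest partial match worth remembering
    my $tail  = '';
    my $count = 0;

    while (my $line = <$fh>) {
        chomp $line;
        my $window = $tail . $line;
        $count += () = $window =~ /\Q$target\E/g;

        # The tail is shorter than the target, so no complete match
        # can sit inside it and be counted a second time.
        $tail = length($window) > $keep
              ? substr($window, length($window) - $keep)
              : $window;
    }
    return $count;
}

# Usage with an in-memory filehandle standing in for the real file
my $data = "sushikitten\nilovethekit\ntensushithe\nkittenisthe\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print count_with_window($fh, 'kitten'), "\n";    # prints 3
close $fh;
```

The actual window logic is the four lines inside the while loop; everything else is scaffolding.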

Re^2: Regexp matching on a multiline file: dealing with line breaks
by BlueStarry (Novice) on Dec 06, 2015 at 09:55 UTC

    Many thanks to everyone.

    I'll go with sliding windows, but first I have an idea of my own, though I don't know if it's correct. My original file is divided into many "paragraphs", each of them starting with a special line like this:

    >Header
    What if I load these chunks into memory (in a single string?), since I'm sure they'll fit, and work with them one at a time, ignoring the \n?
      Yes, by all means. If you can identify sections or chunks where you can be sure that there cannot be an overlapping match on the chunk boundary, then you don't even need a sliding window: just load and process one chunk after another, the same way you've been told before for the whole file. It is even simpler than a sliding window.

      As Laurent_R says, this is an excellent strategy. Have a look at the entry for $INPUT_RECORD_SEPARATOR (usually spelled just $/) in perlvar. For example:

      #! perl
      use strict;
      use warnings;

      my $target = 'kitten';
      my $count  = 0;

      {
          # Read one ">Header"-delimited record per iteration;
          # local restores the default $/ when the block exits
          local $/ = ">Header\n";

          while (my $string = <DATA>)
          {
              $string =~ s/\n//g;    # join the lines of the record
              print "string is '$string'\n";
              $count += () = $string =~ /\Q$target/g;
          }
      }

      print "The target string '$target' occurs $count times in the file\n";

      __DATA__
      >Header
      sushikitten
      ilovethekit
      tensushithe
      kittenisthe
      >Header
      sushikittAn
      ilovethekit
      tensushithe
      kittBnisthe

      Output:

      23:11 >perl 1474_SoPW.pl
      string is '>Header'
      string is 'sushikittenilovethekittensushithekittenisthe>Header'
      string is 'sushikittAnilovethekittensushithekittBnisthe'
      The target string 'kitten' occurs 4 times in the file

      23:11 >

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        Thank you very, very much. I REALLY appreciate your help and dedication on my matter.
        Can I ask you a question? I'm having trouble fitting your code to mine because in real life my ">Header" changes every time. It is something like />(.)+?\n/

        I've tried to put the regular expression inside $/ but it doesn't seem to work.

        I also need to save info from the header, which complicates things further.