in reply to Re: Regexp matching on a multiline file: dealing with line breaks
in thread Regexp matching on a multiline file: dealing with line breaks

Many thanks to everyone.

I'll go with sliding windows, but first I have an idea of my own, though I don't know if it's correct. My original file is divided into many "paragraphs", each of them starting with a special line like this:

>Header
What if I load these chunks into memory (each in a single string?), since I'm sure they'll fit, and work with them one at a time, ignoring the \n?

Re^3: Regexp matching on a multiline file: dealing with line breaks
by Laurent_R (Canon) on Dec 06, 2015 at 10:01 UTC
    Yes, by all means. If you can identify sections or chunks where you can be sure that no match can overlap a chunk boundary, then you don't even need a sliding window. Just load and process one chunk after another, the same way you were shown earlier for the whole file. It is even simpler than a sliding window.
Re^3: Regexp matching on a multiline file: dealing with line breaks
by Athanasius (Archbishop) on Dec 06, 2015 at 13:12 UTC

    As Laurent_R says, this is an excellent strategy. Have a look at the entry for $INPUT_RECORD_SEPARATOR (usually spelled just $/) in perlvar. For example:

    #! perl
    use strict;
    use warnings;

    my $target = 'kitten';
    my $count  = 0;
    $/ = ">Header\n";

    {
        local $/ = ">Header\n";

        while (my $string = <DATA>)
        {
            $string =~ s/\n//g;
            print "string is '$string'\n";
            $count += () = $string =~ /\Q$target/g;
        }
    }

    print "The target string '$target' occurs $count times in the file\n";

    __DATA__
    >Header
    sushikitten
    ilovethekit
    tensushithe
    kittenisthe
    >Header
    sushikittAn
    ilovethekit
    tensushithe
    kittBnisthe

    Output:

    23:11 >perl 1474_SoPW.pl
    string is '>Header'
    string is 'sushikittenilovethekittensushithekittenisthe>Header'
    string is 'sushikittAnilovethekittensushithekittBnisthe'
    The target string 'kitten' occurs 4 times in the file

    23:11 >
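
The output shows that each record still ends with the ">Header" separator, and that the first record is nothing but the separator. A minimal sketch of one way to tidy that up, assuming the same data layout: chomp removes a trailing copy of whatever $/ currently holds, and empty records can be skipped. The sub name count_in_records and the in-memory filehandle are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of $target in records separated
# by the literal string $sep, ignoring line breaks inside each record.
sub count_in_records {
    my ($fh, $sep, $target) = @_;
    local $/ = $sep;
    my $count = 0;
    while (my $record = <$fh>) {
        chomp $record;               # removes a trailing $sep, if present
        $record =~ s/\n//g;          # join the lines of the record
        next unless length $record;  # the first record is just the separator
        $count += () = $record =~ /\Q$target\E/g;
    }
    return $count;
}

# Same sample data as above, read from an in-memory filehandle.
my $data = ">Header\nsushikitten\nilovethekit\ntensushithe\nkittenisthe\n"
         . ">Header\nsushikittAn\nilovethekit\ntensushithe\nkittBnisthe\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $n = count_in_records($fh, ">Header\n", 'kitten');
print "$n\n";    # prints 4
```

This still relies on the separator being a fixed string, so it only applies when every header line is identical.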

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Thank you very much, I really appreciate your help and dedication on my matter.
      Can I ask you a question? I'm having trouble fitting your code to mine because in real life my ">Header" changes every time. It is something like />(.)+?\n/

      I've tried to put the regular expression inside $\ but it doesn't seem to work.

      I also need to save information from the header, which complicates things further.
        $/ and $\ are two different variables. Input ≠ output.

        Also, read $/:

        Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
        What might work, though, is
        $/ = "\n>";

        You'll need to remove the rest of the header from the block, though.
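
A minimal sketch of that suggestion, assuming every record starts with a line beginning with ">" and that ">" never starts a line inside a record body. With $/ set to "\n>", each read returns one header line plus its body; the header is split off and kept, and the remaining lines are joined. The sub name read_records and the sample header texts are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [header, body] pairs, where body
# is the record's text with all line breaks removed.
sub read_records {
    my ($fh) = @_;
    local $/ = "\n>";            # a record ends where the next header begins
    my @records;
    while (my $chunk = <$fh>) {
        chomp $chunk;            # strip the trailing "\n>", if present
        $chunk =~ s/\A>//;       # only the very first record keeps its ">"
        my ($header, $body) = split /\n/, $chunk, 2;
        $body = '' unless defined $body;
        $body =~ s/\n//g;        # join the lines of the body
        push @records, [$header, $body];
    }
    return @records;
}

# Sample data with a different header on each record.
my $text = ">Header one\nsushikitten\nilovethekit\ntensushithe\nkittenisthe\n"
         . ">Header two\nsushikittAn\nilovethekit\ntensushithe\nkittBnisthe\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $target = 'kitten';
for my $record (read_records($fh)) {
    my ($header, $body) = @$record;
    my $hits = () = $body =~ /\Q$target\E/g;
    print "$header: $hits\n";
}
```

Because the header comes back as its own string, any information in it can be pulled out with an ordinary regex match after the read, which is where a pattern is allowed.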

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,