remluvr has asked for the wisdom of the Perl Monks concerning the following question:

Hi to everyone.
I guess this is a stupid newbie question, but still, it's driving me mad. I don't know much about regex and I've been googling around like crazy for the past two days.

Suppose I have a sentence like this one:  word1 is made of word2

and one that is like this one:  word1, which is an important thing because of its use and stuff, is made of word2.

I have to match (and extract) everything between word1 and word2, but only when the text in between is no longer than 3 or 4 words, so the first sentence should match and the second one shouldn't. I've been playing around with regexes, but I have no idea how to make this work. Sorry for the crappy example, but I'm Italian and my real data are in Italian. Thanks to everyone who can help me!

Re: Regex Problem
by JavaFan (Canon) on Oct 24, 2008 at 08:39 UTC
    That's a non-trivial problem. And the non-triviality in it lies in the definition of "word". What is a word to you? Is "can't" a word? Is it two words? What about "Smith-Jones"? One word? Or two?

    Now, supposing you have regexes for words and for strings of non-word characters, it becomes easy:

    my $word_pat     = qr /.../;
    my $non_word_pat = qr /.../;
    if (/word1${non_word_pat}(?:${word_pat}${non_word_pat}){0,4}word2/) {
        ... no more than 4 words between word1 and word2 ...
    }
    But you have to create the $word_pat and $non_word_pat patterns yourself, as I do not know what you consider a word, and what not.
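To make the pattern above concrete, here is a minimal sketch that fills in the two placeholders. The definitions are only an assumption for illustration: a word is taken to be a run of Unicode letters, and everything else (spaces, punctuation, digits) is separator material. Under that assumption "can't" counts as two words ("can" and "t"), which may not be what you want. The helper name `between` is hypothetical.

```perl
use strict;
use warnings;

# Assumption: a word is a run of letters, anything else is a separator.
# For accented Italian letters the input must already be decoded to characters.
my $word_pat     = qr/\p{L}+/;
my $non_word_pat = qr/\P{L}+/;

# Hypothetical helper: returns the text between word1 and word2 when at
# most $max words sit between them, otherwise undef.
sub between {
    my ($text, $max) = @_;
    return $text =~ /\bword1(${non_word_pat}(?:${word_pat}${non_word_pat}){0,${max}})word2\b/
        ? $1
        : undef;
}

for my $sentence (
    "word1 is made of word2",
    "word1, which is an important thing because of its use and stuff, is made of word2.",
) {
    my $got = between($sentence, 4);
    print defined $got ? "match: '$got'\n" : "no match\n";
}
```

The capture includes the surrounding separators, so trim it if you only want the words. Changing what counts as a word only means changing the two `qr//` lines.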
      We decided to take words made up of just one actual word. So can't is not a word and neither is Smith-Jones.
      As for what I've tried, it's not much, really:
      \s?word1 [\s \w{1,20}]{0,4} word2
      but I guess I messed up pretty badly..
        So, if "can't" is not a word, then you want a match on:
        word1 one two three can't can't can't can't can't can't word2
        because there are no more than 3 words between word1 and word2?
Re: Regex Problem
by Krambambuli (Curate) on Oct 24, 2008 at 08:58 UTC
    Maybe the following is a bit towards what you want; what exactly is a word would probably need to be specified in more detail:
    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        my $line = $_;
        my $not = '';
        if ($line !~ /word1(([,\s]+\w+)?){1,4}\s+word2/) {
            $not = 'NOT ';
        }
        print "$line \tdoes $not match the regexp.\n";
    }
    __DATA__
    word1 is made of word2
    word1, which is an important thing because of its use and stuff, is made of word2.
    prints out
    word1 is made of word2
         does  match the regexp.
    word1, which is an important thing because of its use and stuff, is made of word2.
         does NOT match the regexp.

    Krambambuli
    ---
Re: Regex Problem
by ccn (Vicar) on Oct 24, 2008 at 08:38 UTC
    something like this:
    $_ = $sentence;
    # print the captured text only when it holds at most 4 whitespace-separated words
    print $1, "\n" if /word1(.+)word2/ && 4 >= @{[split ' ', $1]};
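The same capture-then-count idea, spelled out as a small sketch. It assumes that "word" simply means a whitespace-separated token, so punctuation stuck to a word (or standing alone) is counted too. The helper name `words_between` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: capture everything between word1 and word2, then
# count the whitespace-separated tokens in the capture.
# Returns undef when word1 ... word2 does not occur at all.
sub words_between {
    my ($sentence) = @_;
    return unless $sentence =~ /word1(.+?)word2/;
    my $middle = $1;
    # split ' ' is the awk-style split: it skips leading whitespace, so no
    # empty first field inflates the count (split /\s/ would produce one).
    my @tokens = split ' ', $middle;
    return scalar @tokens;
}

for my $sentence (
    "word1 is made of word2",
    "word1, which is an important thing because of its use and stuff, is made of word2.",
) {
    my $n = words_between($sentence);
    print "$n token(s) between: ", ($n <= 4 ? "keep" : "skip"), "\n";
}
```

Counting after the match keeps the regex trivial, at the cost of a cruder notion of "word" than a dedicated word pattern would give.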
Re: Regex Problem
by parv (Parson) on Oct 24, 2008 at 08:39 UTC
    What did you try?