in reply to Re: Regex matching
in thread Regex matching

I'm very happy to learn, and I did not take what parv said badly. I know the title was wrong and I understand why, because I've spent a lot of time searching this site's archives for a solution to my problem.
I've searched a lot and found this: How do I extract all text between two keywords like start and end? That is exactly what I need to do. The only problem is that I need to extract only the sentences that have just three or four words between those keywords, and not match the others.
Let's say that my keywords are atomo and nucleo. I'd like to extract just this:
atomo/atomo/S-MS  è/essereV-S3IP  composto/comporre/V-MSPR  da/da/E  un/un/RIMS  nucleo/nucleo/S
but not this:
 atomo/atomo/S-MS  non/non/B  era/essereV-S3II  indivisibile/indivisibile/A-NS  ,/,/PU  bensì/bensì/C  a/a/E  sua/suo/A-FS  volta/volta/S-FS  composto/comporre/V-MSPR  da/da/E  particelle/particella/S-FP  più/più/B  piccole/piccolo/A-FP  (/(/PU  alle/a/E-FP  quali/quale/P-NP  ci/ci/PQNP  si/si/PQNN  riferisce/riferireV-S3IP  con/con/E  il/il/RDMS  termine/termine/S-MS  "/"/PU  subatomiche/subatomico/A-FP  "/"/PU  )/)/PU  ././PU  In/in/E  particolare/particolare/S-MS  ,/,/PU  l'/lo/RDNS /atomo/atomo/S-MS  è/essereV-S3IP  composto/comporre/V-MSPR  da/da/E  un/un/RIMS  nucleo/nucleo/S.
I managed (more or less) to do what is suggested in the link I posted above, but I don't know how to tell the regex that I don't want every single sentence that starts with one keyword and ends with the other, but just the sentences that have three or four words between the first and the second.
I really don't know how to explain this in other words, and I'm sorry if my examples aren't clear enough. I'm a newbie and I'm Italian (so my English is far from perfect). Thanks again to everyone.

Re^3: Regex matching
by Cristoforo (Curate) on Nov 04, 2008 at 21:08 UTC
    but I don't know how to tell the regex that I don't want every single sentence that starts with one keyword and ends with the other, but just the sentences that have three or four words between the first and the second.

    To get that result, you could change
    while ($text =~ / $key (.*)? $value /g)

    to

    while ($text =~ /\s$key\s+((?:\S+\s+){0,3}\S+)\s+$value\s/g)

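    This captures 1 to 4 words; for exactly three or four, change {0,3} to {2,3}. (Do you really want to require whitespace before $key and after $value? As written, a key at the very start of the string will not match.)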
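    As a rough sketch of the same idea with the word count made adjustable: the helper name tokens_between, the sample strings, and the choice of full tagged tokens as key and value are all assumptions made for illustration, not taken from your code. It also swaps the literal \s around the keywords for lookarounds, so a keyword at the very start or end of the string still matches.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: collect every run of $min..$max
# whitespace-separated tokens that sits between $key and $value in $text.
# Assumes $min >= 1.
sub tokens_between {
    my ($text, $key, $value, $min, $max) = @_;

    # The trailing \S+ always consumes one token, so the repeated group
    # needs one fewer repetition than the number of tokens wanted.
    my $quant = sprintf '{%d,%d}', $min - 1, $max - 1;

    # (?<!\S) and (?!\S) require a token boundary without consuming a space;
    # \Q...\E keeps characters such as "/" in the keywords literal.
    my $re = qr/(?<!\S)\Q$key\E\s+((?:\S+\s+)$quant\S+)\s+\Q$value\E(?!\S)/;

    my @found;
    while ($text =~ /$re/g) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample data built from tokens in the post (assumption).
my ($key, $value) = ('atomo/atomo/S-MS', 'nucleo/nucleo/S');
my $short = 'atomo/atomo/S-MS composto/comporre/V-MSPR da/da/E un/un/RIMS nucleo/nucleo/S';
my $long  = 'atomo/atomo/S-MS non/non/B era/essereV-S3II indivisibile/indivisibile/A-NS '
          . 'composto/comporre/V-MSPR da/da/E un/un/RIMS nucleo/nucleo/S';

my @short_hits = tokens_between($short, $key, $value, 3, 4);
my @long_hits  = tokens_between($long,  $key, $value, 3, 4);

print "short: @short_hits\n";
print "long matches: ", scalar(@long_hits), "\n";
```

    With three tokens between the keywords, the short string yields one capture; the long one has six tokens in between and yields none.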

    A few more questions.

    • Can you match the same key and value more than once on the same string?
    • Can more than one set of key/value be matched on the same string?
    I ask because that is what your code is doing now in the section
    while (my $text = <$testo>) {
        for my $key (keys %hash) {
            my $value = $hash{$key};
            while ($text =~ / $key (.*)? $value /g) {
                $arrayris[$indice] = $1;
                $indice++;
            }
        }
    }

    If not, you would need to change the inner while loop (the one testing the regular expression, not the one reading the file) to an if statement.
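    A minimal sketch of that if variant, assuming each key/value pair should be captured at most once per line. %hash and @arrayris are the names from your code; the sample pair and the sample line are invented for illustration, and an array of lines stands in for the file read.

```perl
use strict;
use warnings;

# Sample keyword pair and input line (assumptions, not data from the thread).
my %hash  = ('atomo/atomo/S-MS' => 'nucleo/nucleo/S');
my @lines = (
    'atomo/atomo/S-MS da/da/E un/un/RIMS nucleo/nucleo/S e/e/C '
  . 'atomo/atomo/S-MS con/con/E il/il/RDMS nucleo/nucleo/S',
);

my @arrayris;
for my $text (@lines) {
    for my $key (keys %hash) {
        my $value = $hash{$key};

        # "if" without /g: only the first match for this pair on this line
        # is recorded, even though the line contains two candidates.
        if ($text =~ /(?<!\S)\Q$key\E\s+((?:\S+\s+){0,3}\S+)\s+\Q$value\E(?!\S)/) {
            push @arrayris, $1;
        }
    }
}

print scalar(@arrayris), " capture(s): @arrayris\n";
```

    Using push also removes the need to maintain $indice by hand.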

    Just a few questions.
    Chris