Regex matching

b_vulnerability has asked for the wisdom of the Perl Monks concerning the following question:

So, I'm back asking your help. I hope I don't make too many mistakes posting this.
My problem is this: I have a .txt file, which is like this:

L'/lo/RDNS  harmonium/harmonium/S  è/essere/V-S3IP  uno/uno/RIMS  strumento/strumento/S-MS  musicale/musicale/A-NS  azionato/azionare/V-MSPR  con/con/E  una/una/RIFS  tastiera/tastiera/S-FS

which is the result of a POS tagging.
I also have a list of word pairs that I made, which looks like this:

[Nn]ucle[oi]/nucleo/S-MS:[Pp]roton[oi]/protone/S-MP
[Aa]tom[oi]/atomo/S-MS:[Nn]ucle[oi]/nucleo/S-MS

I need something that takes both files as input and prints everything between the first "word" (which is really a string) and the second one.
I tried this code:

#!/usr/bin/perl
use strict;
use warnings;
open my $listaParole,"<File_with_the_words" or die;
 
my %hash;
while (my $line=<$listaParole>) {
chomp $line;
my ($word1, $word2) = split /:/, $line;

$hash{$word1} = $word2;
}
while ( my ($k,$v) = each %hash ) {
    print "Key $k => $v\n";
}

open my $testo, "<File_with_the_text";  
open my $lista_relazioni, ">Output";
my @arrayris =() ;
my $indice=0;

    while (my $text=<$testo>){
     for my $key (keys %hash){
       my $value = $hash{$key};
    while ($text =~ / $key (.*)? $value /g)  {
      
    $arrayris[$indice]=$1;
    $indice++;
      
    
    } 
      
    }  
}
my $indice_controllo=1;
foreach (@arrayris) {
  print $lista_relazioni "($indice_controllo) $_\n";
  $indice_controllo++;
}




close $testo;
close $lista_relazioni;

but obviously the regex ($text =~ / $key (.*)? $value /g) isn't precise enough. I would like it to match only strings that have the first "word", then 3 or 4 other "words", and then the second one. Is that possible? How should I do it? Is the rest of the code any good? Thanks for your help.


Replies are listed 'Best First'.
Re: Regex matching by missingthepoint (Friar) on Nov 04, 2008 at 01:16 UTC
First of all, thanks for posting your code. It's good to see you're trying (and trying to learn). And good on you for `use`-ing `strict` and `warnings`.

"So, I'm back asking your help."

Not a problem - "everyone is a newbie at something".

"I hope I don't make too many mistakes posting this."

I guess you're referring to Re: Perl starter with big problem. Don't take that personally; it's a criticism of your post title, not of you as a person. Do take the criticism on board (i.e. learn from it) - parv had a good reason for saying it: the Monastery is full of very skilled people with many demands on their time. If you make a post, the only way they can judge whether it matches their interest and expertise is by the title. A vague title gives them nothing to go on - every other post in Seekers of Perl Wisdom is by a "Perl starter with a big problem". A better title would have been "How can I replace TAB chars with '/'". (Another reason for avoiding vague titles is that they make it hard for others to search the Monastery for solutions to the same problem you were having.)

It's still hard to figure out exactly what you're after. Can you please post (some of) the contents of the file 'Output' after the script runs, and tell us what's wrong with it?

Also, a few random points about your code:

- You can safely use `my $i` where you used `$indice` and then re-use `$i` (without the `my`) where you used `$indice_controllo` - `$i` is easier to type and just as clear.
- On line 13 you say `my ($k,$v) = each %hash` but then on lines 23 and 24 you say `for my $key (keys %hash) ... my $value = $hash{$key}`. You had it better the first time :)
- You should probably check the return codes of your last two `open`s as well.

HTH, mtp

Indicators of geekdom: You get a kick out of finishing sentences with domain names, because of the syntactic overlap of certain written languages and DNS notation. You understood the previous sentence.
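The last two points of the reply above (checking `open`, and fetching key and value together with `each`) could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names `read_pairs` and `describe_pairs` are made up, and the file is assumed to hold one `word1:word2` pair per line, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): read "word1:word2"
# lines from the file at $path and return them as a hash.
sub read_pairs {
    my ($path) = @_;

    # Three-argument open; on failure, die with the system error in $!.
    open my $fh, '<', $path or die "Cannot open '$path': $!";

    my %pairs;
    while (my $line = <$fh>) {
        chomp $line;
        my ($first, $second) = split /:/, $line, 2;
        next unless defined $first && defined $second;   # skip malformed lines
        $pairs{$first} = $second;
    }
    close $fh;
    return %pairs;
}

# Hypothetical helper: format every pair, fetching key and value together
# with each() instead of keys() plus a second hash lookup.
sub describe_pairs {
    my (%pairs) = @_;
    my @lines;
    while (my ($first, $second) = each %pairs) {
        push @lines, "$first => $second";
    }
    return sort @lines;
}
```

With the file name used in the question, `print "$_\n" for describe_pairs(read_pairs('File_with_the_words'));` would list the pairs, and a missing file would stop the script with a message saying why.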
Re^2: Regex matching by b_vulnerability (Novice) on Nov 04, 2008 at 08:49 UTC
I'm very happy to learn. I have not taken what parv said before badly. I know the title was wrong and I understand the reason, because I've spent a lot of time searching the archives of this site to come up with a solution to my problem. I've searched a lot and found this: How do I extract all text between two keywords like start and end? which is exactly what I need to do. The only problem is: I need to extract only sentences that have just three or four words between said keywords, and not match the others.

Let's say that my keywords are `atomo` and `nucleo`. I'd like to extract just this:

`atomo/atomo/S-MS è/essereV-S3IP composto/comporre/V-MSPR da/da/E un/un/RIMS nucleo/nucleo/S`

but not this:

`atomo/atomo/S-MS non/non/B era/essereV-S3II indivisibile/indivisibile/A-NS ,/,/PU bensì/bensì/C a/a/E sua/suo/A-FS volta/volta/S-FS composto/comporre/V-MSPR da/da/E particelle/particella/S-FP più/più/B piccole/piccolo/A-FP (/(/PU alle/a/E-FP quali/quale/P-NP ci/ci/PQNP si/si/PQNN riferisce/riferireV-S3IP con/con/E il/il/RDMS termine/termine/S-MS "/"/PU subatomiche/subatomico/A-FP "/"/PU )/)/PU ././PU In/in/E particolare/particolare/S-MS ,/,/PU l'/lo/RDNS /atomo/atomo/S-MS è/essereV-S3IP composto/comporre/V-MSPR da/da/E un/un/RIMS nucleo/nucleo/S`

I managed (more or less) to do what is suggested in the link I posted before, but I don't know how to tell the regex that I don't want every single sentence that starts with one keyword and ends with the other, but just sentences that have three or four words between the first and the second. I really don't know how to explain this in other words, and I'm sorry if my examples aren't clear enough. I'm a newbie and I'm Italian (so my English is far from perfect). Thanks again to everyone.
Re^3: Regex matching by Cristoforo (Curate) on Nov 04, 2008 at 21:08 UTC
"but I don't know how to tell the regex that I don't want every single sentence that start with a keyword and end with the other, but just sentences that have three or four words between the first and the second."

To get that result, you could change

`while ($text =~ / $key (.*)? $value /g)`

to

`while ($text =~ /\s$key\s+((?:\S+\s+){0,3}\S+)\s+$value\s/g)`

This captures 1 to 4 words. (Do you really want a space before $key and after $value?)

A few more questions. Can you match the same key and value more than once on the same string? Can more than one key/value pair be matched on the same string? I ask because that is what your code is doing now in the section

`while (my $text=<$testo>){ for my $key (keys %hash){ my $value = $hash{$key}; while ($text =~ / $key (.*)? $value /g) { $arrayris[$indice]=$1; $indice++; } } }`

If not, you would need to change the inner while loop (the one testing the regular expression, not the one reading the file) to an if statement.

Just a few questions. Chris
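As a rough, self-contained sketch of the bounded match suggested above: the helper name `words_between` and the two sample sentences are invented for illustration. It is also a variation on the regex in the reply, not the same pattern. It uses the lookarounds `(?<!\S)` and `(?!\S)` instead of `\s`, so a keyword at the very start or end of a line still counts, and it takes the minimum and maximum number of words as parameters (assumed to be at least 1).

```perl
use strict;
use warnings;

# Hypothetical helper: return the text found between $first and $second
# when they are separated by $min to $max whitespace-delimited tokens.
# $first and $second are treated as regex fragments; assumes $min >= 1.
sub words_between {
    my ($text, $first, $second, $min, $max) = @_;

    # ($min-1 to $max-1 tokens, each followed by whitespace) plus one last token.
    my $gap = sprintf '(?:\S+\s+){%d,%d}\S+', $min - 1, $max - 1;

    # (?<!\S) and (?!\S) require a token boundary without consuming a space.
    my $re = qr/(?<!\S)$first\s+($gap)\s+$second(?!\S)/;

    my @found;
    while ($text =~ /$re/g) {
        push @found, $1;
    }
    return @found;
}

# Patterns taken from the word list in the question.
my $pair_first  = '[Aa]tom[oi]/atomo/S-MS';
my $pair_second = '[Nn]ucle[oi]/nucleo/S-MS';

# Made-up sample sentences: three tokens between the keywords, then seven.
my $near = "l'/lo/RDNS atomo/atomo/S-MS composto/comporre/V-MSPR da/da/E un/un/RIMS nucleo/nucleo/S-MS ././PU";
my $far  = "atomo/atomo/S-MS non/non/B era/essere/V-S3II composto/comporre/V-MSPR da/da/E particelle/particella/S-FP e/e/C un/un/RIMS nucleo/nucleo/S-MS";

# Prints: composto/comporre/V-MSPR da/da/E un/un/RIMS
print "$_\n" for words_between($near, $pair_first, $pair_second, 3, 4);

# Prints nothing: seven tokens is outside the 3-to-4 range.
print "$_\n" for words_between($far, $pair_first, $pair_second, 3, 4);
```

Because the quantifier is bounded, the match cannot stretch across a whole sentence the way `(.*)?` does. Passing `1, 4` instead of `3, 4` gives the same range as the `{0,3}` form in the reply.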
Re: Regex matching by JavaFan (Canon) on Nov 03, 2008 at 16:29 UTC
Could you give an example of your intended result? All you give us is one line from one file, and two lines from the other, with no obvious matches.
Re^2: Regex matching by b_vulnerability (Novice) on Nov 03, 2008 at 16:47 UTC
Oh, you're right. I'm pretty new to PerlMonks (signed in this morning) and I'm not sure what I should or shouldn't do. OK, I'd like to get something like this: word 1 is, say, `[Aa]tom[oi]/atomo/S-MS` and word 2 is `[Nn]ucle[oi]/nucleo/S-MS`. I'd like to have an output like this:

`atomo/S-MS è/essereV-S3IP composto/comporre/V-MSPR da/da/E un/un/RIMS nucleo/nucleo/S-MS`

As you can see, between the first and the second "word" there are just four other "words". Extracting stuff like this is exactly my goal. Thanks again