b_vulnerability has asked for the wisdom of the Perl Monks concerning the following question:

So, I'm back asking your help. I hope I don't make too many mistakes posting this.
My problem is this: I have a .txt file, which is like this:
L'/lo/RDNS harmonium/harmonium/S è/essere/V-S3IP uno/uno/RIMS strumento/strumento/S-MS musicale/musicale/A-NS azionato/azionare/V-MSPR con/con/E una/una/RIFS tastiera/tastiera/S-FS
which is the result of a POS tagging.
I also have a list of word pairs that I made, which looks like this:
[Nn]ucle[oi]/nucleo/S-MS:[Pp]roton[oi]/protone/S-MP
[Aa]tom[oi]/atomo/S-MS:[Nn]ucle[oi]/nucleo/S-MS
I need something that takes both files as input and prints everything between the first "word" (which is really a string) and the second one.
I tried this code:
#!/usr/bin/perl
use strict;
use warnings;

open my $listaParole,"<File_with_the_words" or die;
my %hash;
while (my $line=<$listaParole>) {
   chomp $line;
   my ($word1, $word2) = split /:/, $line;
   $hash{$word1} = $word2;
}

while ( my ($k,$v) = each %hash ) {
   print "Key $k => $v\n";
}

open my $testo, "<File_with_the_text";
open my $lista_relazioni, ">Output";
my @arrayris =() ;
my $indice=0;

while (my $text=<$testo>){
   for my $key (keys %hash){
   my $value = $hash{$key};
   while ($text =~ / $key (.*)? $value /g) {
      $arrayris[$indice]=$1;
      $indice++;
   }
   }
}
my $indice_controllo=1;
foreach (@arrayris) {
   print $lista_relazioni "($indice_controllo) $_\n";
   $indice_controllo++;
}
close $testo;
close $lista_relazioni;
but obviously the regex ($text =~ / $key (.*)? $value /g) isn't very accurate. I would like it to match only strings that have the first "word", then 3 or 4 other "words", and then the second one. Is that possible? How should I do it? Is the rest of the code any good? Thanks for your help.

Replies are listed 'Best First'.
Re: Regex matching
by missingthepoint (Friar) on Nov 04, 2008 at 01:16 UTC

    First of all, thanks for posting your code. It's good to see you're trying (and trying to learn). And good on you for use-ing strict and warnings.

    So, I'm back asking your help.

    Not a problem - "everyone is a newbie at something".

    I hope I don't make too many mistakes posting this.

    I guess you're referring to Re: Perl starter with big problem.. Don't take that personally; it's a criticism of your post title, not you as a person. Do take the criticism on board (i.e. learn from it) - parv had a good reason for saying it: namely, the Monastery is full of very skilled people with many demands on their time. If you make a post, the only way they can judge whether the post aligns with their interest and expertise is by the title. A vague title gives them nothing to go on - every other post in Seekers of Perl Wisdom is by a "Perl starter with a big problem". A better title would have been How can I replace TAB chars with '/'.

    (Another reason for not using vague titles is that it makes it hard for others to search the Monastery for solutions to the same problem you were having.)

    It's still hard to figure out exactly what you're after. Can you please post (some of) the contents of the file 'Output' after the script runs, and tell us what's wrong with it?

    Also, a few random points about your code:

    • You can safely use my $i where you used $indice and then re-use $i (without the my) where you used $indice_controllo - $i is easier to type and just as clear
    • On line 13 you say my ($k,$v) = each %hash but then on lines 23 and 24 you say for my $key (keys %hash) ... my $value = $hash{$key}. You had it better the first time :)
    • Should probably check the return codes of your last two opens as well
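The last two points could be sketched roughly as follows. This is only an illustration: the helper names open_or_die and pairs_as_lines are hypothetical, and the three-argument form of open with $! in the error message is a suggestion that does not appear in the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file and die with the OS error on failure,
# so that every open in the script is checked the same way.
sub open_or_die {
    my ($mode, $path) = @_;
    open my $fh, $mode, $path
        or die "Cannot open '$path' ($mode): $!";
    return $fh;
}

# Hypothetical helper: walk the hash with each(), as in the first loop
# of the script, instead of keys() plus a separate lookup.
sub pairs_as_lines {
    my (%pairs) = @_;
    my @lines;
    while ( my ($k, $v) = each %pairs ) {
        push @lines, "Key $k => $v";
    }
    my @sorted = sort @lines;    # hash order is not predictable
    return @sorted;
}

# Possible usage with the placeholder file names from the post:
#   my $testo           = open_or_die('<', 'File_with_the_text');
#   my $lista_relazioni = open_or_die('>', 'Output');
print "$_\n" for pairs_as_lines( first => 'second' );
```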

    HTH,
    mtp


    Indicators of geekdom:
    • You get a kick out of finishing sentences with domain names, because of the syntactic overlap of certain written languages and DNS notation.
    • You understood the previous sentence.
      I'm very happy to learn. I did not take what parv said badly. I know the title was wrong and I understand the reason, because I've spent a lot of time searching the archives of this site to come up with a solution to my problem.
      I've searched a lot and found this: How do I extract all text between two keywords like start and end? which is exactly what I need to do. The only problem is: I need to extract only sentences that have just three or four words between said keywords, and not match the others.
      Let's say that my keywords are atomo and nucleo. I'd like to extract just this:
      atomo/atomo/S-MS  è/essereV-S3IP  composto/comporre/V-MSPR  da/da/E  un/un/RIMS  nucleo/nucleo/S
      but not this:
       atomo/atomo/S-MS  non/non/B  era/essereV-S3II  indivisibile/indivisibile/A-NS  ,/,/PU  bensì/bensì/C  a/a/E  sua/suo/A-FS  volta/volta/S-FS  composto/comporre/V-MSPR  da/da/E  particelle/particella/S-FP  più/più/B  piccole/piccolo/A-FP  (/(/PU  alle/a/E-FP  quali/quale/P-NP  ci/ci/PQNP  si/si/PQNN  riferisce/riferireV-S3IP  con/con/E  il/il/RDMS  termine/termine/S-MS  "/"/PU  subatomiche/subatomico/A-FP  "/"/PU  )/)/PU  ././PU  In/in/E  particolare/particolare/S-MS  ,/,/PU  l'/lo/RDNS /atomo/atomo/S-MS  è/essereV-S3IP  composto/comporre/V-MSPR  da/da/E  un/un/RIMS  nucleo/nucleo/S.
      I managed (more or less) to do what is suggested in the link I posted above, but I don't know how to tell the regex that I don't want every single sentence that starts with one keyword and ends with the other, but just sentences that have three or four words between the first and the second.
      I really don't know how to explain this in other words, and I'm sorry if my examples aren't clear enough. I'm a newbie and I'm Italian (so my English is far from perfect). Thanks again to everyone.
        but I don't know how to tell the regex that I don't want every single sentence that starts with one keyword and ends with the other, but just sentences that have three or four words between the first and the second.

        To get that result, you could change
        while ($text =~ / $key (.*)? $value /g)

        to

        while ($text =~ /\s$key\s+((?:\S+\s+){0,3}\S+)\s+$value\s/g)

        This captures 1 to 4 words. (Do you really want space before $key and after $value?)
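As a rough sketch of how the quantifier controls the number of words in between: {0,3} followed by one more \S+ allows one to four words, while {2,3} would allow only three or four, which seems to be what was asked for. The sub name words_between is made up for illustration, and the lookarounds (?<!\S) and (?!\S) are swapped in for the literal \s on either side so that a key at the start of a line or a value at the end can still match; the sample sentence is adapted from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return each run of $min..$max whitespace-separated
# tokens found between $first and $second in $text. Assumes $min >= 1.
sub words_between {
    my ($text, $first, $second, $min, $max) = @_;
    my $lo = $min - 1;    # the final \S+ accounts for one token
    my $hi = $max - 1;
    my @found;
    while ( $text =~ /(?<!\S)$first\s+((?:\S+\s+){$lo,$hi}\S+)\s+$second(?!\S)/g ) {
        push @found, $1;
    }
    return @found;
}

my $sentence = "atomo/atomo/S-MS è/essereV-S3IP composto/comporre/V-MSPR "
             . "da/da/E un/un/RIMS nucleo/nucleo/S-MS";

# Four tokens sit between the two keywords, so a 3..4 window matches.
print "$_\n" for words_between(
    $sentence,
    '[Aa]tom[oi]/atomo/S-MS',
    '[Nn]ucle[oi]/nucleo/S-MS',
    3, 4,
);
```

Because the keys and values are interpolated as patterns, the character classes from the word list keep working; the slashes inside them are harmless once they come from a variable.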

        A few questions more.

        • Can you match the same key and value more than once on the same string?
        • Can more than one set of key/value be matched on the same string?
        I ask because that is what your code is doing now in the section
        while (my $text=<$testo>){
           for my $key (keys %hash){
              my $value = $hash{$key};
              while ($text =~ / $key (.*)? $value /g) {
                 $arrayris[$indice]=$1;
                 $indice++;
              }
           }
        }

        If not, you would need to change the while loop (the one testing the regular expression, not the file read) to an if statement.
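The difference between the two forms can be seen in a small sketch. The sub names and the sample string are made up for illustration and are not taken from the script above.

```perl
use strict;
use warnings;

# Collect every match on the line: with /g in a while loop, each test
# resumes the search where the previous match ended.
sub all_matches {
    my ($text, $re) = @_;
    my @found;
    while ( $text =~ /$re/g ) {
        push @found, $1;
    }
    return @found;
}

# Collect at most one match per line: a plain if tests the regex once.
# The /g is dropped so that no match position is remembered.
sub first_match {
    my ($text, $re) = @_;
    my @found;
    if ( $text =~ /$re/ ) {
        push @found, $1;
    }
    return @found;
}

my $re   = qr/\bk=(\w+)/;
my $line = "k=one k=two k=three";
print join(",", all_matches($line, $re)), "\n";   # one,two,three
print join(",", first_match($line, $re)), "\n";   # one
```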

        Just a few questions.
        Chris

Re: Regex matching
by JavaFan (Canon) on Nov 03, 2008 at 16:29 UTC
    Could you give an example of your intended result? All you give us is one line from one file, and two lines from the other, with no obvious matches.
      Oh, you're right. I'm pretty new to PerlMonks (signed in this morning) and I'm not sure what I should or shouldn't do. OK, I'd like to get something like this:
      Word 1 is, say, [Aa]tom[oi]/atomo/S-MS and word 2 is [Nn]ucle[oi]/nucleo/S-MS.
      I'd like to have an output like this: atomo/S-MS  è/essereV-S3IP  composto/comporre/V-MSPR  da/da/E  un/un/RIMS  nucleo/nucleo/S-MS As you can see, between the first and the second "word" there are just four other "words". Extracting stuff like this is exactly my goal.
      Thanks again