hiddenOx has asked for the wisdom of the Perl Monks concerning the following question:

I have the following problem, I am splitting words into pairs of letters. and try to match it up with another word.

elaboration:
current word is zezl which should give ze - ez - zl

and the word that should get matched to is zezezl which will give the following gram to get matches: ze-ez-ze-ez-zl

the problem here that my expression gets match the first ze twice and the ez twice as it appears in the word twice per pair. ALTHOUGH what should happen that every pair should get matched only once....

I believe that there are some arguments that can make the expression match every pair only once, I tried the \1 to eat up the matched expression but the problem is that if the word has 2 pairs that are the same like ze-ez-ze it will count {ze} only once..

Here is the code:

I really really appreciate any of you help.. I've ran out of ideas and really need your help

BETTER ELABORATION
I will try to elaborate:
keyword:zezlze --> ze ez zl lz ze
dicWord:zezezd --> ze ez ze ez zd

so the score of this part should be 3 as the ze appears twice and each one is matched with the corresponding and the ez is matched once ONLY as second ez the dictionary word hasn't found another match in the keyword

I hope that I explained well, if not, Please let me know.

UPDATED COMPILING CODE:
use strict; use warnings; use re 'eval'; print "Enter word: "; #chomp(my $word = <STDIN>); my $word = "zek"; #read the string and trip it into 2 characters NB: /g to repeat operat +ion at same string my @pairs = $word =~ /(?=(..))/g; print(" \"@pairs\" \n"); #qr to make an expression \1 at the end eats matched! #consider grep perl #my $matcher = qr/(?=(?s-i)(@{[join "|", @pairs]}))/; #my $matcher = qr/(?=(@{[join "|", @pairs]}))/; local $" = q{|}; my $matcher = qr{(?=(@pairs))}; print(" \"$matcher\" \n"); my %coef; open (my $dict, "dictionary.txt") || die "cant open dictionary.txt\n"; my $dictword = "zezezl"; my $matches = () = $dictword =~ /$matcher/g; (print"$_")yes-pattern/g; my $totalz = "$matches $dictword \n"; print $totalz; my $coef = 2 * $matches / (@pairs + length($dictword)-1); print ("coeffesion score is $coef");
Thank you guys

Replies are listed 'Best First'.
Re: force regular expression to match every item only once
by mscharrer (Hermit) on Apr 20, 2008 at 20:39 UTC
    Your code doesn't compile. Please make always sure to post functional code.
    The \1 wouldn't work here, at least I can't think how to use it between different regex calls ala / /g where you apply one regex several times. \1 can be used inside one regex to refer to an already matched part, but you actually don't do this here.
    There is the experimental (?{ code }) construct which allows you to execute code inside the regex where you maybe could check if it already matched.

    One better way would be to first match all patterns multiple times and then filter the list using a hash:

    use strict; use warnings; my $word = "zezl"; my $dictword = "zezezl"; my @pairs = $word =~ /(?=(..))/g; my $matcher = qr/(?=(@{[join "|", @pairs]}))/; my %seen; my $matches = 0; my @matches; foreach my $match ($dictword =~ /$matcher/g) { if (not exists $seen{$match}) { $matches ++; push @matches, $match; } $seen{$match} = 1; } print "Matches: @matches\n"; print "$matches $dictword \n";
    Here we iterate over the results and mark all matches as seen, but only count them the first time. In addition when you need the list of real matches you can push the first matches in an array like shown above.
      The main idea is to get every pair to get matched only once BUT if there are two pairs identical in the keyword and in the dictionary there are 2 identical keywords too, then every keyword will get matched once..

      I will try to elaborate:
      keyword:zezlze --> ze ez zl lz ze
      dicWord:zezezd --> ze ez ze ez zd

      so the score of this part should be 3 as the ze appears twice and each one is matched with the corresponding and the ez is matched once ONLY as second ez the dictionary word hasn't found another match in the keyword

      I hope that I explained well, if not, Please let me know.

      Thank you very much for your help
        Based on my last code, you could add a hash with the number of occurrence of every pair and then check if this pair was already matched this often:
        use strict; use warnings; my $word = "zezlze"; my $dictword = "zezezd"; my @pairs = $word =~ /(?=(..))/g; my %count; $count{$_}++ foreach (@pairs); my $matcher = qr/(?=(@{[join "|", @pairs]}))/; my %seen; my $matches = 0; my @matches; foreach my $match ($dictword =~ /$matcher/g) { if ($seen{$match}++ < $count{$match}) { $matches ++; push @matches, $match; } } print "Matches: @matches\n"; print "$matches $dictword \n";
        results in
        Matches: ze ez ze 3 zezezd
Re: force regular expression to match every item only once
by pKai (Priest) on Apr 20, 2008 at 20:03 UTC

    I'm not exactly sure what you want to achieve, but from what I read I would think you want matching pairs in $matcher instead of zero-length assertions, as in:

    my $matcher = qr/(?:(@{[join "|", @pairs]}))/;

    Which gives 3 matches instead of 5 (likewise for $dictword = "zeezze";.)

    If I' completely wrong, please clean the syntax/stricture errors in your OP first.

      I've updated the code and the current one is compiling right, sorry but didn't have much sleep. The problem still exist actually as I want it when it when one pair get matched once, it can't get matched again so for the current one, it should give only score 2 (ze and ez gets matched once each) and not 4 where it match each pair twice.
Re: force regular expression to match every item only once
by FunkyMonk (Bishop) on Apr 20, 2008 at 20:29 UTC
    It may help to look at dice's coefficient, where a very similar (identical?) problem was discussed.


    Unless I state otherwise, my code all runs with strict and warnings
      The example here has the same fault, it supported me but this problem still exist..

      Hope you could help ....
Re: force regular expression to match every item only once
by johngg (Canon) on Apr 20, 2008 at 22:11 UTC
    Not solving your problem but you could avoid the @{[join "|", @pairs]} by using the fact that regexen act like double quotes when interpolating arrays and changing the default list separator ($").

    my $matcher = do { local $" = q{|}; qr{(?=(@pairs))}; };

    Also, you can avoid having to escape the double quotes in your print statements when you want to interpolate by using quoting constructs (qq{...} for double quotes).

    $ perl -e ' > @arr = qw{a b c}; > print qq{"@arr"\n};' "a b c" $

    I hope these points are useful.

    Cheers,

    JohnGG

Re: force regular expression to match every item only once
by ikegami (Patriarch) on Apr 21, 2008 at 05:28 UTC

    will try to elaborate:
    keyword:zezlze --> ze ez zl lz ze
    dicWord:zezezd --> ze ez ze ez zd
    so the score of this part should be 3 as the ze appears twice and each one is matched with the corresponding and the ez is matched once ONLY as second ez the dictionary word hasn't found another match in the keyword

    Here are two ways:

    my @kw = sort qw( ze ez zl lz ze ); my @dw = sort qw( ze ez ze ez zd ); my $c = 0; my $kw_i = 0; my $dw_i = 0; while ($kw_i < @kw && $dw_i < @dw) { if ($kw[$kw_i] lt $dw[$dw_i]) { ++$kw_i; next; } if ($dw[$dw_i] lt $kw[$kw_i]) { ++$dw_i; next; } ++$c; ++$kw_i; ++$dw_i; }
    my $dw = ' ' . join(' ', qw( ze ez ze ez zd )); my @kw = qw( ze ez zl lz ze ); my $c = 0; my $dw_ = $dw; for my $kw_part (@kw) { my $pos = index($dw_, " $kw_part"); next if $pos < 0; substr($dw_, $pos+1, 2, ''); ++c; }

    Untested.
    I made them non-destructive.

Re: force regular expression to match every item only once
by GrandFather (Saint) on Apr 21, 2008 at 05:06 UTC

    It may be a small modification to the code I provided for dice's coefficient is what you are after. Consider: (note the two lines with the trailing ##)

    use strict; use warnings; use List::Compare; my @words = qw(dictate world mamal zezl); my %dict; # Build a lookup for the dictionary words while (defined (my $word = <DATA>)) { chomp $word; next unless length $word; my %dup; my @bigrams = grep {! $dup{$_}++} $word =~ /(..)/g; ## next unless @bigrams; $dict{$word} = \@bigrams; } # Process the given words for my $word (@words) { my %dup; my @bigrams = grep {! $dup{$_}++} $word =~ /(..)/g; ## next unless @bigrams; for my $dictWord (keys %dict) { my $lc = List::Compare->new($dict{$dictWord}, \@bigrams); my @common = $lc->get_intersection (); my $diceCoef = 2 * @common / (@bigrams + @{$dict{$dictWord}}); next unless $diceCoef; print "Dice coefficient for '$word' and '$dictWord' is $diceCo +ef\n"; } } __DATA__ words zezezl

    Prints:

    Dice coefficient for 'world' and 'words' is 0.5 Dice coefficient for 'zezl' and 'zezezl' is 1

    Perl is environmentally friendly - it saves trees