Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have the code below, which looks up the index of each word in my dictionary and replaces that word in the input file (the corpus) with its index. The main problem is that for multi-word terms such as cross-reference the output is something like 232-8574, even though the term cross-reference is in the dictionary with its own unique index. How can I avoid this?
#!/usr/bin/perl
use warnings;
use strict;

open (DATA, "dictionary") || die "Error opening the input file\n";
print "Reading mapping file\n";
print "----------------------------\n";
open (INFILE, "corpus.txt") || die "Error opening the input file\n";
print "Reading input file\n";
print "----------------------------\n";

my %dict;
while ( <DATA> ) {
    my ( $key, $val ) = /^(\d+)\s+(\w+)/;
    $val = lc($val);
    $dict{ $val } = $key;
}

my $cc = join '', keys %dict;
my ( $min ) = my ( $max ) = map length, keys %dict;
for ( map length, keys %dict ) {
    $min = $_ if $min > $_;
    $max = $_ if $max < $_;
}

my $pattern = qr/\b([$cc]{$min,$max})\b/;
while (my $line = <INFILE>) {
    my $old_string = $line;
    ( my $new_string = $old_string ) =~ s/$pattern/ exists $dict{ $1 } ? $dict{ $1 } : $1 /eg;
    print "$new_string";
}
close (INFILE);
close (DATA);
__DATA__
1 cross
2 reference
3 cross-reference
__INFILE__
cross-reference
__OUTPUT__
3-2
__EXPECTED-OUTPUT__
3

Replies are listed 'Best First'.
Re: search and replace
by lakshmananindia (Chaplain) on Apr 03, 2009 at 12:50 UTC

    You have to change your regular expression here

    while ( <DATA> ) {
        my ( $key, $val ) = /^(\d+)\s+(\w+)/;
        $val = lc($val);
        $dict{ $val } = $key;
    }

    Try dumping the hash with print Dumper \%dict (from Data::Dumper) and you will find the mistake.

    When I printed $key and $val, the result was as follows:

    my %dict;
    while ( <DATA> ) {
        my ( $key, $val ) = /^(\d+)\s+(\w+)/;
        print "$key\t$val\n";
        $val = lc($val);
        $dict{ $val } = $key;
    }
    __END__
    1    cross
    2    reference
    3    cross

    So $dict{cross} => 1 gets overwritten: \w+ stops at the hyphen, so the third line is stored under the key cross as well.
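
    To make the overwrite visible, here is a minimal sketch that builds the hash twice from the sample dictionary, once capturing with \w+ and once with \S+. The helper name build_dict and the in-line sample lines are assumptions made for illustration; they are not part of the original script.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: build the word => index hash from dictionary
    # lines, using whichever capture pattern is passed in.
    sub build_dict {
        my ($word_re, @lines) = @_;
        my %dict;
        for my $line (@lines) {
            # Skip lines that do not look like "index word".
            my ($key, $val) = $line =~ /^(\d+)\s+($word_re)/ or next;
            $dict{ lc $val } = $key;
        }
        return %dict;
    }

    # Sample dictionary lines taken from the original post.
    my @lines = ("1 cross\n", "2 reference\n", "3 cross-reference\n");

    # \w+ stops at the hyphen, so line 3 is stored under "cross" again.
    my %with_w = build_dict(qr/\w+/, @lines);

    # \S+ takes everything up to the next whitespace, hyphen included.
    my %with_S = build_dict(qr/\S+/, @lines);

    printf "%s => %s\n", $_, $with_w{$_} for sort keys %with_w;
    print "--\n";
    printf "%s => %s\n", $_, $with_S{$_} for sort keys %with_S;
    ```

    With \w+ the hash ends up with only two keys, cross and reference; with \S+ it has all three terms.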

    --Lakshmanan G.

    The great pleasure in my life is doing what people say you cannot do.


Re: search and replace
by almut (Canon) on Apr 03, 2009 at 13:04 UTC

    The next thing that's a little curious is the regex you compile for the substitution.  With the other regex fixed (e.g. my ( $key, $val ) = /^(\d+)\s+(\S+)/), you'd have:

    ([cross-referencereferencecross]{5,15})

    The character class [...] is almost certainly not what you want.

    Anyway, why not simply do a hash lookup using the value of the (chomp'ed) $line, such as "cross-reference"?  Something like

    while (my $line = <INFILE>) {
        chomp($line);
        print $dict{$line} || $line, "\n";
    }
      I modified the %dict regex as suggested and that part now works, but I still get an error like this:
      Invalid [] range "s-r" in regex; marked by <-- HERE in m/\b([cross-r <-- HERE eferencereferencecross]{5,15})\b/

        That's because of the curious character class :)

        Within a character class you can use "-" to declare ranges, such as [a-z], but as "s" comes later in the alphabet than "r" (somewhat simplified, but I won't go into Unicode, locales and stuff here...), the range is invalid.

        I'm not going to elaborate on how to fix this, because I'm not yet convinced that you need this regex at all...
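
        For readers who do end up wanting a compiled pattern, one possible approach (not taken from this thread) is to replace the character class with an alternation of the literal dictionary words. The sketch below assumes the sample dictionary from the original post; build_pattern and replace_words are hypothetical helper names.

        ```perl
        use strict;
        use warnings;

        # Sample dictionary from the original post (word => index).
        my %dict = (cross => 1, reference => 2, 'cross-reference' => 3);

        # Hypothetical helper: one alternation of literal words instead of a
        # character class. quotemeta escapes "-" and other metacharacters;
        # sorting longest first lets "cross-reference" win over "cross".
        sub build_pattern {
            my ($dict) = @_;
            my $alternation = join '|',
                map  { quotemeta($_) }
                sort { length $b <=> length $a || $a cmp $b }
                keys %$dict;
            return qr/\b($alternation)\b/;
        }

        # Hypothetical helper: replace each matched word with its index.
        sub replace_words {
            my ($line, $dict, $pattern) = @_;
            $line =~ s/$pattern/$dict->{$1}/g;
            return $line;
        }

        my $pattern = build_pattern(\%dict);

        # prints "a 3 to the 2"
        print replace_words("a cross-reference to the reference\n", \%dict, $pattern);
        ```

        Because every alternative is a dictionary key, no exists check is needed in the replacement. The \b anchors are kept from the original pattern; whether they are the right word boundary for a given corpus is a separate question.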

      The main purpose is to find all words that are separated by space in my INFILE and replace them with their index. Your method would only work if there were one word per line.
        ...all words that are separated by space...

        Ah... something you didn't mention in your OP.

        In this case, you could do:

        while (my $line = <INFILE>) {
            $line =~ s/(\S+)/$dict{$1} || $1/eg;
            print $line;
        }

        In case zero is a valid index, too, you'd need

        $line =~ s/(\S+)/defined $dict{$1} ? $dict{$1} : $1/eg;

        Or, if you're using Perl 5.10 or later, you could write instead

        $line =~ s|(\S+)|$dict{$1} // $1|eg;
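
        Putting the pieces of this thread together, a minimal end-to-end sketch might look like the following. It assumes the \S+ fix for the dictionary regex and the token-wise substitution above; the helper names load_dict and index_line, the in-memory filehandles standing in for the "dictionary" and "corpus.txt" files, and the extra "0 the" entry are assumptions made for illustration.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: read "index word" pairs from an open filehandle.
        sub load_dict {
            my ($fh) = @_;
            my %dict;
            while (my $line = <$fh>) {
                # \S+ keeps hyphenated terms such as cross-reference intact.
                my ($index, $word) = $line =~ /^(\d+)\s+(\S+)/ or next;
                $dict{ lc $word } = $index;
            }
            return \%dict;
        }

        # Hypothetical helper: replace every whitespace-delimited token found
        # in the dictionary. exists keeps an index of 0 working and also runs
        # on Perls older than 5.10.
        sub index_line {
            my ($line, $dict) = @_;
            $line =~ s/(\S+)/exists $dict->{$1} ? $dict->{$1} : $1/eg;
            return $line;
        }

        # In-memory stand-ins for the two input files (sample data only).
        my $dict_text   = "0 the\n1 cross\n2 reference\n3 cross-reference\n";
        my $corpus_text = "see the cross-reference\nreference unknown cross\n";

        open my $dict_fh, '<', \$dict_text
            or die "Cannot open in-memory dictionary: $!";
        my $dict = load_dict($dict_fh);
        close $dict_fh;

        open my $corpus_fh, '<', \$corpus_text
            or die "Cannot open in-memory corpus: $!";
        while (my $line = <$corpus_fh>) {
            print index_line($line, $dict);
        }
        close $corpus_fh;
        ```

        Note that, as in the original script, only the dictionary words are lowercased; corpus tokens are looked up exactly as they appear.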