search and replace

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, my code below looks up the index of each word in my dictionary and replaces that word in the input file (corpus) with its index. The main problem is with multi-word terms such as cross-reference: the output is something like 232-8574, even though the term cross-reference is in the dictionary with its own unique index. How can I avoid this?

#!/usr/bin/perl
use warnings;
use strict;

open (DATA, "dictionary") || die "Error opening the dictionary file\n"; 
print "Reading mapping file\n";
print "----------------------------\n";
    
open (INFILE, "corpus.txt") || die "Error opening the input file\n"; 
print "Reading input file\n";
print "----------------------------\n";

my %dict;
while ( <DATA> ) {

    my ( $key, $val ) = /^(\d+)\s+(\w+)/;
    $val = lc($val);
    $dict{ $val } = $key;
    }

my $cc = join '', keys %dict;
my ( $min ) = my ( $max ) = map length, keys %dict;

for ( map length, keys %dict ) {
    $min = $_ if $min > $_;
    $max = $_ if $max < $_;
    }

my $pattern = qr/\b([$cc]{$min,$max})\b/;


while (my $line = <INFILE>) {
my $old_string = $line;


( my $new_string = $old_string ) =~ s/$pattern/ exists $dict{ $1 } ? $dict{ $1 } : $1 /eg;
print "$new_string";

}
close (INFILE);
close (DATA);

__DATA__
1 cross
2 reference
3 cross-reference

__INFILE__
cross-reference

__OUTPUT__
3-2

__EXPECTED-OUTPUT__
3

Re: search and replace by lakshmananindia (Chaplain) on Apr 03, 2009 at 12:50 UTC
You have to change your regular expression here: `while ( <DATA> ) { my ( $key, $val ) = /^(\d+)\s+(\w+)/; $val = lc($val); $dict{ $val } = $key; }` Try `print Dumper \%dict` (after `use Data::Dumper;`) and you will find the mistake. When I printed `$key` and `$val`, the result was as follows: `my %dict; while ( <DATA> ) { my ( $key, $val ) = /^(\d+)\s+(\w+)/; print "$key\t$val\n"; $val = lc($val); $dict{ $val } = $key; } __END__ 1 cross 2 reference 3 cross` Because `\w+` stops at the hyphen, the third entry is read as `cross`, so `$dict{cross} = 1` gets overwritten with 3.
--Lakshmanan G. The great pleasure in my life is doing what people say you cannot do.
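The overwrite described above can be reproduced without any files. The following is only a minimal sketch: `load_dict` is a hypothetical helper introduced for illustration, and the three dictionary lines are copied from the sample data in the question. It loads the same lines once with `\w+` and once with `\S+` so the two resulting hashes can be compared.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a word => index hash from dictionary lines,
# capturing the word with the pattern passed in as $word_re.
sub load_dict {
    my ( $word_re, @lines ) = @_;
    my %dict;
    for my $line (@lines) {
        # Skip lines that do not look like "<index> <word>".
        my ( $key, $val ) = $line =~ /^(\d+)\s+($word_re)/ or next;
        $dict{ lc $val } = $key;
    }
    return %dict;
}

# Sample dictionary lines from the question.
my @lines = ( "1 cross\n", "2 reference\n", "3 cross-reference\n" );

my %narrow = load_dict( qr/\w+/, @lines );    # \w+ stops at the hyphen
my %wide   = load_dict( qr/\S+/, @lines );    # \S+ keeps the hyphenated term

print "with \\w+: $_ => $narrow{$_}\n" for sort keys %narrow;
print "with \\S+: $_ => $wide{$_}\n"   for sort keys %wide;
```

With `\w+` the hash ends up with only two keys, and `cross` carries the index that was meant for `cross-reference`; with `\S+` all three terms survive.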
Re: search and replace by almut (Canon) on Apr 03, 2009 at 13:04 UTC
The next thing that's a little curious is the regex you compile for the substitution. With the other regex fixed (e.g. `my ( $key, $val ) = /^(\d+)\s+(\S+)/`), you'd have: `([cross-referencereferencecross]{5,15})` The character class `[...]` is almost certainly not what you want. Anyway, why not simply do a hash lookup using the value of the (`chomp`'ed) `$line`, such as `"cross-reference"`? Something like `while (my $line = <INFILE>) { chomp($line); print $dict{$line} || $line, "\n"; }`
Re^2: search and replace by Anonymous Monk on Apr 03, 2009 at 13:57 UTC
I modified the %dict regexp like this and now it's fine, but I still get an error like this: `Invalid [] range "s-r" in regex; marked by <-- HERE in m/\b([cross-r <-- HERE eferencereferencecross]{5,15})\b/`
Re^3: search and replace by almut (Canon) on Apr 03, 2009 at 14:27 UTC
That's because of the curious character class :) Within a character class you can use `"-"` to declare ranges, such as `[a-z]`, but as `"s"` is later in the alphabet than `"r"` (somewhat simplified, but I won't go into Unicode, locales and stuff here...), the range is invalid. I'm not going to elaborate on how to fix this, because I'm not yet convinced that you need this regex at all...
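For readers who do want a single compiled pattern, one common alternative to a character class is an alternation of the escaped dictionary terms. This is not what almut proposes, only a sketch under stated assumptions: the `%dict` contents are taken from the sample data in the question, and the sample line and the variable names `$alternation` and `$new` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Dictionary as it would look once hyphenated terms are loaded correctly.
my %dict = ( 'cross' => 1, 'reference' => 2, 'cross-reference' => 3 );

# Longest terms first, so "cross-reference" is tried before "cross".
# quotemeta escapes "-" and any other regex metacharacters in the terms.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %dict;

# Lookarounds stand in for \b: a term only matches when it is delimited
# by whitespace or by the start/end of the string.
my $pattern = qr/(?<!\S)($alternation)(?!\S)/;

# Illustrative input line (not from the original post).
my $line = "a cross-reference is not a cross or a reference\n";
( my $new = $line ) =~ s/$pattern/$dict{$1}/g;

print $new;    # prints "a 3 is not a 1 or a 2"
```

Because every alternative is a literal dictionary key, the captured text is always present in `%dict`, so no `exists` check is needed inside the substitution.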
Re^2: search and replace by Anonymous Monk on Apr 03, 2009 at 14:10 UTC
The main purpose is to find all words that are separated by spaces in my INFILE and replace them with their index. Your method would only work if there is one word per line.
Re^3: search and replace by almut (Canon) on Apr 03, 2009 at 14:21 UTC
"...all words that are separated by space..." Ah... something you didn't mention in your OP. In this case, you could do: `while (my $line = <INFILE>) { $line =~ s/(\S+)/$dict{$1} || $1/eg; print $line; }` In case zero is a valid index, too, you'd need `$line =~ s/(\S+)/defined $dict{$1} ? $dict{$1} : $1/eg;` Or, if you're using Perl 5.10 or later, you could write instead `$line =~ s|(\S+)|$dict{$1} // $1|eg;`
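Putting the pieces of this thread together, a self-contained sketch of the whole task might look like the following. It makes several assumptions that are not in the original posts: the dictionary and corpus are held in in-memory strings (`$dict_text`, `$corpus_text`) rather than in the files `dictionary` and `corpus.txt`, the corpus text is invented, and lexical filehandles with three-argument `open` replace the bareword handles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In-memory stand-ins for the "dictionary" and "corpus.txt" files.
my $dict_text   = "1 cross\n2 reference\n3 cross-reference\n";
my $corpus_text = "a cross-reference links one reference to another\ncross\n";

# Load the dictionary: word => index.
open my $dict_fh, '<', \$dict_text or die "Cannot open dictionary: $!\n";
my %dict;
while ( my $entry = <$dict_fh> ) {
    # \S+ keeps hyphenated terms whole; skip lines that do not match.
    my ( $index, $term ) = $entry =~ /^(\d+)\s+(\S+)/ or next;
    $dict{ lc $term } = $index;
}
close $dict_fh;

# Rewrite the corpus, one whitespace-delimited token at a time.
open my $in_fh, '<', \$corpus_text or die "Cannot open corpus: $!\n";
my $output = '';
while ( my $line = <$in_fh> ) {
    # exists (rather than ||) keeps a zero index from being ignored.
    $line =~ s/(\S+)/exists $dict{$1} ? $dict{$1} : $1/eg;
    $output .= $line;
}
close $in_fh;

print $output;
```

For the sample strings above this prints `a 3 links one 2 to another` followed by `1`. Note that the lookup is case-sensitive on the corpus side, as in the original code: dictionary terms are lowercased, input tokens are not.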