Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I would like your ideas for an efficient solution to my problem. I have a large text file and a dictionary of terms, each with an index number. I want to find the most efficient way to replace every term in the text file with its index number from the dictionary. For example:
dictionary:
1 dog
2 cat
3 chased
4 the
text file: the dog chased my cat.
output: 4 1 3 my 2.
In my current approach I read the dictionary into two arrays and then search the arrays for each word to replace. Is there a better solution? I'd appreciate your help.

Replies are listed 'Best First'.
Re: search and replace
by bellaire (Hermit) on Mar 17, 2009 at 11:28 UTC
    Just as a piece of general advice: Rather than two arrays, you should use an associative array, also known as a hash. The keys of the hash should be the words and the values should be the index numbers with which to replace them.
      If the index is just a running number, use a single array: $dict[$column1] = $column2.
      BTW, replacement then becomes easy: check exists/defined $dict[$number] and then replace.
      UPDATE:
      I misunderstood the question, pardon me.
      It is solvable using a hash.

      Vivek
      -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
        This wouldn't work as well because he needs to find the words in his string, and use the numbers to replace them. That is, the index should be the data from the input file, which is words, not numbers. Using the numbers as the index would force him to search the entire list for each word which needs to be replaced (although it would save him one array compared with his current two array approach). Using the words themselves as hash keys allows him to do a simple lookup.
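A minimal sketch of the hash lookup described above, using the example from the question. The helper names load_dict and replace_terms are hypothetical, and the dictionary is assumed to hold one "index word" pair per line:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a word => index hash from "index word" lines.
sub load_dict {
    my @lines = @_;
    my %dict;
    for my $line (@lines) {
        # Each line is assumed to look like "1 dog"; skip anything else.
        my ( $index, $word ) = $line =~ /^\s*(\d+)\s+(\S+)/ or next;
        $dict{$word} = $index;
    }
    return \%dict;
}

# Hypothetical helper: replace every word that is a key of the hash.
sub replace_terms {
    my ( $dict, $text ) = @_;
    # One pass over the text; each word costs a single hash lookup.
    $text =~ s/(\w+)/exists $dict->{$1} ? $dict->{$1} : $1/ge;
    return $text;
}

my $dict = load_dict( '1 dog', '2 cat', '3 chased', '4 the' );
print replace_terms( $dict, 'the dog chased my cat.' ), "\n";
# prints "4 1 3 my 2."
```

Because the words are the keys, the dictionary is never scanned; words that are not in the hash (such as "my") pass through unchanged.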
Re: search and replace
by VinsWorldcom (Prior) on Mar 17, 2009 at 12:46 UTC

    I don't know if my approach is any better, but I had the same issue many years ago. I had a tab-delimited file that needed search and replace for multiple terms in multiple columns. Instead of running a search and replace in Excel over and over for each item, I wrote a Perl script to read a mapping file (search term \t replace term) and do a "single pass" global search and replace.

    I've updated it over the years to handle more than just columns. You may find it useful. It's in the Code area:

    MSAR.pl

    UPDATE:

    {C} > more in.txt
    the dog chased my cat.

    {C} > more mapping.txt
    dog 1
    cat 2
    chased 3
    the 4

    {C} > msar in.txt mapping.txt
    Reading mappings from file: mapping.txt
    ----------------------------
    4 1 3 my 2.
    ----------------------------
    Mapped 4 entries.

    {C} >
      I found a problem in your code. With a text like this:
      april
      barrel
      and a dictionary like this:
      225 April
      1168 barrel
      3143 Il
      9432 PR
      ....
      I get this output from my input:
      a94323143
      11777rr340
      I don't know why april gets broken into a + PR + Il ...

        It happens because the map file is read into a hash, and a hash normally has no "order". Thus, you can't guarantee that the search and replace will happen in the order given in your map file.

        You're actually hitting this part of the code:

        # user didn't specify columns, so just SAR each line and leave alone
        } else {
            # loop through mapping array for each line
            foreach my $replace (keys(%map)) {
                # ignore case?
                if (defined($opt_ignore)) {
                    $YESMapping += ($_ =~ s/$replace/$map{$replace}/gi)
                } else {
                    $YESMapping += ($_ =~ s/$replace/$map{$replace}/g)
                }
            }
            print $OUT $_
        }

        Stick in a helpful print to "debug" what's going on:

        # user didn't specify columns, so just SAR each line and leave alone
        } else {
            # loop through mapping array for each line
            foreach my $replace (keys(%map)) {
                # ignore case?
                if (defined($opt_ignore)) {
                    print "SAR on $_ with $replace\n";
                    $YESMapping += ($_ =~ s/$replace/$map{$replace}/gi)
                } else {
                    $YESMapping += ($_ =~ s/$replace/$map{$replace}/g)
                }
            }
            print $OUT $_
        }

        This is what we see using your input and mapping files:

        {C} > msar input.txt map.txt -r -i
        Reading mappings from file: map.txt
        ----------------------------
        SAR on april with il
        SAR on apr3143 with barrel
        SAR on apr3143 with april
        SAR on apr3143 with pr
        a94323143
        SAR on barrel with il
        SAR on barrel with barrel
        SAR on 1168 with april
        SAR on 1168 with pr
        1168
        ----------------------------
        Mapped 3 entries.

        You could maybe fix it by adding in Tie::Hash (I think) which is supposed to be able to order your hash. You would need to manipulate the hash variable %map when it is loaded at the beginning of the program. Unfortunately, I don't have the time now to code this up, but hey, my Perl code is "open source" :-) so have at it!

        UPDATE: If your infile is just the one column of words, call with:

        {C} > msar input.txt map.txt -r -i -c 1

        {C} > msar.pl in.txt map.txt -i -r -c 1
        Reading mappings from file: map.txt
        ----------------------------
        225
        1168
        ----------------------------
        Mapped 2 entries.

        UPDATE: MSAR.pl code now updated to use -w option which replaces on WHOLE WORDS only. Also, map.txt file will be read AND parsed AND used in search and replace in the order it is written (line 1, line 2 ... line n).
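The substring problem seen above ("il" matching inside "april") can also be sidestepped by matching whole words only. The following is a sketch of that idea, not the MSAR.pl implementation: the helper names build_pattern and replace_whole_words are hypothetical, and the map keys are assumed to be stored in lower case so that a case-insensitive match can look them up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: combine all search terms into one regex.
sub build_pattern {
    my ($map) = @_;
    # Longest terms first, so a longer term wins over a shorter one
    # that starts at the same position.
    my @terms = sort { length($b) <=> length($a) || $a cmp $b } keys %$map;
    my $alt = join '|', map { quotemeta($_) } @terms;
    # \b on both sides restricts matches to whole words; /i ignores case.
    return qr/\b($alt)\b/i;
}

# Hypothetical helper: replace whole-word matches in a single pass.
sub replace_whole_words {
    my ( $map, $line ) = @_;
    return $line unless %$map;    # nothing to replace with an empty map
    my $pattern = build_pattern($map);
    $line =~ s/$pattern/$map->{ lc $1 }/ge;
    return $line;
}

# Keys are lower-cased versions of the dictionary entries shown earlier.
my %map = ( april => 225, barrel => 1168, il => 3143, pr => 9432 );
print replace_whole_words( \%map, $_ ), "\n" for 'april', 'barrel', 'PR il';
```

Since each line is scanned once against a single pattern, already-replaced text is never searched again, so the order of the hash keys no longer affects the result.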

      Thanks ... nice work :)

        I just realized that I have your "mapping file" backwards, so I added a "-r" (reverse) option so you can keep your mapping file the way you have it and still use the program. I just uploaded the new code about 5 mins ago (called version 1.31 dated 17 MAR 2009) so check that out if you haven't already.

Re: search and replace
by bichonfrise74 (Vicar) on Mar 17, 2009 at 17:03 UTC
    What do you think of this code?

    #!/usr/bin/perl
    use strict;

    my $old_string = "the dog chased my cat.";
    my ($new_string, $found_word);
    my %dict;

    while( <DATA> ) {
        my ($key, $val) = $_ =~ /^(\d)\s(\w+)/;
        $dict{$val} = $key;
    }

    chop( $old_string );
    my @words = split( " ", $old_string);

    foreach my $i ( @words ) {
        foreach my $j ( keys %dict ) {
            if ( $j eq $i ) {
                $new_string = $new_string . "$dict{$j} ";
                $found_word++;
            }
        }
        $new_string = $new_string . "$i " if ($found_word == 0);
        $found_word = 0;
    }

    chop( $new_string );
    print "$new_string.";

    __DATA__
    1 dog
    2 cat
    3 chased
    4 the

      How about like this:

      #!/usr/bin/perl
      use warnings;
      use strict;

      my $old_string = 'the dog chased my cat.';

      my %dict;
      while ( <DATA> ) {
          my ( $key, $val ) = /^(\d+)\s+(\w+)/;
          $dict{ $val } = $key;
      }

      my $cc = join '', keys %dict;
      my ( $min ) = my ( $max ) = map length, keys %dict;
      for ( map length, keys %dict ) {
          $min = $_ if $min > $_;
          $max = $_ if $max < $_;
      }

      my $pattern = qr/\b([$cc]{$min,$max})\b/;

      ( my $new_string = $old_string ) =~ s/$pattern/ exists $dict{ $1 } ? $dict{ $1 } : $1 /eg;

      print "$old_string\n$new_string\n";

      __DATA__
      1 dog
      2 cat
      3 chased
      4 the
        I defined it like this:
        #!/usr/bin/perl
        use warnings;
        use strict;

        open (DATA, "dic") || die "Error opening the input file\n";
        print "Reading mapping file\n";
        print "----------------------------\n";

        open (INFILE, "trial.txt") || die "Error opening the input file\n";
        print "Reading input file\n";
        print "----------------------------\n";

        my %dict;
        while (my $line = <INFILE>) {
            my $old_string = $line;
            while ( <DATA> ) {
                my ( $key, $val ) = /^(\d+)\s+(\w+)/;
                $dict{ $val } = $key;
            }

            my $cc = join '', keys %dict;
            my ( $min ) = my ( $max ) = map length, keys %dict;
            for ( map length, keys %dict ) {
                $min = $_ if $min > $_;
                $max = $_ if $max < $_;
            }

            my $pattern = qr/\b([$cc]{$min,$max})\b/;
            ( my $new_string = $old_string ) =~ s/$pattern/ exists $dict{ $1 } ? $dict{ $1 } : $1 /eg;
            print "$new_string\n";
        }

        close (INFILE);
        close (DATA);
        but I get this error:
        Reading mapping file
        ----------------------------
        Reading input file
        ----------------------------
        april 1168 0.06781456
        Use of uninitialized value in concatenation (.) or string at seek.pl line 29, <DATA> line 19969.
        Use of uninitialized value in concatenation (.) or string at seek.pl line 29, <DATA> line 19969.
        Unmatched [ in regex; marked by <-- HERE in m/\b([ <-- HERE ]{,})\b/ at seek.pl line 29, <DATA> line 19969.

        Note that 19969 is the last line of my dictionary file. Also, how can I ignore case when matching? For example, April exists in the dictionary file but april does not, so I would like to make the match case-insensitive. Thanks in advance.
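One way to restructure the script, offered as a sketch rather than a diagnosis of the error above: read the dictionary once before the input loop, lower-case the keys, and look up each lower-cased word directly in the hash. The helper names read_dict and translate are hypothetical, and in-memory file handles stand in for the real "dic" and "trial.txt" files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read "index word" lines once, before touching the text. Keys are
# lower-cased so that "April" in the dictionary also matches "april".
sub read_dict {
    my ($fh) = @_;
    my %dict;
    while ( my $line = <$fh> ) {
        my ( $index, $word ) = $line =~ /^\s*(\d+)\s+(\w+)/ or next;
        $dict{ lc $word } = $index;
    }
    return \%dict;
}

# Replace known words line by line; unknown words pass through unchanged.
sub translate {
    my ( $dict, $in, $out ) = @_;
    while ( my $line = <$in> ) {
        $line =~ s/(\w+)/exists $dict->{ lc $1 } ? $dict->{ lc $1 } : $1/ge;
        print {$out} $line;
    }
    return;
}

# Sample data (assumed); real files would be opened with
# open my $fh, '<', 'dic' or die "Cannot open dic: $!";
my $dict_text  = "225 April\n1168 barrel\n";
my $input_text = "april\nbarrel\n";
open my $dict_fh, '<', \$dict_text  or die "Cannot open dictionary: $!";
open my $in_fh,   '<', \$input_text or die "Cannot open input: $!";

translate( read_dict($dict_fh), $in_fh, \*STDOUT );
```

Looking words up in the hash avoids building a character class from the keys, so no pattern depends on the dictionary contents, and lower-casing both the keys and the looked-up word gives the case-insensitive behaviour asked about.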