texuser74 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I want to abbreviate certain words found in my input.txt based on the entries found in jabb.txt and write the output to output.txt file. The jabb.txt file contains 2 columns separated by tab. 1st column is abbreviated text and the second one is full text.
input.txt This is an example section. And the second line has structural science. jabb.txt eg <tab> example exp <tab> expand sect <tab> section sci <tab> science output.txt (required output) This is an eg sect. And the second line has structural sci.
With the following code i have written, my output.txt contains just the text ‘1’.
use strict; use warnings; open( ABB, '<', 'jabb.txt' ) or die "Couldn't open ABB.\n$!"; open( IN, '<', 'input.txt' ) or die "Couldn't open infile.\n$!"; open( OUT, '>', 'output.out' ) or die "Couldn't open outfile.\n$!"; my @myin = <IN>; while(<ABB>){ if(/(.*?)\t(.*?)\n/){ my $abb=$1; my $full=$2; print "@myin\n"; @myin = s/$full/$abb/g; } } print OUT "@myin\n";
please help me in correcting this script.

Replies are listed 'Best First'.
Re: find and replace
by Corion (Patriarch) on Nov 04, 2005 at 07:43 UTC

    The error in your script is that s/// does not work on arrays. You need to loop over the array and replace the full words with their abbreviations line by line:

    ... for my $line (@myin) { $line =~ s!$full!$abb!g; }; ...

    That code is a bit inefficient as it will loop over the main text many times, and it also has some problems, because it doesn't take care that you only replace full words. For example the word examples would be replaced by egs by your script, which may or may not be what you want. If you only want to replace whole words, use the \b boundary in your regular expression:

    ... for my $line (@myin) { $line =~ s!\b$full\b!$abb!g; }; ...

    Also, by doing the replacements step by step, you can get conflicts if one abbreviation is (or creates) a new word that also has an abbreviation - if you don't want that, you need to do all the replacements at once, which means you will need to build a better regular expression. But as I don't know if that is a problem for you, I will skip on that.

      Thanks Corion, now the output is correct.
Re: find and replace
by Zaxo (Archbishop) on Nov 04, 2005 at 09:02 UTC

    You can translate jabb.txt into a hash and substitute directly like this: (untested)

    my %jabb = do { open my $fh, '<', '/path/to/jabb.txt' or die $!; map { reverse split; } <$fh>; }; my $re = do { local $" = '|'; qr!\b(@{[map { qr/\Q$_\E/ keys %jabb]})\b!; }; open my $in, '<', '/another/path/to/input.txt' or die $!; open my $out, '>', '/the/path/to/output.txt' or die $!; while (<$in>) { s/$re/$jabb{$1}/g; print $out $_; }
    The less tricky part there is reversing the map over split so the full text is the key in %jabb. The really tricky part is combining and escaping the keys of %jabb with '|' to get a big alternation on the keys of %jabb so we can substitute the hash value for the key we find. If the values to substitute for have overlap at the initial position, you should order them so that the longest match is tried first

    After Compline,
    Zaxo

Re: find and replace
by Skeeve (Parson) on Nov 04, 2005 at 08:47 UTC

    What I noticed: You keep your input file in memory while you read your abreviations line by line.

    Depending on the usual size of those files this might be a good choice.

    OTOH: I'd expect an inputfile to be substantially larger than your abreviations.

    if so you might want to consider to keep the abreviations in memory and to loop over the input lines:

    # UNTESTED! use strict; use warnings; open( ABB, '<', 'jabb.txt' ) or die "Couldn't open ABB.\n$!"; open( IN, '<', 'input.txt' ) or die "Couldn't open infile.\n$!"; open( OUT, '>', 'output.out' ) or die "Couldn't open outfile.\n$!"; # prepare abreviations my %abb; while(<ABB>){ # there are shorter ways to do it, but to keep your code... if(/(.*?)\t(.*?)\n/) { $abb{$2}= $1 } } close ABB; # build a RE for quickly(?) finding abbreviatable words. my $abbre= join('|', map "\Q$_\E",keys %abb); $abbre= qr/\b($abbre)\b/; # replace my @myin = <IN>; while(<IN>){ s/$abbre/$abb{$1}/g; print OUT; } close IN; close OUT;

    Update: Thanks to Zaxo's obfu-entry ;-) below I noticed I forgot to escape the list of abbreviatable words.


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: find and replace
by l.frankline (Hermit) on Nov 04, 2005 at 09:41 UTC
    Hi,
    Try this.... it is quite helpful to you.
    here is the code....

       open( ABB, '<', 'jabb.txt' ) or die "Couldn't open ABB.\n$!";
       open( IN, '<', 'input.txt' ) or die "Couldn't open infile.\n$!";
       open( OUT, '>', 'output.out' ) or die "Couldn't open outfile.\n$!";

       my @myin = <IN>;

       while(<ABB>)
       {
            ($abb,$full) = split/\t/,$_;
            chomp($abb);
            chomp($full);
            for ($t=0;$t<=$#myin;$t++)
            {
              $myin[$t] =~s/$full/$abb/g;
            }
       }
        print OUT "@myin\n";

    Regards Franklin
Re: find and replace
by radiantmatrix (Parson) on Nov 04, 2005 at 15:06 UTC

    You actually have two problems. First, regexes don't work on arrays directly. Second, a substitution (s///) returns the number of substitutions made. You don't want $term = s/$full/$abb/g, because the = will assign that count to $term. You want =~ which means "make this regex act on" (sort of).

    This could be easily fixed by slurping, rather than reading your IN into an array.

    # instead of @myin = <IN>; my $myin; { local $/ = undef; $myin = <IN>; #slurp all to scalar } #instead of print "@myin\n"; @myin = s/$full/$abb/g; print "$myin\n"; $myin =~ s/\b$full\b/abb/g
    Update:benizi pointed out my foolishness in capturing word boundaries. The sad thing is, I knew that and did it anyway. That's like the fourth really dumb thing I've done today. Sigh. Thanks, benizi!

    I took the liberty of capturingmatching word boundaries (\b) and adding them back ($1 and $2), since without that code, you'd replace too much. For example, trying to replace all occurances of 'hello' with 'hi', you'd change 'Othello' into 'Othi'. ;-) Checking for word boundaries should help with that.

    <-radiant.matrix->
    A collection of thoughts and links from the minds of geeks
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law

      Word boundaries (\b) are zero-width, so you shouldn't capture them. ($1 and $2 will always be empty in your last line.)

      Thus, $myin =~ s/\b$full\b/$abb/g; is sufficient.