in reply to Re: How to count substitutions on an array
in thread How to count substitutions on an array

AnomalousMonk,

I appreciate the time you have put into forming such detailed responses. I must admit that I do not fully understand this one, but may come back to review it later to see what I can learn from it. However, in my case, I don't think the substitution list itself can be better arranged.

My situation has, at present, no regex expressions, only pure textual substitutions; however, it has substitutions of varying lengths requiring the longer ones to come first. The longest ones will be several sentences in length.

For example, I might wish to substitute "every one" with "everyone" AFTER I have already substituted "every one of them" with "every one of them" (no change, but the substitution itself will get padded so that it will not match subsequently, thus preserving it from being incorrectly changed to "everyone" in that case). Another case would be to change "sore athirst" to "very thirsty" BEFORE changing all "athirst" instances to "thirsty." In my case, I am ordering the longest substitutions to take place first. I don't think Regex::Assemble would properly handle this. Efficiency must take second priority, though it is important because ongoing edits will require many executions of the script. I am needing to count each substitution so that I can checksum with the original files to ascertain the correct substitutions have indeed taken place.

  • Comment on Re^2: How to count substitutions on an array

Replies are listed 'Best First'.
Re^3: How to count substitutions on an array
by AnomalousMonk (Archbishop) on Aug 14, 2016 at 00:51 UTC

    You say you are quite satisfied with your current solution, but perhaps this may be of interest for future reference.

    ... no regex expressions, only pure textual substitutions ... substitutions of varying lengths requiring the longer ones to come first. ... no change, but the substitution itself will get padded so that it will not match subsequently ...

    This raises a point I had overlooked before. It's possible to add longest-first discrimination when building an alternation. I've also made an attempt to add some acceptance of variable whitespace to the solution. There's also a feature to skip over certain phrases. This avoids the substitution of a substring with itself just to step over it in a possibly expensive no-op. (The weird capitalization is just to emphasize the substituted bits.)

    c:\@Work\Perl>perl -le "use 5.010; ;; use warnings; use strict; ;; my @skip_over = ( 'every one of them', 'all in good time', ); ;; my ($skip) = map qr{ \b (?: $_ ) \b (*SKIP) (*FAIL) }xms, join q{ | }, map qr{ \Q$_\E }xms, sort { length($b) <=> length($a) } @skip_over ; print qq{\$skip: $skip \n}; ;; my %direct_substitution = ( 'every one' => 'EVERYONE', 'sore athirst' => 'Very Thirsty', 'athirst' => 'THIRSTY', 'all in' => 'GONZO', ); ;; my ($capture) = map qr{ \b (?: $_ ) \b }xms, join q{ | }, map qr{ \Q$_\E }xms, sort { length($b) <=> length($a) } keys %direct_substitution ; print qq{\$capture: $capture \n}; ;; my $line = qq{every one wang chung every one \n} . qq{of them are sore athirst if not well athirst. \n} . qq{all in good time we will enjoy all in wrestling. \n} ; print qq{[[$line]] \n}; ;; $line =~ s{ \s+ }' 'xmsg; print qq{(($line)) \n}; ;; my $count = 0; $count += $line =~ s{ ($skip | $capture) } {$direct_substitution{$1}}xmsg; ;; print qq{substitutions: $count}; print qq{<<$line>> \n}; " $skip: (?^msx: \b (?: (?^msx: every\ one\ of\ them ) | (?^msx: all\ in +\ good\ time ) ) \b (*SKIP) (* FAIL) ) $capture: (?^msx: \b (?: (?^msx: sore\ athirst ) | (?^msx: every\ one +) | (?^msx: athirst ) | (?^msx : all\ in ) ) \b ) [[every one wang chung every one of them are sore athirst if not well athirst. all in good time we will enjoy all in wrestling. ]] ((every one wang chung every one of them are sore athirst if not well +athirst. all in good time we will enjoy all in wrestling. )) substitutions: 4 <<EVERYONE wang chung every one of them are Very Thirsty if not well T +HIRSTY. all in good time we will enjoy GONZO wrestling. >>
    (Some long output lines have been arbitrarily wrapped when composing this post.) Of course, I intend each  $line to be an element in an array over which you're looping.

    Update: Of course, it's possible to get rid of the
        $line =~ s{ \s+ }' 'xmsg;
    whitespace collapsing step and make the code even more whitespace agnostic.


    Give a man a fish:  <%-{-{-{-<