How to count substitutions on an array

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to count substitutions on an array by Limbic~Region (Chancellor) on Aug 13, 2016 at 02:08 UTC
Anonymous Monk, Why are you trying to cram everything into a single line? `my $count = 0; for (@array) { # I've had no difficulty counting substitutions on individual lin +e replacements # my $individual_line_replacement_count = ...; $count += $individual_line_replacement_count; }` [download] Cheers - L~R	[reply] [d/l]
Re^2: How to count substitutions on an array by Anonymous Monk on Aug 13, 2016 at 02:29 UTC
I didn't know how to do it in any way other than a nested "foreach" loop. On an array with tens of thousands of items, and needing to iterate through the entire array for a list of thousands of words/phrases, I might have been waiting a long time. I wish it were easier to search online for specific coding issues with programming. Sigh. Your solution works for me. Thank you very much.	[reply]
Re: How to count substitutions on an array by Marshall (Canon) on Aug 13, 2016 at 03:19 UTC
Consider this to count the number of substitutions: `#!/usr/bin/perl use warnings; use strict; my $x = "xyzzyblahxyzzymoreBLahxyzzyasdfouyXYZZY"; print "Input = $x\n"; my $count = $x =~ s/xyzzy//gi; print "count = $count\n"; print "result = $x\n\n"; $x = "blahxyzzy"; print "Input = $x\n"; $count = $x =~ s/xyZzy//gi; print "count = $count\n"; print "result = $x\n"; __END__ PRINTS: Input = xyzzyblahxyzzymoreBLahxyzzyasdfouyXYZZY count = 4 result = blahmoreBLahasdfouy Input = blahxyzzy count = 1 result = blah` [download] A foreach loop over each array element is fine. Do not mistake fewer source lines with more efficiency.	[reply] [d/l]
Re^2: How to count substitutions on an array by Anonymous Monk on Aug 13, 2016 at 05:09 UTC
Marshall, Are you saying that a foreach loop would be just as fast as what Limbic-Region was suggesting for an array? Consider the following solution concept, which is what I had before, that I don't think is nearly so fast (though I didn't time it precisely). `foreach $replace (@substitutionlist) { ($oldline, $newline) = split(/\t/, $replace); $newline = s/\s/_SPACE_/g; #….more code here foreach $line (@array) { $count += s/\b$oldline\b/$spliced/eg for @array; #….more code here } } s/_SPACE_/ /g for @array;` [download] Consider that @substitutionlist has over 10,000 lines, and @array has over 30,000, with many lines requiring multiple substitution replacements on a single line. Is the foreach setup outlined above really just as efficient?	[reply] [d/l]
Re^3: How to count substitutions on an array by AnomalousMonk (Archbishop) on Aug 13, 2016 at 14:29 UTC
`$count += s/\b$oldline\b/$spliced/eg for @array;` A small point: I don't understand why you're using the `/e` modifier in this substitution. It "works" (i.e., produces the same result) just as well with it as without, but if you're concerned about speed, I don't see how firing up the interpreter for each and every substitution is going to help. Is `$spliced` just a placeholder for significant code that you don't want to show? Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^3: How to count substitutions on an array by AnomalousMonk (Archbishop) on Aug 13, 2016 at 18:23 UTC
`foreach $line (@array) {` `$count += s/\b$oldline\b/$spliced/eg for @array;` `# ... more code here` `}` Another point to consider with this block of code is that the `$count += s/\b$oldline\b/$spliced/eg for @array;` statement is executed for each and every element of the `foreach $line (@array) { ... }` loop, but unless something tricky is going on in the `# ... more code here` section, every execution of `$count += s/.../.../eg for @array;` after the first will have nothing to do: every substitution will have already been made on the first execution, i.e., with the processing of the very first element of the outer loop. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^4: How to count substitutions on an array by Anonymous Monk on Aug 13, 2016 at 19:11 UTC
Re^3: How to count substitutions on an array by Marshall (Canon) on Aug 14, 2016 at 14:55 UTC
The OP (Original Post) showed 2 lines of code. Neither one of which tallied the number of substitutions properly. My examples showed how to tally that for an individual line in very explict terms as I thought that was the "problem". Both of your OP's example lines contain a "foreach" loop. "for" is just a shorthand for "foreach". "map" is a kind of a foreach loop. My post and Limbic~Region's are similar in advice. "A foreach is fine", means that you save nothing by "disguising" the foreach by writing it on a single line. The "foreach" is still there. Are you saying that a foreach loop would be just as fast as what Limbic-Region was suggesting for an array? Limbic~Region's "for" statement IS a "foreach loop". The OP didn't mention anything about running 10K regex'es on 30K lines! Although, it sounds like from subsequent posts that you have a solution that meets your needs in terms of performance. I don't really understand your application, but if this is some type of word for word substitution situation, a hash based approach would be faster. But that is mute if you are happy with what you have.	[reply]
Re: How to count substitutions on an array by BillKSmith (Monsignor) on Aug 13, 2016 at 16:38 UTC
Your first guess was close, but you need scalar context. I chose to let $spliced "double" the matched string. (It did require /ee to accomplish this. `use strict; use warnings; my $oldline = qr/xyzzy/; my $spliced = q/"$1$1"/; my @array = ( 'xyzzy blah xyzzy moreBLah xyzzy asdfouyXYZZY', 'blah xyzzy', ); my $count; $count += s/\b($oldline)\b/$spliced/gee for @array; $" = "\n"; print "@array\n\n";` [download] OUTPUT: `xyzzyxyzzy blah xyzzyxyzzy moreBLah xyzzyxyzzy asdfouyXYZZY blah xyzzyxyzzy 4` [download] Bill	[reply] [d/l] [select]
Re: How to count substitutions on an array by AnomalousMonk (Archbishop) on Aug 13, 2016 at 18:30 UTC
If your substitutions are all literal substitutions, i.e., all of the form `'banana' => 'plum'` `'xyzzy' => 'whatever'` and never of the form `/f[eio]e+/ => 'something'` and definitely not the sort of thing BillKSmith is doing here, then the following approach might be helpful. If you have many substitutions to make, the alternation of the `$search` regex may grow quite large, but with the alternation trie optimization of Perl version 5.10, this should not be a problem — unless it gets really big! (My guess is that 10K search-replace pairs could be handled.) c:\@Work\Perl>perl -le "use 5.010; ;; use warnings; use strict; ;; my %direct_substitution = ( 'apple' => 'PEAR', 'red' => 'YELLOW', 'xyzzy' => 'SOME OTHER THING', ); ;; my ($search) = map qr{ \b (?: $_ ) \b }xms, join q{ \| }, keys %direct_substitution ; print $search; ;; my @lines = ( 'apple xapple apple applex xapplex apple', 'red', 'xxx red apple yyyy xyzzy zz', ); ;; my $count = 0; $count += s{ ($search) }{$direct_substitution{$1}}xmsg for @lines; ;; print qq{substitutions: $count}; print qq{'$_'} for @lines; " (?^msx: \b (?: apple \| red \| xyzzy ) \b ) substitutions: 7 'PEAR xapple PEAR applex xapplex PEAR' 'YELLOW' 'xxx YELLOW PEAR yyyy SOME OTHER THING zz' [download] Of course, you would go through your `@substitutionlist` array to build the `%direct_substitution` hash before processing the `@lines` array. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: How to count substitutions on an array by Anonymous Monk on Aug 13, 2016 at 19:35 UTC
AnomalousMonk, I appreciate the time you have put into forming such detailed responses. I must admit that I do not fully understand this one, but may come back to review it later to see what I can learn from it. However, in my case, I don't think the substitution list itself can be better arranged. My situation has, at present, no regex expressions, only pure textual substitutions; however, it has substitutions of varying lengths requiring the longer ones to come first. The longest ones will be several sentences in length. For example, I might wish to substitute "every one" with "everyone" AFTER I have already substituted "every one of them" with "every one of them" (no change, but the substitution itself will get padded so that it will not match subsequently, thus preserving it from being incorrectly changed to "everyone" in that case). Another case would be to change "sore athirst" to "very thirsty" BEFORE changing all "athirst" instances to "thirsty." In my case, I am ordering the longest substitutions to take place first. I don't think Regex::Assemble would properly handle this. Efficiency must take second priority, though it is important because ongoing edits will require many executions of the script. I am needing to count each substitution so that I can checksum with the original files to ascertain the correct substitutions have indeed taken place.	[reply]
Re^3: How to count substitutions on an array by AnomalousMonk (Archbishop) on Aug 14, 2016 at 00:51 UTC
You say you are quite satisfied with your current solution, but perhaps this may be of interest for future reference. ... no regex expressions, only pure textual substitutions ... substitutions of varying lengths requiring the longer ones to come first. ... no change, but the substitution itself will get padded so that it will not match subsequently ... This raises a point I had overlooked before. It's possible to add longest-first discrimination when building an alternation. I've also made an attempt to add some acceptance of variable whitespace to the solution. There's also a feature to skip over certain phrases. This avoids the substitution of a substring with itself just to step over it in a possibly expensive no-op. (The weird capitalization is just to emphasize the substituted bits.) c:\@Work\Perl>perl -le "use 5.010; ;; use warnings; use strict; ;; my @skip_over = ( 'every one of them', 'all in good time', ); ;; my ($skip) = map qr{ \b (?: $_ ) \b (SKIP) (FAIL) }xms, join q{ \| }, map qr{ \Q$_\E }xms, sort { length($b) <=> length($a) } @skip_over ; print qq{\$skip: $skip \n}; ;; my %direct_substitution = ( 'every one' => 'EVERYONE', 'sore athirst' => 'Very Thirsty', 'athirst' => 'THIRSTY', 'all in' => 'GONZO', ); ;; my ($capture) = map qr{ \b (?: $_ ) \b }xms, join q{ \| }, map qr{ \Q$_\E }xms, sort { length($b) <=> length($a) } keys %direct_substitution ; print qq{\$capture: $capture \n}; ;; my $line = qq{every one wang chung every one \n} . qq{of them are sore athirst if not well athirst. \n} . qq{all in good time we will enjoy all in wrestling. \n} ; print qq{[[$line]] \n}; ;; $line =~ s{ \s+ }' 'xmsg; print qq{(($line)) \n}; ;; my $count = 0; $count += $line =~ s{ ($skip \| $capture) } {$direct_substitution{$1}}xmsg; ;; print qq{substitutions: $count}; print qq{<<$line>> \n}; " $skip: (?^msx: \b (?: (?^msx: every\ one\ of\ them ) \| (?^msx: all\ in +\ good\ time ) ) \b (SKIP) ( FAIL) ) $capture: (?^msx: \b (?: (?^msx: sore\ athirst ) \| (?^msx: every\ one +) \| (?^msx: athirst ) \| (?^msx : all\ in ) ) \b ) [[every one wang chung every one of them are sore athirst if not well athirst. all in good time we will enjoy all in wrestling. ]] ((every one wang chung every one of them are sore athirst if not well +athirst. all in good time we will enjoy all in wrestling. )) substitutions: 4 <<EVERYONE wang chung every one of them are Very Thirsty if not well T +HIRSTY. all in good time we will enjoy GONZO wrestling. >> [download] (Some long output lines have been arbitrarily wrapped when composing this post.) Of course, I intend each `$line` to be an element in an array over which you're looping. Update: Of course, it's possible to get rid of the `$line =~ s{ \s+ }' 'xmsg;` whitespace collapsing step and make the code even more whitespace agnostic. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: How to count substitutions on an array by davido (Cardinal) on Aug 14, 2016 at 00:29 UTC
`$count = () = s///` is a single expression that repeats for each element in `@array`. Apparently the final element in `@array` matches but once. If there were two patterns in the final element of `@array` matching your pattern, then the could would be 2. On each iteration of your `for` loop the expression is evaluated against a new `$_` input, and `$count` is replaced. You do not employ an accumulator of any kind. If you wanted to do this as a single line of code you could do this: `do {$count = () = $s/PATTERN/REPLACEMENT/g; $acc += $count} for @array +;` [download] Update: Apparently it's not my day. See below. Thanks AnomalousMonk. In the end, `$acc` should contain your total number of matches. Keep in mind that your `/e` modifier is probably wrong, and maybe a risk. If the replacement comes from somewhere outside of your control, it could be used to pass a destructive command through the Perl interpreter. Dave	[reply] [d/l] [select]
Re^2: How to count substitutions on an array by AnomalousMonk (Archbishop) on Aug 14, 2016 at 01:08 UTC
`do {$count = () = $s/PATTERN/REPLACEMENT/g; $acc += $count} for @array;` But `s///` returns the number of substitutions made (or the empty string if none), so when list-ified this number (or the empty string) will always be a single-elemment list (update: which in scalar context will always be a `$count` of 1), so the `$acc` total will always be the number of elements of the `@array` array. `c:\@Work\Perl>perl -wMstrict -le "my @array = qw(PAT PATxPATxPAT PATxPATxPATxPATxPAT xxxxx); ;; my $acc; do { my $count = () = s/PAT/REP/g; $acc += $count; } for @array; ;; print qq{\$acc: $acc \@array: (@array)}; " $acc: 4 @array: (REP REPxREPxREP REPxREPxREPxREPxREP xxxxx)` [download] Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^3: How to count substitutions on an array by davido (Cardinal) on Aug 14, 2016 at 02:42 UTC
Thanks. You caught me! Here are some idioms that could be preferable: `use strict; use warnings; my @data = qw(foozbar foozball zoobeezoobedo); my $acc = 0; do { my $count = s/o//g; $acc+=$count; } for @data; print "$acc\n"; # But the above really can just boil down to: my $count; $count += s/\/o/g for @data; print "$count\n"; # List::Util's reduce is always fun. use List::Util 'reduce'; my $total = reduce {$a += $b =~ s/o/*/g} 0, @data; print "$total\n";` [download] Hope this helps (and thanks again for catching the error). Dave	[reply] [d/l]
Re: How to count substitutions on an array by Anonymous Monk on Aug 15, 2016 at 18:51 UTC
Two remarks: you don't need the /e modifier as the replacement is interpreted (with // at least); if you need a sum of numbers, you can use the `sum` from List::Util. `Use List::Util qw( sum ); $count = sum map s/$foo/$bar/g, @array;` [download] (But use `sum0` if `@array` can be empty and `undef` won't do.)	[reply] [d/l] [select]