Re: How to count substitutions on an array
by Limbic~Region (Chancellor) on Aug 13, 2016 at 02:08 UTC
|
Anonymous Monk,
Why are you trying to cram everything into a single line?
my $count = 0;
for (@array) {
# I've had no difficulty counting substitutions on individual lin
+e replacements
# my $individual_line_replacement_count = ...;
$count += $individual_line_replacement_count;
}
| [reply] [d/l] |
|
I didn't know how to do it in any way other than a *nested* "foreach" loop. On an array with tens of thousands of items, and needing to iterate through the entire array for a list of thousands of words/phrases, I might have been waiting a long time.
I wish it were easier to search online for specific coding issues with programming. Sigh.
Your solution works for me. Thank you very much.
| [reply] |
Re: How to count substitutions on an array
by Marshall (Canon) on Aug 13, 2016 at 03:19 UTC
|
Consider this to count the number of substitutions:
#!/usr/bin/perl
use warnings;
use strict;
my $x = "xyzzyblahxyzzymoreBLahxyzzyasdfouyXYZZY";
print "Input = $x\n";
my $count = $x =~ s/xyzzy//gi;
print "count = $count\n";
print "result = $x\n\n";
$x = "blahxyzzy";
print "Input = $x\n";
$count = $x =~ s/xyZzy//gi;
print "count = $count\n";
print "result = $x\n";
__END__
PRINTS:
Input = xyzzyblahxyzzymoreBLahxyzzyasdfouyXYZZY
count = 4
result = blahmoreBLahasdfouy
Input = blahxyzzy
count = 1
result = blah
A foreach loop over each array element is fine.
Do not mistake fewer source lines with more efficiency. | [reply] [d/l] |
|
Marshall,
Are you saying that a foreach loop would be just as fast as what Limbic-Region was suggesting for an array? Consider the following solution concept, which is what I had before, that I don't think is nearly so fast (though I didn't time it precisely).
foreach $replace (@substitutionlist) {
($oldline, $newline) = split(/\t/, $replace);
$newline = s/\s/_SPACE_/g;
#….more code here
foreach $line (@array) {
$count += s/\b$oldline\b/$spliced/eg for @array;
#….more code here
}
}
s/_SPACE_/ /g for @array;
Consider that @substitutionlist has over 10,000 lines, and @array has over 30,000, with many lines requiring multiple substitution replacements on a single line. Is the foreach setup outlined above really just as efficient? | [reply] [d/l] |
|
$count += s/\b$oldline\b/$spliced/eg for @array;
A small point: I don't understand why you're using the /e modifier in this substitution. It "works" (i.e., produces the same result) just as well with it as without, but if you're concerned about speed, I don't see how firing up the interpreter for each and every substitution is going to help. Is $spliced just a placeholder for significant code that you don't want to show?
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
foreach $line (@array) {
$count += s/\b$oldline\b/$spliced/eg for @array;
# ... more code here
}
Another point to consider with this block of code is that the
$count += s/\b$oldline\b/$spliced/eg for @array;
statement is executed for each and every element of the
foreach $line (@array) { ... }
loop, but unless something tricky is going on in the # ... more code here section, every execution of
$count += s/.../.../eg for @array;
after the first will have nothing to do: every substitution will have already been made on the first execution, i.e., with the processing of the very first element of the outer loop.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
|
The OP (Original Post) showed 2 lines of code. Neither one of which tallied
the number of substitutions properly. My examples showed how to tally that
for an individual line in very explict terms as I thought that was the "problem".
Both of your OP's example lines contain a "foreach" loop. "for" is just a shorthand
for "foreach". "map" is a kind of a foreach loop. My post and Limbic~Region's
are similar in advice. "A foreach is fine", means that you save nothing by
"disguising" the foreach by writing it on a single line. The "foreach" is still there.
Are you saying that a foreach loop would be just as fast as what Limbic-Region was suggesting for an array?
Limbic~Region's "for" statement IS a "foreach loop".
The OP didn't mention anything about running 10K regex'es on 30K lines! Although, it sounds like from
subsequent posts that you have a solution that meets your needs in terms of performance. I don't really understand your application, but if this is some type of word for word substitution situation, a hash based approach would be faster. But that is mute if you are happy with what you have.
| [reply] |
Re: How to count substitutions on an array
by BillKSmith (Monsignor) on Aug 13, 2016 at 16:38 UTC
|
Your first guess was close, but you need scalar context. I chose to let $spliced "double" the matched string. (It did require /ee to accomplish this.
use strict;
use warnings;
my $oldline = qr/xyzzy/;
my $spliced = q/"$1$1"/;
my @array = (
'xyzzy blah xyzzy moreBLah xyzzy asdfouyXYZZY',
'blah xyzzy',
);
my $count;
$count += s/\b($oldline)\b/$spliced/gee for @array;
$" = "\n";
print "@array\n\n";
OUTPUT:
xyzzyxyzzy blah xyzzyxyzzy moreBLah xyzzyxyzzy asdfouyXYZZY
blah xyzzyxyzzy
4
| [reply] [d/l] [select] |
Re: How to count substitutions on an array
by AnomalousMonk (Archbishop) on Aug 13, 2016 at 18:30 UTC
|
If your substitutions are all literal substitutions, i.e., all of the form
'banana' => 'plum'
'xyzzy' => 'whatever'
and never of the form
/f[eio]e+/ => 'something'
and definitely not the sort of thing BillKSmith is doing here, then the following approach might be helpful. If you have many substitutions to make, the alternation of the $search regex may grow quite large, but with the alternation trie optimization of Perl version 5.10, this should not be a problem — unless it gets really big! (My guess is that 10K search-replace pairs could be handled.)
c:\@Work\Perl>perl -le
"use 5.010;
;;
use warnings;
use strict;
;;
my %direct_substitution = (
'apple' => 'PEAR',
'red' => 'YELLOW',
'xyzzy' => 'SOME OTHER THING',
);
;;
my ($search) =
map qr{ \b (?: $_ ) \b }xms,
join q{ | },
keys %direct_substitution
;
print $search;
;;
my @lines = (
'apple xapple apple applex xapplex apple',
'red',
'xxx red apple yyyy xyzzy zz',
);
;;
my $count = 0;
$count += s{ ($search) }{$direct_substitution{$1}}xmsg for @lines;
;;
print qq{substitutions: $count};
print qq{'$_'} for @lines;
"
(?^msx: \b (?: apple | red | xyzzy ) \b )
substitutions: 7
'PEAR xapple PEAR applex xapplex PEAR'
'YELLOW'
'xxx YELLOW PEAR yyyy SOME OTHER THING zz'
Of course, you would go through your @substitutionlist array to build the %direct_substitution hash before processing the @lines array.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
AnomalousMonk,
I appreciate the time you have put into forming such detailed responses. I must admit that I do not fully understand this one, but may come back to review it later to see what I can learn from it. However, in my case, I don't think the substitution list itself can be better arranged.
My situation has, at present, no regex expressions, only pure textual substitutions; however, it has substitutions of varying lengths requiring the longer ones to come first. The longest ones will be several sentences in length.
For example, I might wish to substitute "every one" with "everyone" AFTER I have already substituted "every one of them" with "every one of them" (no change, but the substitution itself will get padded so that it will not match subsequently, thus preserving it from being incorrectly changed to "everyone" in that case). Another case would be to change "sore athirst" to "very thirsty" BEFORE changing all "athirst" instances to "thirsty." In my case, I am ordering the longest substitutions to take place first. I don't think Regex::Assemble would properly handle this. Efficiency must take second priority, though it is important because ongoing edits will require many executions of the script. I am needing to count each substitution so that I can checksum with the original files to ascertain the correct substitutions have indeed taken place.
| [reply] |
|
You say you are quite satisfied with your current solution, but perhaps this may be of interest for future reference.
... no regex expressions, only pure textual substitutions ... substitutions of varying lengths requiring the longer ones to come first. ... no change, but the substitution itself will get padded so that it will not match subsequently ...
This raises a point I had overlooked before. It's possible to add longest-first discrimination when building an alternation. I've also made an attempt to add some acceptance of variable whitespace to the solution. There's also a feature to skip over certain phrases. This avoids the substitution of a substring with itself just to step over it in a possibly expensive no-op. (The weird capitalization is just to emphasize the substituted bits.)
c:\@Work\Perl>perl -le
"use 5.010;
;;
use warnings;
use strict;
;;
my @skip_over = (
'every one of them',
'all in good time',
);
;;
my ($skip) =
map qr{ \b (?: $_ ) \b (*SKIP) (*FAIL) }xms,
join q{ | },
map qr{ \Q$_\E }xms,
sort { length($b) <=> length($a) }
@skip_over
;
print qq{\$skip: $skip \n};
;;
my %direct_substitution = (
'every one' => 'EVERYONE',
'sore athirst' => 'Very Thirsty',
'athirst' => 'THIRSTY',
'all in' => 'GONZO',
);
;;
my ($capture) =
map qr{ \b (?: $_ ) \b }xms,
join q{ | },
map qr{ \Q$_\E }xms,
sort { length($b) <=> length($a) }
keys %direct_substitution
;
print qq{\$capture: $capture \n};
;;
my $line = qq{every one wang chung every one \n}
. qq{of them are sore athirst if not well athirst. \n}
. qq{all in good time we will enjoy all in wrestling. \n}
;
print qq{[[$line]] \n};
;;
$line =~ s{ \s+ }' 'xmsg;
print qq{(($line)) \n};
;;
my $count = 0;
$count += $line =~ s{ ($skip | $capture) }
{$direct_substitution{$1}}xmsg;
;;
print qq{substitutions: $count};
print qq{<<$line>> \n};
"
$skip: (?^msx: \b (?: (?^msx: every\ one\ of\ them ) | (?^msx: all\ in
+\ good\ time ) ) \b (*SKIP) (*
FAIL) )
$capture: (?^msx: \b (?: (?^msx: sore\ athirst ) | (?^msx: every\ one
+) | (?^msx: athirst ) | (?^msx
: all\ in ) ) \b )
[[every one wang chung every one
of them are sore athirst if not well athirst.
all in good time we will enjoy all in wrestling.
]]
((every one wang chung every one of them are sore athirst if not well
+athirst. all in good time we will enjoy all in wrestling. ))
substitutions: 4
<<EVERYONE wang chung every one of them are Very Thirsty if not well T
+HIRSTY. all in good time we will enjoy GONZO wrestling. >>
(Some long output lines have been arbitrarily wrapped when composing this post.) Of course, I intend each $line to be an element in an array over which you're looping.
Update: Of course, it's possible to get rid of the
$line =~ s{ \s+ }' 'xmsg;
whitespace collapsing step and make the code even more whitespace agnostic.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: How to count substitutions on an array
by davido (Cardinal) on Aug 14, 2016 at 00:29 UTC
|
$count = () = s/// is a single expression that repeats for each element in @array. Apparently the final element in @array matches but once. If there were two patterns in the final element of @array matching your pattern, then the could would be 2. On each iteration of your for loop the expression is evaluated against a new $_ input, and $count is replaced. You do not employ an accumulator of any kind.
If you wanted to do this as a single line of code you could do this:
do {$count = () = $s/PATTERN/REPLACEMENT/g; $acc += $count} for @array
+;
Update: Apparently it's not my day. See below. Thanks AnomalousMonk.
In the end, $acc should contain your total number of matches.
Keep in mind that your /e modifier is probably wrong, and maybe a risk. If the replacement comes from somewhere outside of your control, it could be used to pass a destructive command through the Perl interpreter.
| [reply] [d/l] [select] |
|
do {$count = () = $s/PATTERN/REPLACEMENT/g; $acc += $count} for @array;
But s/// returns the number of substitutions made (or the empty string if none), so when list-ified this number (or the empty string) will always be a single-elemment list (update: which in scalar context will always be a $count of 1), so the $acc total will always be the number of elements of the @array array.
c:\@Work\Perl>perl -wMstrict -le
"my @array = qw(PAT PATxPATxPAT PATxPATxPATxPATxPAT xxxxx);
;;
my $acc;
do { my $count = () = s/PAT/REP/g; $acc += $count; } for @array;
;;
print qq{\$acc: $acc \@array: (@array)};
"
$acc: 4 @array: (REP REPxREPxREP REPxREPxREPxREPxREP xxxxx)
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
use strict;
use warnings;
my @data = qw(foozbar foozball zoobeezoobedo);
my $acc = 0;
do {
my $count = s/o/*/g;
$acc+=$count;
} for @data;
print "$acc\n";
# But the above really can just boil down to:
my $count;
$count += s/\*/o/g for @data;
print "$count\n";
# List::Util's reduce is always fun.
use List::Util 'reduce';
my $total = reduce {$a += $b =~ s/o/*/g} 0, @data;
print "$total\n";
Hope this helps (and thanks again for catching the error).
| [reply] [d/l] |
Re: How to count substitutions on an array
by Anonymous Monk on Aug 15, 2016 at 18:51 UTC
|
Two remarks:
- you don't need the /e modifier as the replacement is interpreted (with // at least);
- if you need a sum of numbers, you can use the sum from List::Util.
Use List::Util qw( sum );
$count = sum map s/$foo/$bar/g, @array;
(But use sum0 if @array can be empty and undef won't do.) | [reply] [d/l] [select] |