Counting SubStrings, Style Question

se@n has asked for the wisdom of the Perl Monks concerning the following question:

Re: Counting SubStrings, Style Question by BrowserUk (Patriarch) on Mar 20, 2010 at 19:49 UTC
Update: Corrected per jwkrahn's reply. If `scalar @count` is greater than 2, then there must be more than one copy of the substring in the parent. If it is 2, there is just one. If it is ~~zero~~ one, it doesn't appear. Update: To clarify per /msg, this is equivalent: `my $substring = 'xxx'; my $parent = 'xxx xxx xxx'; my @count = split($substring,$parent); my $total = scalar @count; print "there are $total $substring in $parent\n"; if( $total == 1 ) { print "None\n"; } elsif( $total == 2 ) { print "Unique\n" } else { print "Duplicates\n"; }` Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "I'd rather go naked than blow up my ass"
Re^2: Counting SubStrings, Style Question by jwkrahn (Abbot) on Mar 20, 2010 at 20:58 UTC
And if it is 1, it is quantum entangled to hold all values at the same time?
Re: Counting SubStrings, Style Question by LanX (Saint) on Mar 20, 2010 at 20:33 UTC
`lanx@nc10-ubuntu:~$ perl -e ' $sub = "xxx"; $par = "xxx xxx xxx"; $count = () = $par =~ /$sub/g; $count == 0 ? print "none" : $count == 1 ? print "unique" : print "duplicates"; ' duplicates` HTH! :) Cheers Rolf UPDATE: beautified code ;) see here how to avoid the ternary cascade.
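The `$count = () = ...` line above is the part that tends to puzzle readers, so here is a minimal sketch of it wrapped in a sub. The name `count_substr` is hypothetical (it does not appear in the thread), and the `\Q...\E` quoting is added on the assumption that the substring should be treated as literal text rather than as a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count non-overlapping
# occurrences of a literal, non-empty substring.
sub count_substr {
    my ($parent, $substring) = @_;

    # Assigning to the empty list () puts the match in list context, so /g
    # returns every match. A list assignment evaluated in scalar context
    # yields the number of elements on its right-hand side, which is the
    # match count.
    my $count = () = $parent =~ /\Q$substring\E/g;
    return $count;
}

print count_substr('xxx xxx xxx', 'xxx'), "\n";        # prints 3
print count_substr('fred bill xx joe', 'xxx'), "\n";   # prints 0
```

The sketch assumes a non-empty substring; an empty pattern has special meaning in Perl and is deliberately not handled here.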
Re^2: Counting SubStrings, Style Question by AnomalousMonk (Archbishop) on Mar 20, 2010 at 20:58 UTC
Although the OP's question does not seem to be concerned with overlapping occurrences, they can also be handled with `m{pattern}g`. If the sub-string may contain regex metacharacters, it is also wise to meta-quote. `>perl -wMstrict -le "my $substring = 'xxx'; for my $string (@ARGV) { my $non_overlap =()= $string =~ m{ \Q$substring\E }xmsg; my $overlap =()= $string =~ m{ (?= (\Q$substring\E)) }xmsg; print qq{$non_overlap non-overlapping $substring in $string}; print qq{$overlap overlapping $substring in $string}; } " x xx xxx xxxx xxxxx xxxxxx` `0 non-overlapping xxx in x 0 overlapping xxx in x 0 non-overlapping xxx in xx 0 overlapping xxx in xx 1 non-overlapping xxx in xxx 1 overlapping xxx in xxx 1 non-overlapping xxx in xxxx 2 overlapping xxx in xxxx 1 non-overlapping xxx in xxxxx 3 overlapping xxx in xxxxx 2 non-overlapping xxx in xxxxxx 4 overlapping xxx in xxxxxx` Update: The capturing group in `m{ (?= (\Q$substring\E)) }xmsg` is needed only if there is a concern with what matched rather than only with how many matched, as in the OP's question.
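As a cross-check on the overlapping versus non-overlapping counts above, here is a sketch that avoids regexes entirely by using the built-in `index`. This is a swapped-in technique, not what the post uses; the helper name `count_with_index` and its third argument are hypothetical. Because `index` searches for literal text, no meta-quoting is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of $substring in $string with index().
# If $overlap is true the search resumes one character after each hit;
# otherwise it resumes after the end of the hit.
sub count_with_index {
    my ($string, $substring, $overlap) = @_;
    return 0 if $substring eq '';   # avoid an endless loop on an empty needle

    my $step  = $overlap ? 1 : length $substring;
    my $count = 0;
    my $pos   = 0;
    while ( ( $pos = index($string, $substring, $pos) ) >= 0 ) {
        $count++;
        $pos += $step;
    }
    return $count;
}

for my $string (qw( xxx xxxx xxxxxx )) {
    printf "%s: %d non-overlapping, %d overlapping\n",
        $string,
        count_with_index($string, 'xxx', 0),
        count_with_index($string, 'xxx', 1);
}
```

For these inputs the counts agree with the regex output shown above (1 and 1, 1 and 2, 2 and 4).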
Re^3: Counting SubStrings, Style Question by LanX (Saint) on Mar 20, 2010 at 21:14 UTC
> If the sub-string may contain regex metacharacters, it is also wise to meta-quote. The same applies when using split. Cheers Rolf
Re: Counting SubStrings, Style Question by ikegami (Patriarch) on Mar 20, 2010 at 21:53 UTC
`split`'s purpose is to break down a separated list of items. The first argument of `split` should normally match the separator, not what you want to extract. `my $parent = 'xxx xxx xxx'; my $count = my @items = split(/ /, $parent); print "$count\n"; # 3` Since you have a count, all you need to do to check for duplicates is check the count. `if ($count < 1) { print "None\n"; } elsif ($count < 2) { print "Unique\n"; } else { print "$count\n"; }` This is a style/performance question. It sounds from your description that you can have an input like `my $parent = 'xxx yyy xxx zzz';` If so, you should worry about using a working solution first. `my $substring = 'xxx'; my $parent = 'xxx yyy xxx zzz'; my @count = split($substring,$parent); my $total = $#count + 1; print "there are $total $substring in $parent\n"; # XXX 3` This stems from your misuse of `split`. In this case, I'd use `my $substring = 'xxx'; my $parent = 'xxx yyy xxx zzz'; my $count = () = $parent =~ /\Q$substring/g; if ($count < 1) { print "None\n"; } elsif ($count < 2) { print "Unique\n"; } else { print "$count\n"; }`
Re^2: Counting SubStrings, Style Question by Anonymous Monk on Mar 20, 2010 at 23:24 UTC
Your conditions are wrong: `{ my $c = my @c = split 'xxx', 'fred bill xx joe'; print $c };; 1 { my $c = my @c = split 'xxx', 'fred bill xxx joe'; print $c };; 2`
Re^3: Counting SubStrings, Style Question by ikegami (Patriarch) on Mar 20, 2010 at 23:36 UTC
My conditions are not wrong. `for my $count (0..4) { print("$count: "); if ($count < 1) { print "None\n"; } elsif ($count < 2) { print "Unique\n"; } else { print "$count\n"; } }` `0: None 1: Unique 2: 2 3: 3 4: 4` Or using your examples `my $substr = 'xxx'; for my $parent ( 'fred bill xx joe', 'fred bill xxx joe', ) { my $count = () = $parent =~ /\Q$substr/g; if ($count < 1) { print "None\n"; } elsif ($count < 2) { print "Unique\n"; } else { print "$count\n"; } }` `None Unique` It's the count that is wrong. My entire post is about how `split` is not the right tool here because it can give the wrong count. That's why my solution doesn't use `split`.
Re^4: Counting SubStrings, Style Question by Anonymous Monk on Mar 21, 2010 at 00:38 UTC
Re^5: Counting SubStrings, Style Question by ikegami (Patriarch) on Mar 21, 2010 at 05:09 UTC
Re: Counting SubStrings, Style Question by ww (Archbishop) on Mar 20, 2010 at 21:09 UTC
Actually, I think you're OK as is for the moment (but consider the other replies above, as well). Optimization probably won't gain you much for the case in point. However, consider the case in which you try to deal with multiple "parents:" `#!/usr/bin/perl use strict; use warnings; #829841 my @parent =<DATA>; my $substring = 'xxx'; my @count; my $total; for my $parent (@parent) { @count = split /$substring/, $parent; $total = $#count; print "there are $total $substring in $parent\n"; if($parent =~ /$substring[\s\S]$substring/) { print "Duplicates\n" } elsif( $parent !~ /$substring/) { print "None\n"; } else { print "Unique\n" } print "-"x25 ."\n"; @count=""; $total=0; } __DATA__ xxx xxx xxx xxx yyy xxx yyy xxx yyy zyx zyx xx x` Unless you reset your array and counter, you'll encounter off-by-1 inaccuracies. Moving on to your actual question: the standard technique is to use a hash (keys are unique and values can provide the counter for each variant element in your "`$parent`"). The excellent reply by JSchmitz to Count number of words on a web page? provides code; the "Perl Cookbook" (Christiansen & Torkington, O'Reilly) offers alternates and explanations at pp. 102-103 (at least in my May 1999 edition). Updated: Took me so long to write this that citing only the first reply above was misleading.
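The hash technique mentioned above can be sketched as follows. This is a minimal illustration, not the code from the cited reply or the Cookbook; it assumes the parent string is a list of whitespace-separated tokens, and the helper name `tally_tokens` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: tally each whitespace-separated token in $parent.
# Keys are the distinct tokens; values are how often each one occurs.
sub tally_tokens {
    my ($parent) = @_;
    my %seen;
    $seen{$_}++ for split ' ', $parent;   # split on whitespace, the separator
    return \%seen;
}

my $tally = tally_tokens('xxx yyy xxx zzz');
for my $token (sort keys %$tally) {
    my $n = $tally->{$token};
    print "$token: ", ( $n == 1 ? 'Unique' : "Duplicates ($n)" ), "\n";
}
```

Note that `split` is used here on the separator, so one pass gives the count for every token, and a token that never appears simply has no key in the hash.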
Re: Counting SubStrings, Style Question by juster (Friar) on Mar 21, 2010 at 06:54 UTC
You can also use the simple: `my $count = 0; ++$count while ( $parent =~ /$substring/go );` ...which is faster and uses less memory in my benchmarks than the `=()=` "operator". I also find it more readable. Another reason split is bad: If the string ends with the substring you get 1 less in split's result. `mclapdawg:829841 juster$ perl -E 'say scalar split /xxx/, q{xxx xxx }' 3 mclapdawg:829841 juster$ perl -E 'say scalar split /xxx/, q{xxx xxx}' 2`
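The trailing-substring discrepancy comes from `split` dropping trailing empty fields by default. The sketch below puts the default field count, the field count with a negative limit (which keeps trailing empty fields), and the actual match count side by side. The helper name `field_counts` is hypothetical, and the `\Q...\E` quoting is an added assumption that the substring is literal text.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (fields with default split,
# fields with limit -1, number of regex matches).
sub field_counts {
    my ($parent, $substring) = @_;
    my @default  = split /\Q$substring\E/, $parent;       # trailing empty fields dropped
    my @keep_all = split /\Q$substring\E/, $parent, -1;   # trailing empty fields kept
    my $matches  = () = $parent =~ /\Q$substring\E/g;
    return ( scalar @default, scalar @keep_all, $matches );
}

for my $parent ( 'xxx xxx ', 'xxx xxx' ) {
    my ($default, $keep_all, $matches) = field_counts($parent, 'xxx');
    print "'$parent': default fields=$default, limit -1 fields=$keep_all, matches=$matches\n";
}
```

For these two inputs the limit -1 field count is always one more than the match count, while the default field count changes depending on whether the string ends with the substring. Counting matches directly avoids the adjustment altogether.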
Re^2: Counting SubStrings, Style Question by LanX (Saint) on Mar 21, 2010 at 10:48 UTC
> Another reason split is bad: If the string ends with the substring you get 1 less in split's result. May I correct you? Split isn't "bad", it's wrong! Excellent, the best reason so far! =) > I also find it more readable. Well ... readability is a question of taste and habit. And this approach needs to always initialize `$count` with 0. But IMHO the `o` modifier is ~~quite pointless here~~ not really necessary anymore and doesn't make it more readable. :) > ...which is faster and uses less memory in my benchmarks than the =()= "operator". I got a penalty between 50 and 100%, which is not sooo dramatic... and this highly depends on the length of the investigated string! So the question is rather whether searching for millions of matches is really a common use case of the OP. Cheers Rolf
Re: Counting SubStrings, Style Question by JavaFan (Canon) on Mar 20, 2010 at 22:14 UTC
> I'd like to know if there a better way. Better in which way? Faster? (If so, on what OS, what version of Perl, how long are the strings?) More memory friendly? (Again, on which OS, for which version of Perl, how long are the strings?) Fewer keystrokes? Code I can understand better? Code you can understand better? Code the next programmer can understand better? Code that works with the oldest version of Perl? More features? Need it to send email? Something else? You have code, why aren't you satisfied with it?
Re: Counting SubStrings, Style Question by LanX (Saint) on Mar 20, 2010 at 22:15 UTC
TIMTOWTDI: use the rolex operator! =) `perl -e ' for $i (0..3) { $sub = "xxx"; $par = "xxx " x $i; $count = () = $par =~ /\Q$sub\E/g; $msg = qw( none one duplicates ) [ $count<2 ? $count : 2 ]; print $msg,"\n"; }' none one duplicates duplicates` Cheers Rolf