Re: Counting SubStrings, Style Question
by BrowserUk (Patriarch) on Mar 20, 2010 at 19:49 UTC
Update: Corrected per jwkrahn's reply.
If scalar @count is greater than 2, then there must be more than one copy of the substring in the parent. If it is 2, there is just one. If it is one, it doesn't appear.
Update: To clarify per /msg, this is equivalent:
my $substring = 'xxx';
my $parent = 'xxx xxx xxx';
my @count = split($substring,$parent);
my $total = scalar @count;
print "there are $total $substring in $parent\n";
if( $total == 1 ) { print "None\n"; }
elsif( $total == 2 ) { print "Unique\n" }
else { print "Duplicates\n"; }
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Counting SubStrings, Style Question
by LanX (Saint) on Mar 20, 2010 at 20:33 UTC
lanx@nc10-ubuntu:~$ perl -e '
$sub = "xxx";
$par = "xxx xxx xxx";
$count = () = $par =~ /$sub/g;
$count == 0
? print "none"
: $count == 1
? print "unique"
: print "duplicates";
'
duplicates
HTH! :)
UPDATE: beautified code ;)
see here how to avoid the ternary cascade.
Although the OPed question does not seem to be concerned with overlapping occurrences, they can also be handled with m{pattern}g. If the sub-string may contain regex metacharacters, it is also wise to meta-quote.
>perl -wMstrict -le
"my $substring = 'xxx';
for my $string (@ARGV) {
my $non_overlap =()= $string =~ m{ \Q$substring\E }xmsg;
my $overlap =()= $string =~ m{ (?= (\Q$substring\E)) }xmsg;
print qq{$non_overlap non-overlapping $substring in $string};
print qq{$overlap overlapping $substring in $string};
}
" x xx xxx xxxx xxxxx xxxxxx
0 non-overlapping xxx in x
0 overlapping xxx in x
0 non-overlapping xxx in xx
0 overlapping xxx in xx
1 non-overlapping xxx in xxx
1 overlapping xxx in xxx
1 non-overlapping xxx in xxxx
2 overlapping xxx in xxxx
1 non-overlapping xxx in xxxxx
3 overlapping xxx in xxxxx
2 non-overlapping xxx in xxxxxx
4 overlapping xxx in xxxxxx
Update: The capturing group in m{ (?= (\Q$substring\E)) }xmsg is needed only if there is a concern with what matched rather than only with how many matched as in the OPed question.
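For comparison, the same two counts can be had without a regex at all. This is only a sketch: count_occurrences is a hypothetical helper name (not from any reply here), built on the core index function, and the early return for an empty needle is my own assumption about what a caller would want.

```perl
use strict;
use warnings;

# count_occurrences is a hypothetical helper, not code from this thread.
# It walks the string with index(), stepping by 1 for overlapping matches
# or by the needle length for non-overlapping ones.
sub count_occurrences {
    my ($substring, $string, $overlapping) = @_;
    return 0 if $substring eq '';    # assumption: an empty needle counts as zero
    my $step  = $overlapping ? 1 : length $substring;
    my $count = 0;
    my $pos   = 0;
    while ( ( $pos = index $string, $substring, $pos ) >= 0 ) {
        $count++;
        $pos += $step;
    }
    return $count;
}

for my $string (qw(x xx xxx xxxx xxxxx xxxxxx)) {
    printf "%d non-overlapping, %d overlapping xxx in %s\n",
        count_occurrences('xxx', $string, 0),
        count_occurrences('xxx', $string, 1),
        $string;
}
```

Since index does plain string comparison, no meta-quoting is needed with this approach.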
Re: Counting SubStrings, Style Question
by ikegami (Patriarch) on Mar 20, 2010 at 21:53 UTC
split's purpose is to break down a separated list of items. The first argument of split should normally match the separator, not what you want to extract.
my $parent = 'xxx xxx xxx';
my $count = my @items = split(/ /, $parent);
print "$count\n"; # 3
Since you have a count, all you need to do to check for duplicates is check the count.
if ($count < 1) { print "None\n"; }
elsif ($count < 2) { print "Unique\n"; }
else { print "$count\n"; }
This is a style/performance question.
It sounds from your description that you can have an input like
my $parent = 'xxx yyy xxx zzz';
If so, you should worry about using a *working* solution first.
my $substring = 'xxx';
my $parent = 'xxx yyy xxx zzz';
my @count = split($substring,$parent);
my $total = $#count + 1;
print "there are $total $substring in $parent\n"; # XXX 3
This stems from your misuse of split.
In this case, I'd use
my $substring = 'xxx';
my $parent = 'xxx yyy xxx zzz';
my $count = () = $parent =~ /\Q$substring/g;
if ($count < 1) { print "None\n"; }
elsif ($count < 2) { print "Unique\n"; }
else { print "$count\n"; }
{ my $c = my @c = split 'xxx', 'fred bill xx joe'; print $c };;
1
{ my $c = my @c = split 'xxx', 'fred bill xxx joe'; print $c };;
2
My conditions are not wrong.
for my $count (0..4) {
print("$count: ");
if ($count < 1) { print "None\n"; }
elsif ($count < 2) { print "Unique\n"; }
else { print "$count\n"; }
}
0: None
1: Unique
2: 2
3: 3
4: 4
Or using your examples
my $substr = 'xxx';
for my $parent (
'fred bill xx joe',
'fred bill xxx joe',
) {
my $count = () = $parent =~ /\Q$substr/g;
if ($count < 1) { print "None\n"; }
elsif ($count < 2) { print "Unique\n"; }
else { print "$count\n"; }
}
None
Unique
It's the count that is wrong. My entire post is about how split is not the right tool here because it can give the wrong count. That's why my solution doesn't use split.
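To make the disagreement concrete, here is a small side-by-side sketch. The helper names count_by_split and count_by_match are made up for this illustration, and "pieces minus one" is just one possible reading of the split-based approach, not anyone's posted code.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this comparison.
sub count_by_split {
    my ($substring, $parent) = @_;
    my @pieces = split /\Q$substring\E/, $parent;
    return scalar(@pieces) - 1;    # assumed rule: number of pieces minus one
}

sub count_by_match {
    my ($substring, $parent) = @_;
    my $count = () = $parent =~ /\Q$substring\E/g;
    return $count;
}

for my $parent ('fred bill joe', 'xxx yyy xxx zzz', 'xxx yyy xxx', 'yyy xxx', 'xxx') {
    printf "%-18s split: %2d  match: %d\n",
        "'$parent'",
        count_by_split('xxx', $parent),
        count_by_match('xxx', $parent);
}
```

The two columns agree only while the parent does not end with the substring. Because split drops trailing empty fields, any input that ends in a match comes out one short, and a parent consisting of nothing but the substring yields no fields at all.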
Re: Counting SubStrings, Style Question
by ww (Archbishop) on Mar 20, 2010 at 21:09 UTC
Actually, I think you're OK as is for the moment (but consider the other replies above, as well). Optimization probably won't gain you much for the case in point.
However, consider the case in which you try to deal with multiple "parents:"
#!/usr/bin/perl
use strict;
use warnings;
#829841
my @parent =<DATA>;
my $substring = 'xxx';
my @count;
my $total;
for my $parent (@parent) {
@count = split /$substring/, $parent;
$total = $#count;
print "there are $total $substring in $parent\n";
if($parent =~ /$substring[\s\S]*$substring/) {
print "Duplicates\n"
}
elsif( $parent !~ /$substring/) {
print "None\n";
}
else {
print "Unique\n"
}
print "-"x25 ."\n";
@count = (); $total = 0;   # empty the array and zero the counter
}
__DATA__
xxx xxx xxx
xxx yyy xxx
yyy xxx yyy
zyx zyx
xx
x
Unless you reset your array and counter, you'll encounter off-by-1 inaccuracies.
Moving on to your actual question: the standard technique is to use a hash (keys are unique and values can provide the counter for each variant element in your "$parent").
The excellent reply by JSchmitz to Count number of words on a web page? provides code; the "Perl Cookbook" (Christiansen & Torkington, O'Reilly) offers alternates and explanations at pp102-103 (at least in my May 1999 edition)
Updated: Took me so long to write this that citing only the first reply above was misleading
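A minimal sketch of the hash technique mentioned above, assuming the "elements" of the parent are whitespace-separated tokens. That assumption is mine; it counts whole words rather than arbitrary substrings, and the variable name %seen is chosen just for this example.

```perl
use strict;
use warnings;

my $parent = 'xxx yyy xxx zzz';

# Tally every whitespace-separated token; the token itself is the hash key.
my %seen;
$seen{$_}++ for split ' ', $parent;

# Report each distinct token with its count.
for my $word (sort keys %seen) {
    my $n = $seen{$word};
    printf "%s: %d (%s)\n", $word, $n, $n > 1 ? 'duplicates' : 'unique';
}
```

Here split is used as intended, with the separator as its first argument, and a token that never appears has no key at all, which covers the "None" case via exists.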
Re: Counting SubStrings, Style Question
by juster (Friar) on Mar 21, 2010 at 06:54 UTC
You can also use the simple:
my $count = 0;
++$count while ( $parent =~ /$substring/go );
...which is faster and uses less memory in my benchmarks than the =()= "operator". I also find it more readable.
Another reason split is bad: If the string ends with the substring you get 1 less in split's result.
mclapdawg:829841 juster$ perl -E 'say scalar split /xxx/, q{xxx xxx }'
3
mclapdawg:829841 juster$ perl -E 'say scalar split /xxx/, q{xxx xxx}'
2
> Another reason split is bad: If the string ends with the substring you get 1 less in split's result.
May I correct you?
Split isn't "bad", it's wrong!
Excellent, the best reason so far! =)
> I also find it more readable.
Well ... readability is a question of taste and habit. And this approach needs to always initialize $count with 0.
But IMHO the o modifier is not really necessary anymore and doesn't make it more readable. :)
> ...which is faster and uses less memory in my benchmarks than the =()= "operator".
I got a penalty between 50 and 100% which is not sooo dramatic... and this highly depends on the length of the investigated string!
So the question is rather, if searching for millions of matches is really a common use case of the OP.
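For anyone who wants to measure this on their own machine, a rough sketch with the core Benchmark module follows. The test string, the sub names and the one-second run time are all assumptions of mine, and the numbers will vary with Perl version and string length.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $substring = 'xxx';
my $parent    = 'xxx yyy ' x 1_000;    # assumed test input with 1000 matches

# Count via list assignment (the =()= idiom).
sub count_list {
    my $count = () = $parent =~ /\Q$substring\E/g;
    return $count;
}

# Count via a scalar-context //g loop.
sub count_loop {
    my $count = 0;
    ++$count while $parent =~ /\Q$substring\E/g;
    return $count;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, { list_assign => \&count_list, while_loop => \&count_loop } );
```

Changing the repeat factor on $parent is the easy way to see how much the result depends on the number of matches.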
Re: Counting SubStrings, Style Question
by JavaFan (Canon) on Mar 20, 2010 at 22:14 UTC
I'd like to know if there a better way.
Better in which way? Faster? (if so, on what OS, what version of Perl, how long are the strings?) More memory friendly? (Again, on which OS, for which version of Perl, how long are the strings?) Fewer keystrokes? Code *I* can understand better? Code *you* can understand better? Code *the next programmer* can understand better? Code that works with the oldest version of Perl? More features? Need it to send email? Something else?
You have code, why aren't you satisfied with it?
Re: Counting SubStrings, Style Question
by LanX (Saint) on Mar 20, 2010 at 22:15 UTC
perl -e '
for $i (0..3) {
$sub = "xxx";
$par = "xxx " x $i;
$count = () = $par =~ /\Q$sub\E/g;
$msg = qw( none one duplicates ) [ $count<2 ? $count : 2 ];
print $msg,"\n";
}'
none
one
duplicates
duplicates