se@n has asked for the wisdom of the Perl Monks concerning the following question:

    use strict;
    use warnings;

    my $substring = 'xxx';
    my $parent = 'xxx xxx xxx';
    my @count = split($substring,$parent);
    my $total = $#count + 1;
    print "there are $total $substring in $parent\n";

    if($parent =~ /$substring[\s\S]*$substring/) { print "Duplicates\n" }
    elsif( $parent !~ /$substring/) { print "None\n"; }
    else { print "Unique\n" }

This is a style/performance question. What's the best way to count the number of items in a string and test for uniqueness? I'm not asking what it does; I'd like to know if there is a better way. Thanks.

Re: Counting SubStrings, Style Question
by BrowserUk (Patriarch) on Mar 20, 2010 at 19:49 UTC

    Update: Corrected per jwkrahn's reply.

    If scalar @count is greater than 2, then there must be more than one copy of substring in parent. If it is 2, there is just one. If it is one, it doesn't appear.

    Update: To clarify per /msg, this is equivalent:

        my $substring = 'xxx';
        my $parent = 'xxx xxx xxx';
        my @count = split($substring,$parent);
        my $total = scalar @count;
        print "there are $total $substring in $parent\n";

        if( $total == 1 ) { print "None\n"; }
        elsif( $total == 2 ) { print "Unique\n" }
        else { print "Duplicates\n"; }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      And if it is 1, it is quantum entangled to hold all values at the same time?

Re: Counting SubStrings, Style Question
by LanX (Saint) on Mar 20, 2010 at 20:33 UTC
        lanx@nc10-ubuntu:~$ perl -e '
          $sub = "xxx";
          $par = "xxx xxx xxx";
          $count = () = $par =~ /$sub/g;
          $count == 0 ? print "none"
            : $count == 1 ? print "unique"
            : print "duplicates";
        '
        duplicates

    HTH! :)

    Cheers Rolf

    UPDATE: beautified code ;)

    see here how to avoid the ternary cascade.

      Although the OPed question does not seem to be concerned with overlapping occurrences, they can also be handled with m{pattern}g. If the sub-string may contain regex metacharacters, it is also wise to meta-quote.

          >perl -wMstrict -le
          "my $substring = 'xxx';
           for my $string (@ARGV) {
             my $non_overlap =()= $string =~ m{ \Q$substring\E }xmsg;
             my $overlap =()= $string =~ m{ (?= (\Q$substring\E)) }xmsg;
             print qq{$non_overlap non-overlapping $substring in $string};
             print qq{$overlap overlapping $substring in $string};
             }
          " x xx xxx xxxx xxxxx xxxxxx
          0 non-overlapping xxx in x
          0 overlapping xxx in x
          0 non-overlapping xxx in xx
          0 overlapping xxx in xx
          1 non-overlapping xxx in xxx
          1 overlapping xxx in xxx
          1 non-overlapping xxx in xxxx
          2 overlapping xxx in xxxx
          1 non-overlapping xxx in xxxxx
          3 overlapping xxx in xxxxx
          2 non-overlapping xxx in xxxxxx
          4 overlapping xxx in xxxxxx

      Update: The capturing group in  m{ (?= (\Q$substring\E)) }xmsg is needed only if there is a concern with what matched rather than only with how many matched as in the OPed question.

        > If the sub-string may contain regex metacharacters, it is also wise to meta-quote.

        the same when using split.

        Cheers Rolf
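To illustrate the point about meta-quoting with split: the following is only a sketch, not code from the thread. The helper name field_count and the sample strings are made up for the example. It passes a limit of -1 to split so that trailing empty fields are kept, which makes the field count equal to the number of separator matches plus one.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): number of fields split()
# produces for a given separator pattern.
sub field_count {
    my ($pattern, $string) = @_;
    my @fields = split $pattern, $string, -1;   # -1 keeps trailing empty fields
    return scalar @fields;
}

my $substring = 'a.c';            # contains the regex metacharacter '.'
my $parent    = 'a.c abc a.c';

# Unquoted: '.' matches any character, so 'abc' also acts as a separator.
my $unquoted = field_count(qr/$substring/, $parent);

# Quoted: only the literal text 'a.c' acts as a separator.
my $quoted = field_count(qr/\Q$substring\E/, $parent);

print "unquoted: $unquoted fields, quoted: $quoted fields\n";
```

Without \Q the middle word 'abc' is swallowed as a separator, so the unquoted version reports one field more than the quoted one.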

Re: Counting SubStrings, Style Question
by ikegami (Patriarch) on Mar 20, 2010 at 21:53 UTC

    split's purpose is to break down a separated list of items. The first argument of split should normally match the separator, not what you want to extract.

        my $parent = 'xxx xxx xxx';
        my $count = my @items = split(/ /, $parent);
        print "$count\n";   # 3

    Since you have a count, all you need to do to check for duplicates is check the count.

        if ($count < 1) { print "None\n"; }
        elsif ($count < 2) { print "Unique\n"; }
        else { print "$count\n"; }

    > This is a style/performance question.

    It sounds from your description that you can have an input like

    my $parent = 'xxx yyy xxx zzz';

    If so, you should worry about using a *working* solution first.

        my $substring = 'xxx';
        my $parent = 'xxx yyy xxx zzz';
        my @count = split($substring,$parent);
        my $total = $#count + 1;
        print "there are $total $substring in $parent\n";   # XXX 3

    This stems from your misuse of split.

    In this case, I'd use

        my $substring = 'xxx';
        my $parent = 'xxx yyy xxx zzz';
        # /g is needed: without it the list-context match returns at most one element
        my $count = () = $parent =~ /\Q$substring/g;

        if ($count < 1) { print "None\n"; }
        elsif ($count < 2) { print "Unique\n"; }
        else { print "$count\n"; }

      Your conditions are wrong:

          { my $c = my @c = split 'xxx', 'fred bill xx joe'; print $c };;
          1
          { my $c = my @c = split 'xxx', 'fred bill xxx joe'; print $c };;
          2
        My conditions are not wrong.

            for my $count (0..4) {
                print("$count: ");
                if ($count < 1) { print "None\n"; }
                elsif ($count < 2) { print "Unique\n"; }
                else { print "$count\n"; }
            }

            0: None
            1: Unique
            2: 2
            3: 3
            4: 4

        Or using your examples

            my $substr = 'xxx';
            for my $parent (
                'fred bill xx joe',
                'fred bill xxx joe',
            ) {
                # /g so that every occurrence is counted, not just the first
                my $count = () = $parent =~ /\Q$substr/g;
                if ($count < 1) { print "None\n"; }
                elsif ($count < 2) { print "Unique\n"; }
                else { print "$count\n"; }
            }

            None
            Unique

        It's the count that is wrong. My entire post is about how split is not the right tool here because it can give the wrong count. That's why my solution doesn't use split.

Re: Counting SubStrings, Style Question
by ww (Archbishop) on Mar 20, 2010 at 21:09 UTC

    Actually, I think you're OK as is for the moment (but consider the other replies above, as well). Optimization probably won't gain you much for the case in point.

    However, consider the case in which you try to deal with multiple "parents:"

    #!/usr/bin/perl use strict; use warnings; #829841 my @parent =<DATA>; my $substring = 'xxx'; my @count; my $total; for my $parent (@parent) { @count = split /$substring/, $parent; $total = $#count; print "there are $total $substring in $parent\n"; if($parent =~ /$substring[\s\S]*$substring/) { print "Duplicates\n" } elsif( $parent !~ /$substring/) { print "None\n"; } else { print "Unique\n" } print "-"x25 ."\n"; @count=""; $total=0; } __DATA__ xxx xxx xxx xxx yyy xxx yyy xxx yyy zyx zyx xx x

    Unless you reset your array and counter, you'll encounter off-by-1 inaccuracies.

    Moving on to your actual question: the standard technique is to use a hash (keys are unique and values can provide the counter for each variant element in your "$parent").

    The excellent reply by JSchmitz to Count number of words on a web page? provides code; the "Perl Cookbook" (Christiansen & Torkington, O'Reilly) offers alternates and explanations at pp102-103 (at least in my May 1999 edition)
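For readers who don't have either reference at hand, here is a minimal sketch of the hash technique described above. It is not the code from the cited reply or from the Cookbook; the helper name tally_items and the sample string are assumptions made for illustration, and it assumes the items in the string are separated by whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: tally whitespace-separated items in a string.
# Keys of the returned hash are the distinct items, values are their counts.
sub tally_items {
    my ($parent) = @_;
    my %seen;
    $seen{$_}++ for split ' ', $parent;   # split ' ' also skips leading whitespace
    return \%seen;
}

my $tally = tally_items('xxx yyy xxx zzz');

for my $item (sort keys %$tally) {
    my $n = $tally->{$item};
    printf "%-4s %d (%s)\n", $item, $n, $n == 1 ? 'unique' : 'duplicated';
}
```

An item that never occurs simply has no key in the hash, so the "None" case becomes an exists test rather than a third branch on the count.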

    Updated: Took me so long to write this that citing only the first reply above was misleading

Re: Counting SubStrings, Style Question
by juster (Friar) on Mar 21, 2010 at 06:54 UTC
    You can also use the simple:
        my $count = 0;
        ++$count while ( $parent =~ /$substring/go );

    ...which is faster and uses less memory in my benchmarks than the =()= "operator". I also find it more readable.

    Another reason split is bad: If the string ends with the substring you get 1 less in split's result.

        mclapdawg:829841 juster$ perl -E 'say scalar split /xxx/, q{xxx xxx }'
        3
        mclapdawg:829841 juster$ perl -E 'say scalar split /xxx/, q{xxx xxx}'
        2
      > Another reason split is bad: If the string ends with the substring you get 1 less in split's result.

      May I correct you?

      Split isn't "bad" it's wrong!

      Excellent, the best reason so far! =)

      > I also find it more readable.

      Well ... readability is a question of taste and habit. And this approach needs to always initialize $count with 0.

      But IMHO the o modifier is not really necessary anymore and doesn't make it more readable. :)

      > ...which is faster and uses less memory in my benchmarks than the =()= "operator".

      I got a penalty between 50 and 100% which is not sooo dramatic... and this highly depends on the length of the investigated string!

      So the question is rather, if searching for millions of matches is really a common use case of the OP.

      Cheers Rolf
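Neither benchmark was posted in the thread, so the following is only a sketch of how the two idioms could be compared with the core Benchmark module. The sub names and the length of the test string are arbitrary assumptions, and the timings will differ between machines and Perl versions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $substring = 'xxx';
my $parent    = 'xxx ' x 1_000;   # test string length is an arbitrary assumption

# Count via the list-assignment idiom ("=()=").
sub count_list {
    my $count = () = $parent =~ /\Q$substring\E/g;
    return $count;
}

# Count via a while loop over successive /g matches in scalar context.
sub count_loop {
    my $count = 0;
    ++$count while $parent =~ /\Q$substring\E/g;
    return $count;
}

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    list => \&count_list,
    loop => \&count_loop,
});
```

Changing the repetition count in $parent is the quickest way to see how much the relative difference depends on the number of matches.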

Re: Counting SubStrings, Style Question
by JavaFan (Canon) on Mar 20, 2010 at 22:14 UTC
    > I'd like to know if there is a better way.
    Better in which way? Faster? (If so, on what OS, what version of Perl, how long are the strings?) More memory friendly? (Again, on which OS, for which version of Perl, how long are the strings?) Fewer keystrokes? Code *I* can understand better? Code *you* can understand better? Code *the next programmer* can understand better? Code that works with the oldest version of Perl? More features? Need it to send email? Something else?

    You have code, why aren't you satisfied with it?

Re: Counting SubStrings, Style Question
by LanX (Saint) on Mar 20, 2010 at 22:15 UTC
    TIMTOWT use the rolex operator! =)
        perl -e '
          for $i (0..3) {
            $sub = "xxx";
            $par = "xxx " x $i;
            $count = () = $par =~ /\Q$sub\E/g;
            $msg = qw( none one duplicates ) [ $count<2 ? $count : 2 ];
            print $msg,"\n";
          }'
        none
        one
        duplicates
        duplicates

    Cheers Rolf