It seems there is a problem.

When $kw = 'C++' or $kw = 'C', it fails to give the correct count.

Consider the scenario below:

    my $kw = 'C';    # or use C++
    my $title = ".net C .NET Cobol .NET C++ .NET .NETER Perl IT x.NET .net";
    my $count = () = $title =~ m{ (?<! \S) \Q$kw\E \b }xmsig;
    print $count;
    die;

Here the count should be 1 when $kw is 'C' and also 1 when $kw is 'C++', but both give wrong answers.

Re^3: Regex help \b & \Q
by AnomalousMonk (Archbishop) on Apr 14, 2016 at 11:45 UTC

    If whitespace or start/end of string is going to be the delimiter, I would use:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $title = 'C .NET Cobol .NET .NET .NETER Perl C++ C+ xC++ C+++ C++x xC x.NET .net';
     ;;
     for my $kw (qw(.NET C C++)) {
       my $count = () = $title =~ m{ (?<! \S) \Q$kw\E (?! \S) }xmsig;
       print qq{'$kw' $count};
       }
    "
    '.NET' 4
    'C' 1
    'C++' 1


    Give a man a fish:  <%-{-{-{-<

      Thanks for this one, AnomalousMonk. It is working for me.

      I have another scenario now,

      my $kw = '.Net';
      my $title = ".net, .net; C .NET Cobol .NET C++ .NET .NETER Perl IT x.NET .net";

      The answer should be 6 in this case. That is, the comma and semicolon cases have to be considered too: if a $kw is followed by a comma or a semicolon, it should count.

        Better:

        my $count = () = $title =~ m{ (?:^|\s)\K \Q$kw\E (?! [^\s,;] ) }xig;
        • (?! [^\s,;] ) is more efficient than (?: (?! \S) | (?= [,;]) ).
        • (?:^|\s)\K is more efficient than (?<! \S ).
        • The s and m flags weren't necessary.
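
        As a quick check of the pattern above, here is a minimal sketch that runs it against the title from the question. The helper name count_keyword is made up for this example, and \K assumes Perl 5.10 or later.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: count whole-keyword matches, where a keyword must be
        # preceded by start-of-string or whitespace and followed by whitespace,
        # a comma, a semicolon, or end-of-string.
        sub count_keyword {
            my ($title, $kw) = @_;
            # The empty-list assignment forces list context, so $count receives
            # the number of matches found by /g.
            my $count = () = $title =~ m{ (?:^|\s)\K \Q$kw\E (?! [^\s,;] ) }xig;
            return $count;
        }

        my $title = '.net, .net; C .NET Cobol .NET C++ .NET .NETER Perl IT x.NET .net';

        for my $kw (qw(.NET C C++)) {
            print "'$kw' ", count_keyword($title, $kw), "\n";
        }
        # '.NET' 6
        # 'C' 1
        # 'C++' 1
        ```

        The leading whitespace is matched but kept out of the result by \K, and the trailing delimiter is a lookahead, so a single space between two keywords can still serve both of them.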

        Try this:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = '.net, .net; C .NET Cobol .NET C++ .NET .NETER Perl IT x.NET .net';
         ;;
         for my $kw (qw(.NET C C++)) {
           my $count = () = $s =~ m{ (?<! \S) \Q$kw\E (?: (?! \S) | (?= [,;])) }xmsig;
           print qq{'$kw' $count};
           }
        "
        '.NET' 6
        'C' 1
        'C++' 1
        There might be something a bit more elegant than  (?: (?! \S) | (?= [,;])) for the end delimiter, but it was a quick fix.


        Give a man a fish:  <%-{-{-{-<

Re^3: Regex help \b & \Q (updated)
by haukex (Archbishop) on Apr 14, 2016 at 11:14 UTC

    Hi Anonymous,

    \b matches between a \w (in this case "C") and a \W (in this case "+"). If your keywords are always separated by whitespace, something like the following might work. It would be helpful if you could post several example inputs with their expected outputs.

    Update: The following does not work correctly if the input string contains multiple instances of $kw separated by a single \s. Thanks to AnomalousMonk for catching that!

    my $kw = 'C';    # or use C++
    my $title = ".net C .NET Cobol .NET C++ .NET .NETER Perl IT x.NET .net";
    my $count = () = $title =~ m{ (?:^|\s) \Q$kw\E (?:\s|$) }xmsig;
    print "$count\n";    # prints "1" for both C and C++
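
    As a rough illustration of the \b behaviour described above (a sketch, not code from the thread), the snippet below compares a trailing \b with a trailing (?! \S) on the title string from the question. The helper names count_with_b and count_with_lookahead are invented for this example.

    ```perl
    use strict;
    use warnings;

    my $title = '.net C .NET Cobol .NET C++ .NET .NETER Perl IT x.NET .net';

    # Trailing \b, as in the pattern from the question.
    sub count_with_b {
        my ($text, $kw) = @_;
        my $count = () = $text =~ m{ (?<! \S) \Q$kw\E \b }xig;
        return $count;
    }

    # Trailing (?! \S): the keyword must be followed by whitespace or end of string.
    sub count_with_lookahead {
        my ($text, $kw) = @_;
        my $count = () = $text =~ m{ (?<! \S) \Q$kw\E (?! \S) }xig;
        return $count;
    }

    for my $kw ('C', 'C++') {
        printf "%s: trailing \\b gives %d, trailing (?! \\S) gives %d\n",
            $kw, count_with_b($title, $kw), count_with_lookahead($title, $kw);
    }
    # C: trailing \b gives 2, trailing (?! \S) gives 1
    # C++: trailing \b gives 0, trailing (?! \S) gives 1
    ```

    With a trailing \b, 'C' also matches the C at the start of "C++" because there is a word boundary between "C" and "+", while 'C++' never matches because there is no word boundary between "+" and the following space.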

    Hope this helps,
    -- Hauke D

      The problem with using  (?:^|\s) and  (?:\s|$) as delimiter patterns is that  \s in the middle of a string requires and consumes a whitespace character. If only a single whitespace character separates patterns that are intended to match, some matches will be missed:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $title = 'C C C C++ C++ C++ .NET .NET .NET';
       ;;
       for my $kw (qw(.NET C C++)) {
         my $count = () = $title =~ m{ (?:^|\s) \Q$kw\E (?:\s|$) }xmsig;
         print qq{'$kw' $count};
         }
      "
      '.NET' 2
      'C' 2
      'C++' 2
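
      For comparison, here is a small sketch that runs the same string through both kinds of delimiter: the consuming (?:^|\s) ... (?:\s|$) form and zero-width lookarounds. The helper names count_consuming and count_lookaround are made up for this example.

      ```perl
      use strict;
      use warnings;

      my $title = 'C C C C++ C++ C++ .NET .NET .NET';

      # Delimiters that consume whitespace: the space that ends one match is
      # no longer available to start the next one.
      sub count_consuming {
          my ($text, $kw) = @_;
          my $count = () = $text =~ m{ (?:^|\s) \Q$kw\E (?:\s|$) }xig;
          return $count;
      }

      # Zero-width delimiters: nothing is consumed, so neighbouring keywords
      # can share a single space.
      sub count_lookaround {
          my ($text, $kw) = @_;
          my $count = () = $text =~ m{ (?<! \S) \Q$kw\E (?! \S) }xig;
          return $count;
      }

      for my $kw (qw(.NET C C++)) {
          printf "'%s' consuming: %d  lookaround: %d\n",
              $kw, count_consuming($title, $kw), count_lookaround($title, $kw);
      }
      # '.NET' consuming: 2  lookaround: 3
      # 'C' consuming: 2  lookaround: 3
      # 'C++' consuming: 2  lookaround: 3
      ```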


      Give a man a fish:  <%-{-{-{-<

      Thanks for your reply, haukex. It works for most cases but fails for the one put forward by AnomalousMonk.