rlrandallx has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

This is a regex question so forgive me if I should be asking it somewhere else.

Here's my problem: I am searching through a string/document for a word. If I find it, I wish to print the line number, the line, and highlight the word (square brackets are OK, but I would really like ANSI color). The issue is when there is more than one hit on the same line. I've read all about global matching but nothing works. Below is my code:

    $line = 'This line has a hit here and a hit there.';
    $word = 'hit';
    $count = 0;
    while ($line =~ /\b$word\b/gi)    # I also tried "/gic" & pos()
    {
        $line = "$`".'['."$&".']'."$'";
        $count++
    }
    print "$lino $line\n";
    print "$word was found $count times\n";

With color, the one line is "$line = "$`".BLACK \ ON_YELLOW ."$&".RESET."$'"; Anyway, if I replace the 'while' with an 'if' it highlights the first 'hit'. I just can't get the other(s) to highlight. I know it is because I am modifying $line, but this structure has worked before, so I am trying to find an alternate "sure-fire" way.

Thanks for any help -rlrandallx

Replies are listed 'Best First'.
Re: Highlighting Regex Hits
by toolic (Bishop) on May 15, 2010 at 00:56 UTC
    This will surround all your hit words with square brackets and show the hit count:
        use strict;
        use warnings;

        my $line = 'This line has a hit here and a hit there.';
        my $word = 'hit';
        my $count = $line =~ s/\b($word)\b/[$1]/gi;
        print "$line\n";
        print "$word was found $count times\n";

        __END__
        This line has a [hit] here and a [hit] there.
        hit was found 2 times

    The substitution operator returns the number of substitutions made.
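One detail about that return value: when nothing matches, s/// returns false (an empty string) rather than 0, so the count would print as blank. A minimal sketch of guarding against that; the variable names and the `|| 0` default are illustrative assumptions, not part of the snippet above:

```perl
use strict;
use warnings;

my $word = 'hit';

# Line with two matches: s///g returns the number of substitutions.
my $line_with = 'This line has a hit here and a hit there.';
my $found = $line_with =~ s/\b($word)\b/[$1]/gi;

# Line with no match: s///g returns false (an empty string), not 0,
# so default it to 0 before printing.
my $line_without = 'Nothing to see here.';
my $missing = ($line_without =~ s/\b($word)\b/[$1]/gi) || 0;

print "$line_with\n";
print "$word was found $found times\n";
print "$word was found $missing times\n";
```

The parentheses around the substitution are only there for readability; `=~` already binds tighter than `||`.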

Re: Highlighting Regex Hits
by ikegami (Patriarch) on May 15, 2010 at 01:14 UTC

    It is possible to achieve this using m//g.

        my $line = 'This line has a hit here and a hit there.';
        my $word = 'hit';

        my $count = 0;
        my $hilit = '';
        while ($line =~ /(.*?)(?:\b($word)\b|\z)/sgi) {
            $hilit .= $1;
            if (defined($2)) {
                ++$count;
                $hilit .= "[$2]";
            }
        }

        print "$hilit\n";
        print "$count occurrences of $word\n";

    /c would indeed allow you to simplify the above.

        my $line = 'This line has a hit here and a hit there.';
        my $word = 'hit';

        my $count = 0;
        my $hilit = '';
        while ($line =~ /(.*?)\b($word)\b/sgci) {
            $hilit .= "$1[$2]";
            ++$count;
        }

        $hilit .= substr($line, pos($line));

        print "$hilit\n";
        print "$count occurrences of $word\n";

    But s///g is much simpler.

        my $line = 'This line has a hit here and a hit there.';
        my $word = 'hit';

        my $count = (my $hilit = $line) =~ s/\b($word)\b/[$1]/gi;

        print "$hilit\n";
        print "$count occurrences of $word\n";

    Note that if $word can contain characters other than those matched by \w, \b may fail, and the contents may be treated as regex instructions (e.g. $word="foo.bar" would match foolbar).
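A minimal sketch of one way to handle that caveat, assuming \Q...\E quoting for the search word and lookarounds in place of \b. The sample line and the names $loose and $hilit are made up for illustration:

```perl
use strict;
use warnings;

my $line = 'foolbar is not foo.bar';
my $word = 'foo.bar';

# Unquoted: the "." is a regex wildcard, so "foolbar" matches too.
my $loose = () = $line =~ /\b($word)\b/gi;

# Quoted with \Q...\E: the "." only matches a literal dot.
# (?<!\w) and (?!\w) stand in for \b so the pattern still works
# when $word starts or ends with a non-word character.
my $count = (my $hilit = $line) =~ s/(?<!\w)(\Q$word\E)(?!\w)/[$1]/gi;

print "$hilit\n";
print "unquoted pattern matched $loose times, quoted matched $count\n";
```

The lookarounds only assert that no word character touches the match, which is roughly what \b gives you when the word itself begins and ends with \w characters.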

    Update: Fixed a bug in first snippet.

      OK now. Can anyone get it to work with ANSI COLORS? -rlrandallx
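A minimal sketch of the s///g approach with color, assuming the core Term::ANSIColor module and its colored() function. The 'black on_yellow' attribute string mirrors the BLACK ON_YELLOW from the question, and the \Q...\E quoting is an added precaution rather than something from the replies above:

```perl
use strict;
use warnings;
use Term::ANSIColor qw(colored);

my $line = 'This line has a hit here and a hit there.';
my $word = 'hit';

# /e evaluates the replacement as Perl code, so each match is
# wrapped in the escape sequences that colored() returns.
my $count = (my $hilit = $line)
    =~ s/\b(\Q$word\E)\b/colored($1, 'black on_yellow')/gie;

print "$hilit\n";
print "$count occurrences of $word\n";
```

Because the original $line is copied before substituting, it stays untouched, which also sidesteps the pos() trouble of rewriting the string inside a while loop.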
Re: Highlighting Regex Hits
by JavaFan (Canon) on May 15, 2010 at 15:11 UTC
    I don't think anyone has yet explained why the while (//g) {} solution isn't working. The problem lies in the assignment to $line, which resets pos(). So, if there is a hit, the while loop will never terminate, it will find the same hit over and over again, each time adding a new pair of brackets.

    Well, because $line will grow two characters each iteration, eventually the program will run out of memory, terminating the program (and hence, the loop).
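The reset described above can be observed directly. A small sketch, with a made-up sample string, that records pos() before and after an assignment to the matched variable:

```perl
use strict;
use warnings;

my $line = 'a hit here and a hit there';

# First match in scalar context with /g: pos() points just past "hit".
$line =~ /\bhit\b/g;
my $pos_after_match = pos($line);

# Assigning a new value to $line discards the remembered position,
# so the next m//g would start again from the beginning.
$line = "[x]$line";
my $pos_after_assign = pos($line);

print "after match:  $pos_after_match\n";
print 'after assign: ',
    (defined $pos_after_assign ? $pos_after_assign : 'undef'), "\n";
```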

Re: Highlighting Regex Hits
by Natanael (Acolyte) on May 15, 2010 at 08:55 UTC
    An alternative way is to use split, which has an (IMO) not often used feature - it can return both what was matched, and what was between:
        my $line = 'This line 1 has a hit here and a hit there.';
        my $word = 'hit';
        my $count = 0;
        my $n = 0;
        my @stuff = split m/($word)/, $line;
        grep { $n++; if ($n % 2) { print $_; } else { print RED, $_, RESET; $count++; } } @stuff;
        print "\nFound $count times.\n";
    I found that this scales better than running m// or s/// through a while loop on big strings. It is also handy if you need to return the modified string (split + join) instead of printing its parts.
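A minimal sketch of the split + join variant mentioned above, building the highlighted string instead of printing the pieces. The \b anchors, the \Q...\E quoting, and the square brackets are assumptions added for this illustration:

```perl
use strict;
use warnings;

my $line = 'This line has a hit here and a hit there.';
my $word = 'hit';

# The capturing group makes split return the separators too, so
# odd-numbered elements are the matches and even-numbered elements
# are the text between them.
my @parts = split /\b(\Q$word\E)\b/i, $line;

my $count = 0;
for my $i (0 .. $#parts) {
    next unless $i % 2;
    $parts[$i] = "[$parts[$i]]";
    $count++;
}

my $hilit = join '', @parts;
print "$hilit\n";
print "Found $count times.\n";
```

If the line starts with the search word, split returns an empty first element, so the matches still land on the odd indexes.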

      Couple quibbles:

      1. re "not often used" is actually fairly common; it's been cited in at least two nodes in the past couple days
      2. and re print RED,... my 5.10.1 under *n*x pukes on this (sees "RED" as a filehandle, illegally followed by a comma). Since ikegami has already referred OP to the docs on ANSI, please take this merely as an explanation of why I've used square-brackets rather than colorizing (we won't mention "lazy" here).

      But a more substantive issue (perhaps) lurks in your split where your version will match "hit," "Hitachi," and many others including the vulgar word below (at Note 1):

          #!/usr/bin/perl
          use strict;
          use warnings;
          # 840126

          my @line = ('Not here: line 1',
                      'This line 2 has a hit here and a hit there.',
                      'hit me, hit me, bust me in line 3!',
                      "Don't throw a shitfit over that hit in line 4.",    # *Note 1
                      'Line 5: my search-word does not exist here.');

          my $word = 'hit';
          my $total_count = 0;

          for my $line (@line) {
              my $count = 0;
              my $n = 0;

              my @stuff = split m/(\b$word\b)/, $line;
              # grep { $n++; if ($n % 2) { print $_; } else { print RED, $_, RESET; $count++; } }
              grep { $n++; if ($n % 2) { print $_; } else { print "\t[ $_ ]"; $count++; } } @stuff;
              print "\nFound $count times in the preceding line.\n";
              $total_count += $count;
          }
          print "Total count: $total_count\n";

          =head execution:
          ww@GIG:~/pl_test$ perl 840126.pl
          Not here: line 1
          Found 0 times in the preceding line.
          This line 2 has a [ hit ] here and a [ hit ] there.
          Found 2 times in the preceding line.
          [ hit ] me, [ hit ] me, bust me in line 3!
          Found 2 times in the preceding line.
          Don't throw a shitfit over that [ hit ] in line 4.
          Found 1 times in the preceding line.
          Line 5: my search-word does not exist here.
          Found 0 times in the preceding line.
          Total count: 5
          ww@GIG:~/pl_test$
          =cut

      IOW, I may have overwritten this, but using the word boundary metacharacter to restrict your matches (as in my split, m/(\b$word\b)/) is often a good idea.