Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
If I have 100 sentences in a file, one sentence per line, and I want to know which sentences contain exactly 4 r's using regular expressions, how do I do it?
I have tried r.*?r.*?r.*?r but this also matches lines with more than 4 r's.
Any ideas, please?

Replies are listed 'Best First'.
Re: extract sentences with certain number of a character
by mscharrer (Hermit) on Apr 27, 2008 at 09:32 UTC
    Don't use regexes for this; use the transliteration operator tr///. Its actual job is to translate each character in the first list to the corresponding character in the second, so tr/abc/efg/ would change every a to e, every b to f, etc. It returns the total number of characters it matched.
    So to check whether a line has exactly four 'r's, just use tr/r/r/ or tr/r//, which does the same thing.

    E.g.:

    while (my $line = <FILE>) {
        if ( ($line =~ tr/r//) == 4 ) {
            print $line;
        }
    }
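    The same check can be wrapped in a small helper. This is only a sketch: the name has_n_r and the sample strings are made up for illustration, and it counts lowercase 'r' only.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line contains exactly $n lowercase 'r's.
sub has_n_r {
    my ($line, $n) = @_;
    # With an empty replacement list and no /d modifier, tr/// only counts;
    # the string itself is left unchanged.
    return ($line =~ tr/r//) == $n;
}

# Illustrative input; in the original question these would be lines of a file.
for my $line ("error bar\n", "river\n", "no match here\n") {
    print $line if has_n_r($line, 4);
}
```

    Note that tr/// does not interpolate variables, so counting a character chosen at run time would need a different approach (for example a regex match in list context).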
Re: extract sentences with certain number of a character
by linuxer (Curate) on Apr 27, 2008 at 09:23 UTC

    Between the r's you want, you shouldn't match any character (.) but any non-r character ( [^r] ).

    Added: You should also anchor your regex; otherwise it will match any line in which it can find 4 r's, even if there are more r's in that line.

      code:

      /^[^r]*r[^r]*r[^r]*r[^r]*r[^r]*$/
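    As a sketch, the repeated "r[^r]*" part can also be written once with a quantifier. The names $four_r and matches_four_r are hypothetical, not part of the original reply.

```perl
use strict;
use warnings;

# Anchored pattern: leading non-r characters, then exactly four
# repetitions of "an r followed by any number of non-r characters".
my $four_r = qr/^[^r]*(?:r[^r]*){4}$/;

# Hypothetical helper returning 1 or 0.
sub matches_four_r {
    my ($line) = @_;
    return $line =~ $four_r ? 1 : 0;
}

for my $line ("error bar\n", "rrrrr\n", "river\n") {
    print $line if matches_four_r($line);
}
```

    Changing the {4} to another number adapts the pattern to a different count without retyping the group.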
Re: extract sentences with certain number of a character
by ww (Archbishop) on Apr 27, 2008 at 14:20 UTC

    Or (this horse ain't quite dead, yet),

    #!/usr/bin/perl
    use strict;
    use warnings;

    my(@lines, $line);

    print "\n\tUsing tr\n";
    @lines = <DATA>;
    for $line(@lines) {
        chomp $line;
        my $line1 = $line;
        if ( ($line1 =~ tr/[r|R]/r/) == 4 ) {    # Case insensitive
            print "$line1\n";
        }
    }

    print "\n\t using match:\n";
    for $line(@lines) {
        if ( $line =~ /
                ^           # start at beginning of line
                ([^rR]*)    # 0 or more non-r chars
                r           # match "r"
                ([^rR]*)    # negative lookahead, 0 or more non-r chars
                r
                ([^rR]*)    # alternately, could be ?=[^r]*
                r
                ([^rR]*)
                r
                ([^rR]*)
                $           # end of line
                /ix) {      # extended syntax; end of match; end of condition
            print "$line\n"
        }
    }

    print "\n\t using match2:\n";
    for $line(@lines) {
        if ( $line =~ /^([^r]*|[^R]*)r([^r]*|[^R]*)r([^r]*|[^R]*)r([^r]*|[^R]*)r([^r]*|[^R]*)$/i) {
            print "$line\n";
        }
    }

    print "\n and the data is:\n";
    for $line(@lines) {
        print $line . "\n";
    }
    print "\nDone\n";

    __DATA__
    There are four "r"s in this sentence. 4
    This one has how many? 0
    None in this. 0
    But where can there be as many words with 'r's as there are here? 7
    Still, rrrr makes no sense. 4
    Drill for sentences with multiple instances of "are" and "were" regularly. 5
    There are four "r"s in this sentence. 4
    This one has how many? 0
    None in this. 0
    But where can there be as many words with 'r's as there are here? 7
    Still, rrrr makes no sense. 4
    Drill for sentences with multiple instances of "are" and "were" regularly. 5
    Argh. Right you are, Randy! 4 matches if insensitive (but only 2 match when case sensitive).

    output

            Using tr
    There are four "r"s in this sentence. 4
    Still, rrrr makes no sense. 4
    There are four "r"s in this sentence. 4
    Still, rrrr makes no sense. 4
    Argh. right you are, randy! 4 matches if insensitive but only 2 match when case sensitive.

             using match:
    There are four "r"s in this sentence. 4
    Still, rrrr makes no sense. 4
    There are four "r"s in this sentence. 4
    Still, rrrr makes no sense. 4
    Argh. Right you are, Randy! 4 matches if insensitive but only 2 match when case sensitive.

             using match2:
    There are four "r"s in this sentence. 4
    Still, rrrr makes no sense. 4
    There are four "r"s in this sentence. 4
    Still, rrrr makes no sense. 4
    Argh. Right you are, Randy! 4 matches if insensitive but only 2 match when case sensitive.

     and the data is:
    ....

    Re-updated; Found copied wrong right code. duh!

    And, of course, this being Perl, there are many other ways, too. One could (if one wished to besmirch the virtue of laziness) capture each [rR], push them onto an array, and count the elements. But bottom line: tr/r/r/ (preserving the original data, case sensitive) or tr/[r|R]/r/ (counting the upper case "R"s, but losing the original case) may be preferred.
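    The "capture and count" alternative might look like the sketch below. The sub name count_r_any_case is made up for illustration; a global match in list context returns every match, so no explicit push is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: count r and R without modifying the line.
sub count_r_any_case {
    my ($line) = @_;
    # In list context, /g returns all matches; /i makes it case insensitive.
    my @hits = $line =~ /r/gi;
    return scalar @hits;
}

my $line = "Argh. Right you are, Randy!";
print "$line\n" if count_r_any_case($line) == 4;
```

    Unlike a tr/// with a replacement list, this leaves the original case of the line intact.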

      If I remember correctly (perldoc perlop):

      tr/// does not support character classes, so tr/[r|R]// counts any of the characters [, r, |, R and ].

      And character classes don't need | for the alternatives; so [rR] would be enough for a class matching r and R.

      tr/rR// should be enough to count the occurrences of r's and R's.
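      A small sketch of the difference, using a made-up test string that contains brackets and a pipe (the variable names are hypothetical):

```perl
use strict;
use warnings;

# Illustrative string with two r's, two R's, and the characters [ ] |
my $text = 'Argh. [Right] you are | Randy!';

# tr/// treats its search list as plain characters, not as a regex class.
my $letters_only = ($text =~ tr/rR//);      # counts r and R only
my $with_punct   = ($text =~ tr/[r|R]//);   # also counts [, | and ]

print "tr/rR//    counted $letters_only\n";
print "tr/[r|R]// counted $with_punct\n";
```

      On this string the first count is 4 and the second is 7, because the two brackets and the pipe are counted as well.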

Re: extract sentences with certain number of a character
by FunkyMonk (Bishop) on Apr 27, 2008 at 09:20 UTC
    Homework?

    Why don't you show us what you've tried so far, or ask us for help on what you're stuck with?

    Have you read perlretut?

    Update: It's obviously too early for me. Sorry, AnonyMonk

    Update^2: Have a look at "Using character classes", particularly the bit about negated character classes.


    Unless I state otherwise, my code all runs with strict and warnings
Re: extract sentences with certain number of a character
by toolic (Bishop) on Apr 27, 2008 at 18:18 UTC
    Yet another way to skin this cat (sorry for extending ww's animal metaphors...).

    When I hear "count characters", grep in scalar context is usually the first thing to enter my brain.

    use warnings;
    use strict;

    while (<DATA>) {
        print if ((grep /r/, split //) == 4);
    }

    __DATA__
    r 3 r r rr 4 r r r r nada nothing zilch to many rr rrrr rrr 's

    Not sure how this compares to others' regex and tr solutions from a performance standpoint (and I'm too Lazy to benchmark it :).
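    For anyone less Lazy, a comparison could be sketched with the core Benchmark module. The sub names and the sample line are assumptions made for illustration, and no timings are claimed here; results will depend on the Perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative input line containing exactly four lowercase r's.
my $line = qq{There are four "r"s in this sentence.\n};

# Hypothetical wrappers around the three approaches from this thread.
sub by_tr    { my $s = shift; return ($s =~ tr/r//) == 4 ? 1 : 0 }
sub by_regex { my $s = shift; return $s =~ /^[^r]*(?:r[^r]*){4}$/ ? 1 : 0 }
sub by_grep  { my $s = shift; return ((grep { $_ eq 'r' } split //, $s) == 4) ? 1 : 0 }

# A negative count asks Benchmark to run each candidate for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, {
    'tr'    => sub { by_tr($line) },
    'regex' => sub { by_regex($line) },
    'grep'  => sub { by_grep($line) },
});
```

    All three wrappers are case sensitive, so they should agree on any given line; only the speed is expected to differ.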