regular expressions

mbgbioinfo has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regular expressions by toolic (Bishop) on Jun 06, 2015 at 20:47 UTC
Your grep filters out entire lines. Do you have multiple words on each line? If so, a single word with 4 consecutive consonants is enough for the whole line to match. Another way, using a negated character class:

```perl
use warnings;
use strict;
use Data::Dumper;

my @words;
while (<DATA>) {
    chomp;
    push @words, grep { /[^aeiouy]{4}/i } split;
}
print Dumper(\@words);

__DATA__
abc def ghi AAAAAA jlkm opqr jhggjyg 123 annn jkjkkj bcdefgh
```

Prints:

```
$VAR1 = [
          'jlkm',
          'jhggjyg',
          'jkjkkj'
        ];
```
Re^2: regular expressions by Laurent_R (Canon) on Jun 07, 2015 at 15:34 UTC
I do not think that a negated character class is a good idea for finding groups of consonants because, for example, it will also pick up groups of digits, as shown below under the Perl debugger:

```
  DB<1> $_ ="123 annn jkjkkj bcdefgh 2015 ";
  DB<2> push @words, grep { /[^aeiouy]{4}/i } split;
  DB<3> x \@words;
0  ARRAY(0x600500b18)
   0  'jkjkkj'
   1  2015
  DB<4>
```
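A minimal sketch of the same contrast as a standalone script, reusing the sample string from the debugger session above. The sub names `loose_matches` and `strict_matches` are made up for illustration, and the explicit consonant list assumes that y counts as a vowel, as in the negated class.

```perl
use strict;
use warnings;

# Hypothetical helper: words containing 4 consecutive characters that are
# not vowels. Digits, underscores and punctuation also qualify here.
sub loose_matches {
    my ($text) = @_;
    return grep { /[^aeiouy]{4}/i } split ' ', $text;
}

# Hypothetical helper: words containing 4 consecutive letters taken from an
# explicit consonant list (assumption: y is treated as a vowel).
sub strict_matches {
    my ($text) = @_;
    return grep { /[bcdfghjklmnpqrstvwxz]{4}/i } split ' ', $text;
}

my $text = "123 annn jkjkkj bcdefgh 2015 ";
print "loose:  @{[ loose_matches($text) ]}\n";
print "strict: @{[ strict_matches($text) ]}\n";
```

Run as written, the loose version should report `jkjkkj 2015` and the strict version only `jkjkkj`.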
Re^3: regular expressions by AnomalousMonk (Archbishop) on Jun 07, 2015 at 18:02 UTC
I agree that doubly-negated character classes can be very tricky, but with care, they can be managed to good effect. I think of it this way: Start with `[^\W]`, which is the same as `[\w]` (or just `\w`). As you point out, this includes digits and _ (underscore) as well as alphas. "Subtract", as it were, the digits with `[^\W\d]` and the underscore with `[^\W\d_]` and you're left with all alpha characters. Then subtract your chosen vowels with `[^\W\d_aeiouyAEIUOY]` and you're done!

```
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = '123 annn xyzzy wwwewww xxx9xxx vvv_vvv eieio p pp ppp 2015 vwxz vwxzpdq';
 ;;
 my $consonant = qr{ [^\W\d_aeiouyAEIUOY] }xms;
 ;;
 printf qq{'$_' } for $s =~ m{ $consonant{4,} }xmsg;
"
'vwxz' 'vwxzpdq'
```

All this is easier to manage, IMHO, with POSIX character classes or Unicode properties (if you're brave enough to venture out onto the thin, slippery ice of Unicode); both of the following definitions work the same in the code above:

`my $consonant = qr{ [^[:^alpha:]aeiouyAEIUOY] }xms;`

`my $consonant = qr{ [^\P{PosixAlpha}aeiouyAEIUOY] }xms;`

YMMV. See perlrecharclass, perluniprops. (See also the experimental Extended Bracketed Character Classes of version 5.18+; I can't give any examples using these ATM.)

Give a man a fish: `<%-(-(-(-<`
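A rough sketch for checking that the three consonant definitions from the post above agree on the sample string. The hash keys and the `consonant_runs` helper are hypothetical names, and the check only covers plain ASCII input. The quantified atom is wrapped in `(?:...)` so that `{4,}` cannot be mistaken for a hash subscript.

```perl
use strict;
use warnings;

# The three definitions quoted in the post, keyed by made-up labels.
my %consonant = (
    negated_w => qr{ [^\W\d_aeiouyAEIUOY] }xms,
    posix     => qr{ [^[:^alpha:]aeiouyAEIUOY] }xms,
    property  => qr{ [^\P{PosixAlpha}aeiouyAEIUOY] }xms,
);

# Hypothetical helper: every run of 4 or more consonants under one definition.
sub consonant_runs {
    my ($name, $s) = @_;
    my $c = $consonant{$name};
    return $s =~ m{ (?:$c){4,} }xmsg;
}

my $s = '123 annn xyzzy wwwewww xxx9xxx vvv_vvv eieio p pp ppp 2015 vwxz vwxzpdq';
for my $name (sort keys %consonant) {
    my @runs = consonant_runs($name, $s);
    print "$name: @runs\n";
}
```

Each definition should report the same two runs, `vwxz` and `vwxzpdq`, matching the output shown in the post.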
Re^4: regular expressions by Laurent_R (Canon) on Jun 07, 2015 at 18:43 UTC
Re^5: regular expressions by Anonymous Monk on Jun 07, 2015 at 19:03 UTC
Re: regular expressions by stevieb (Canon) on Jun 06, 2015 at 19:48 UTC
Here's one way you could pull out the words. I didn't use grep(); I just compared each word against the regex directly. I also changed your open() statement to follow the recommended three-argument form, and used ranges in the regex just so you're aware they are available. Note the 'i' after the regex; that makes the regex case-insensitive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', "input.txt"
    or die "Can't open the file: $!";

my @words;

for my $line (<$fh>){
    chomp $line;
    for my $word (split(/\s+/, $line)){
        if ($word =~ /[b-df-hj-np-tv-z]{4}/i){
            push @words, $word;
        }
    }
}

print "$_\n" for @words;
```

-stevieb
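One detail worth noting about the ranges: `v-z` includes y, whereas several other replies in this thread list y among the vowels. The sketch below compares the two choices. The `$without_y` pattern and the `has_run` helper are assumptions added for illustration; which behaviour is wanted depends on the original question, which is not shown here.

```perl
use strict;
use warnings;

my $with_y    = qr/[b-df-hj-np-tv-z]{4}/i;    # ranges from the post; v-z covers y
my $without_y = qr/[b-df-hj-np-tv-xz]{4}/i;   # assumption: y treated as a vowel

# Hypothetical helper: 1 if the word contains a run matching the pattern, else 0.
sub has_run {
    my ($re, $word) = @_;
    return $word =~ $re ? 1 : 0;
}

for my $word (qw(rhythm strengths jhggjyg)) {
    printf "%-10s with y: %d  without y: %d\n",
        $word, has_run($with_y, $word), has_run($without_y, $word);
}
```

Only `rhythm` should differ between the two patterns, since its sole run of four or more depends on counting y as a consonant.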
Re: regular expressions by Anonymous Monk on Jun 06, 2015 at 20:43 UTC
`perl -lne 'print for /\w[^\WaeiouyAEIOUY]{4,}\w/g' /usr/share/dict/words`
Re: regular expressions by Marshall (Canon) on Jun 07, 2015 at 15:36 UTC
This is actually pretty good. But... One flaw is that the regex does not capture multiple tokens that meet the pattern; the parens below do that and the result is an array. This is called "match global" in Perl lingo. Another problem is that the regex syntax to match 4 or more is not quite right. `{4,}` should be `{4,}?`. The first version would just match 4 at a minimum, but no more. That following ? does matter!

Also, to split on "words" (space-separated tokens), I used the default "split". There are actually 2 different versions of this "default" split, one without parens and one with parens, and they work slightly differently when dealing with the beginning of a line. Here, it makes no difference.

I also used a Perl "trick" that can embed comments within the code. This "trick" can also be used to generate documentation in web format. Here I just used it to put my output/comments into the compilable and runnable code. That way I don't have to send you 2 different files, one with code and one with output.

Oh, using the -w switch for a single program like this turns on warnings. The "use warnings;" is not necessary. This also works under Windows. Wow!

I always use strict; and use warnings;. There is a small performance hit for this, but it is almost always worth it. Keep doing that!

```perl
#!/usr/bin/perl -w
use strict;

while (<DATA>)
{
   print "INPUT LINE: $_";
   my @four_constants =
      grep{/([bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]{4,}?)/g} split;
      #the ? allows more than a min of 4!
   next unless @four_constants;
   print "output: @four_constants", "\n";
}

=EXAMPLE OUTPUT
INPUT LINE: xyy xyz
INPUT LINE: bBbB
output: bBbB
INPUT LINE: abc bacx
INPUT LINE: abca xyzz
INPUT LINE: abCA XXZZ
output: XXZZ
INPUT LINE: xxyyzzz
INPUT LINE: bckz klmx
output: bckz klmx
INPUT LINE: BKZXXXXXXXXXXXX
output: BKZXXXXXXXXXXXX
=cut

__DATA__
xyy xyz
bBbB
abc bacx
abca xyzz
abCA XXZZ
xxyyzzz
bckz klmx
BKZXXXXXXXXXXXX
```
Re^2: regular expressions by AnomalousMonk (Archbishop) on Jun 07, 2015 at 16:45 UTC
"... the regex syntax to match 4 or more is not quite right. `{4,}` should be `{4,}?`. The first version would just match 4 at a minimum, but no more."

The quantifier `{4,}` will match as much as possible (while still allowing an overall match), but at least four of the quantified atom. The quantifier `{4,}?` will match as little as necessary for an overall match, but at least four of the quantified atom.

```
c:\@Work\Perl\monks>perl -wMstrict -le
"my @strings = qw(vw vwx vwxz vwxzp vwxzpd vwxzpdq);
 ;;
 my $consonant = qr{ [bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ] }xms;
 ;;
 for my $s (@strings) {
   print qq{'$s'};
   print qq{{4,}  matched; captured '$1'} if $s =~ m{ ($consonant{4,}) }xms;
   print qq{{4,}? matched; captured '$1'} if $s =~ m{ ($consonant{4,}?) }xms;
   print '';
   }
"
'vw'

'vwx'

'vwxz'
{4,}  matched; captured 'vwxz'
{4,}? matched; captured 'vwxz'

'vwxzp'
{4,}  matched; captured 'vwxzp'
{4,}? matched; captured 'vwxz'

'vwxzpd'
{4,}  matched; captured 'vwxzpd'
{4,}? matched; captured 'vwxz'

'vwxzpdq'
{4,}  matched; captured 'vwxzpdq'
{4,}? matched; captured 'vwxz'
```

See perlre, perlretut, and perlrequick.
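As a small follow-on sketch, the greedy `{4,}` can be used in two ways: inside grep to keep whole words, or in a list-context match with /g to pull out the consonant runs themselves. The helper names `words_with_runs` and `runs_only` and the sample line are invented for illustration, and the consonant list assumes y is a vowel.

```perl
use strict;
use warnings;

my $consonant = qr/[bcdfghjklmnpqrstvwxz]/i;

# Hypothetical helper: grep keeps each whole word that contains a run of 4+.
sub words_with_runs {
    my ($line) = @_;
    return grep { /(?:$consonant){4,}/ } split ' ', $line;
}

# Hypothetical helper: a list-context match with /g returns the captured runs.
sub runs_only {
    my ($line) = @_;
    return $line =~ /((?:$consonant){4,})/g;
}

my $line = "abCA XXZZ bckzae strengths";
print "words: @{[ words_with_runs($line) ]}\n";
print "runs:  @{[ runs_only($line) ]}\n";
```

For this sample line the first call should return `XXZZ bckzae strengths` and the second `XXZZ bckz ngths`.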