mbgbioinfo has asked for the wisdom of the Perl Monks concerning the following question:
Dear PerlMonks,
I would like to ask for your wisdom once again. I want to create a program which will open a file, read all its lines into an array and find the words which have 4 or more consonants in a row. I wrote the following program, but my terminal shows all the words (it's as if grep is not working).
#!/usr/bin/perl -w
use strict;
use warnings;
open(MYFILE, "fil") || die "$!";
my@fil=<MYFILE>;
close(MYFILE);
chomp(@fil);
my@outcome=grep(/[bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]{4,}/, @fil);
print @outcome, "\n";
Re: regular expressions
by toolic (Bishop) on Jun 06, 2015 at 20:47 UTC
Your grep filters out entire lines. Do you have multiple words on each line? If so, all you need is one word on a line to have 4 consecutive consonants to get a match.
Another way, using a negated character class:
use warnings;
use strict;
use Data::Dumper;
my @words;
while (<DATA>) {
    chomp;
    push @words, grep { /[^aeiouy]{4}/i } split;
}
print Dumper(\@words);
__DATA__
abc def ghi AAAAAA
jlkm
opqr jhggjyg
123 annn jkjkkj bcdefgh
Prints:
$VAR1 = [
'jlkm',
'jhggjyg',
'jkjkkj'
];
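To make the line-versus-word point concrete, here is a minimal sketch. The helper names `matching_lines` and `matching_words` and the sample lines are made up for illustration, and y is treated as a vowel, as in the character class from the question.

```perl
use strict;
use warnings;

# Four or more consecutive consonants (y is left out, as in the question).
my $cluster = qr/[b-df-hj-np-tv-xz]{4,}/i;

# Hypothetical helper: filter whole lines, so every word on a
# matching line is kept.
sub matching_lines { return grep { /$cluster/ } @_ }

# Hypothetical helper: split each line into words first, then filter,
# so only the matching words are kept.
sub matching_words { return grep { /$cluster/ } map { split(' ', $_) } @_ }

my @lines = ('the rhythms of strength', 'a quiet idea');
print join('|', matching_lines(@lines)), "\n";   # the whole first line
print join('|', matching_words(@lines)), "\n";   # rhythms|strength
```

With one word per line in the input file the two approaches give the same result; they only differ when a line holds several words.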
I do not think that a negated character class is a good idea for looking for groups of consonants, because, for example, it will pick groups of digits, as shown below under the Perl debugger:
DB<1> $_ ="123 annn jkjkkj bcdefgh 2015 ";
DB<2> push @words, grep { /[^aeiouy]{4}/i } split;
DB<3> x \@words;
0 ARRAY(0x600500b18)
0 'jkjkkj'
1 2015
DB<4>
I agree that doubly-negated character classes can be very tricky, but with care, they can be managed to good effect.
I think of it this way: Start with [^\W] which is the same as [\w] (or just \w). As you point out, this includes digits and _ (underscore) as well as alphas. "Subtract", as it were, the digits with [^\W\d] and underscore with [^\W\d_] and you're left with all alpha characters. Then subtract your chosen vowels [^\W\d_aeiouyAEIUOY] and you're done!
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = '123 annn xyzzy wwwewww xxx9xxx vvv_vvv eieio p pp ppp 2015 vwxz vwxzpdq';
;;
my $consonant = qr{ [^\W\d_aeiouyAEIUOY] }xms;
;;
printf qq{'$_' } for $s =~ m{ $consonant{4,} }xmsg;
"
'vwxz' 'vwxzpdq'
All this is easier to manage, IMHO, with POSIX character classes or Unicode properties (if you're brave enough to venture out onto the thin, slippery ice of Unicode); both the following definitions work the same in the code above:
my $consonant = qr{ [^[:^alpha:]aeiouyAEIUOY] }xms;
my $consonant = qr{ [^\P{PosixAlpha}aeiouyAEIUOY] }xms;
YMMV. See perlrecharclass, perluniprops.
(See also the experimental Extended Bracketed Character Classes of version 5.18+; I can't give any examples using these ATM.)
Give a man a fish: <%-(-(-(-<
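As a rough check of the claim that the three definitions behave the same, the sketch below runs each of them over the sample string from the post. The hash `%definition` and the helper `clusters` are names invented here, and the comparison only covers plain ASCII input.

```perl
use strict;
use warnings;

# The three consonant definitions discussed above, keyed by a made-up label.
my %definition = (
    double_negated => qr{ [^\W\d_aeiouyAEIOUY] }xms,
    posix_class    => qr{ [^[:^alpha:]aeiouyAEIOUY] }xms,
    unicode_prop   => qr{ [^\P{PosixAlpha}aeiouyAEIOUY] }xms,
);

# Hypothetical helper: return every run of 4 or more consonants in $text.
sub clusters {
    my ($class, $text) = @_;
    return $text =~ m{ (?:$class){4,} }xmsg;
}

my $s = '123 annn xyzzy wwwewww xxx9xxx vvv_vvv eieio p pp ppp 2015 vwxz vwxzpdq';
for my $name (sort keys %definition) {
    print "$name: @{[ clusters($definition{$name}, $s) ]}\n";
}
```

The class is wrapped in `(?:...)` before the quantifier purely as a precaution, so that `{4,}` cannot be mistaken for a hash subscript during interpolation.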
Re: regular expressions
by stevieb (Canon) on Jun 06, 2015 at 19:48 UTC
Here's one way you could grab out the words. I didn't use grep(), I just did a comparison of each word against the regex directly. I also changed your open() statement to coincide with the recommended way to write them, and used ranges in the regex just so you're aware they are available. Note the 'i' after the regex; that's to make the regex case-insensitive.
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, '<', "input.txt"
or die "Can't open the file: $!";
my @words;
for my $line (<$fh>){
    chomp $line;
    for my $word (split(/\s+/, $line)){
        if ($word =~ /[b-df-hj-np-tv-z]{4}/i){
            push @words, $word;
        }
    }
}
print "$_\n" for @words;
-stevieb
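One detail worth checking when switching from an explicit letter list to ranges is whether both cover the same letters. The sketch below compares the lowercase half of the class from the question with the range-based class above; the helper `class_difference` is a name invented for this illustration.

```perl
use strict;
use warnings;

# Lowercase half of the explicit class from the question, and the
# range-based class from the reply above.
my $explicit = qr/[bcdfghjklmnpqrstvwxz]/;
my $ranges   = qr/[b-df-hj-np-tv-z]/;

# Hypothetical helper: letters accepted by the first class but not the second.
sub class_difference {
    my ($re_a, $re_b) = @_;
    return grep { /$re_a/ && !/$re_b/ } 'a' .. 'z';
}

print "only in ranges:   @{[ class_difference($ranges, $explicit) ]}\n";
print "only in explicit: @{[ class_difference($explicit, $ranges) ]}\n";
```

The range `v-z` includes y, which the explicit list leaves out. Whether y should count as a consonant is a judgment call; writing the last range as `v-xz` would exclude it.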
Re: regular expressions
by Anonymous Monk on Jun 06, 2015 at 20:43 UTC
perl -lne 'print for /\w*[^\WaeiouyAEIOUY]{4,}\w*/g' /usr/share/dict/words
Re: regular expressions
by Marshall (Canon) on Jun 07, 2015 at 15:36 UTC
This is actually pretty good. But...
One flaw is that the regex does not capture multiple tokens that meet the pattern; the parens below do that and the result is an array. This is called "match global" in Perl lingo.
Another problem is that the regex syntax to match 4 or more is not quite right. {4,} should be {4,}?. The first version would just match 4 at a minimum, but no more. That following ? does matter!
Also to split on "words", space separated tokens, I used the default "split". There are actually 2 different versions of this "default" split. One without parens and one with parens and they work slightly differently when dealing with the beginning of a line. Here, it makes no difference.
I also used a Perl "trick" that can embed comments within the code. This "trick" can also be used to generate documentation in web format. Here I just used it to put my output/comments into the compilable and runnable code. That way I don't have to send you 2 different files, one with code and one with output.
Oh, using the -w switch for a single program like this turns on warnings. The "use warnings;" is not necessary. This also works under Windows. Wow!
I always use strict; and use warnings;. There is a small performance hit for this. But it is almost always worth it. Keep doing that!
#!/usr/bin/perl -w
use strict;
while (<DATA>)
{
    print "INPUT LINE: $_";
    my @four_constants =
        grep{/([bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]{4,}?)/g}
        split; #the ? allows more than a min of 4!
    next unless @four_constants;
    print "output: @four_constants", "\n";
}
=EXAMPLE OUTPUT
INPUT LINE: xyy xyz
INPUT LINE: bBbB
output: bBbB
INPUT LINE: abc bacx
INPUT LINE: abca xyzz
INPUT LINE: abCA XXZZ
output: XXZZ
INPUT LINE: xxyyzzz
INPUT LINE: bckz klmx
output: bckz klmx
INPUT LINE: BKZXXXXXXXXXXXX
output: BKZXXXXXXXXXXXX
=cut
__DATA__
xyy xyz
bBbB
abc bacx
abca xyzz
abCA XXZZ
xxyyzzz
bckz klmx
BKZXXXXXXXXXXXX
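On the remark about two forms of the default split: the difference I know of from perldoc -f split is not about parentheses but about the pattern. A bare `split` behaves like `split ' ', $_` and skips leading whitespace, while an explicit `/\s+/` pattern produces an empty leading field. A minimal sketch, with the helper names `words_default` and `words_regex` invented here:

```perl
use strict;
use warnings;

# Hypothetical helper: the special ' ' pattern (what a bare split uses)
# skips leading whitespace.
sub words_default { my ($line) = @_; my @w = split ' ', $line; return @w }

# Hypothetical helper: an explicit /\s+/ pattern keeps an empty first
# field when the line starts with whitespace.
sub words_regex { my ($line) = @_; my @w = split /\s+/, $line; return @w }

my $line = "  bckz klmx";
my @default = words_default($line);
my @regex   = words_regex($line);
printf "default: %d fields\n", scalar @default;   # 2 fields
printf "regex:   %d fields\n", scalar @regex;     # 3 fields, the first is empty
```

For this task the extra empty field is harmless, since an empty string can never contain four consonants.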
... the regex syntax to match 4 or more is not quite right. {4,} should be {4,}?. The first version would just match 4 at a minimum, but no more.
The quantifier {4,} will match as much as possible (while still allowing an overall match), but at least four of the quantified atom. The quantifier {4,}? will match as little as necessary for an overall match, but at least four of the quantified atom.
c:\@Work\Perl\monks>perl -wMstrict -le
"my @strings = qw(vw vwx vwxz vwxzp vwxzpd vwxzpdq);
;;
my $consonant = qr{ [bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ] }xms;
;;
for my $s (@strings) {
print qq{'$s'};
print qq{{4,} matched; captured '$1'} if $s =~ m{ ($consonant{4,}) }xms;
print qq{{4,}? matched; captured '$1'} if $s =~ m{ ($consonant{4,}?) }xms;
print '';
}
"
'vw'
'vwx'
'vwxz'
{4,} matched; captured 'vwxz'
{4,}? matched; captured 'vwxz'
'vwxzp'
{4,} matched; captured 'vwxzp'
{4,}? matched; captured 'vwxz'
'vwxzpd'
{4,} matched; captured 'vwxzpd'
{4,}? matched; captured 'vwxz'
'vwxzpdq'
{4,} matched; captured 'vwxzpdq'
{4,}? matched; captured 'vwxz'
See perlre, perlretut, and perlrequick.
Give a man a fish: <%-(-(-(-<
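A consequence of the above for the original task: when the regex is only used as a filter inside grep, the greedy and lazy forms should select exactly the same words, because grep only asks whether a match exists and ignores how much was captured. A small sketch, with the helper names `pick_greedy` and `pick_lazy` invented here and y again treated as a vowel:

```perl
use strict;
use warnings;

my $consonant = qr/[b-df-hj-np-tv-xz]/i;

# Hypothetical helpers: the same filter, once with each quantifier form.
sub pick_greedy { return grep { /(?:$consonant){4,}/  } @_ }
sub pick_lazy   { return grep { /(?:$consonant){4,}?/ } @_ }

my @words = qw(vw vwx vwxz vwxzp BKZXXXXXXXXXXXX abca);
print "greedy: @{[ pick_greedy(@words) ]}\n";
print "lazy:   @{[ pick_lazy(@words) ]}\n";
```

Both lines list vwxz, vwxzp and BKZXXXXXXXXXXXX. The choice of quantifier only matters once the captured text itself is used.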