This is actually pretty good. But...

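One flaw is that the regex does not capture multiple tokens that meet the pattern. Below, grep tests every token on the line and returns all of the ones that match, so the result is an array. In Perl lingo, /g is "match global": in list context it returns every match on the line. Inside grep's block the match is only a true/false test, so the parens and the /g are not strictly needed there.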
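
A minimal sketch of the two ways to get "all of them", assuming a made-up sample line (it is not from the original question). grep keeps whole tokens that contain a match; a /g match in list context returns the matched runs themselves.

```perl
use strict;
use warnings;

# Hypothetical sample line, chosen so tokens and runs differ.
my $line = 'abckze  klmx abc';

# Same consonant set as the script, written once and made case-insensitive.
my $run = qr/[bcdfghjklmnpqrstvwxz]{4,}/i;

# grep tests each whitespace-separated token and keeps those that match.
my @tokens = grep { /$run/ } split ' ', $line;

# A /g match in list context returns every matched substring on the line.
my @runs = $line =~ /$run/g;

print "tokens: @tokens\n";
print "runs:   @runs\n";
```

Which one you want depends on whether the output should be the whole word or only the consonant run inside it.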

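Another point is the quantifier for "4 or more". {4,} already means four or more and is greedy: it takes as many characters as it can. {4,}? is the non-greedy form: still four or more, but it stops as soon as it has four. For a yes/no test like the grep below, both accept exactly the same tokens, so the trailing ? is harmless but not required.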
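
To see what the trailing ? actually changes, here is a small sketch using the last line of the sample data. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $word = 'BKZXXXXXXXXXXXX';   # last line of the sample data

# Greedy: four or more, as many as possible.
my ($greedy) = $word =~ /([bcdfghjklmnpqrstvwxz]{4,})/i;

# Non-greedy: four or more, as few as possible.
my ($lazy) = $word =~ /([bcdfghjklmnpqrstvwxz]{4,}?)/i;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";

# As a plain true/false test, both forms agree.
my $greedy_ok = $word =~ /[bcdfghjklmnpqrstvwxz]{4,}/i  ? 1 : 0;
my $lazy_ok   = $word =~ /[bcdfghjklmnpqrstvwxz]{4,}?/i ? 1 : 0;
print "same verdict: ", ($greedy_ok == $lazy_ok ? "yes" : "no"), "\n";
```

The difference only shows up in what gets captured, not in whether the token passes the test.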

Also, to split on "words" (space-separated tokens), I used the default split. There are actually 2 whitespace splits that look alike: the default, which is the same as split ' ' and skips leading whitespace, and split /\s+/, which returns an empty first field when the line starts with whitespace. They only differ at the beginning of a line. Here, it makes no difference.
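
A short sketch of that beginning-of-line difference, assuming a made-up line with leading spaces:

```perl
use strict;
use warnings;

# Hypothetical input with leading whitespace.
my $line = "  bckz  klmx";

# split ' ' (what a bare "split" does on $_) skips leading whitespace.
my @default = split ' ', $line;

# split /\s+/ keeps an empty field in front of the leading whitespace.
my @pattern = split /\s+/, $line;

print "split ' '   gives ", scalar(@default), " fields\n";
print "split /\\s+/ gives ", scalar(@pattern), " fields\n";
```

With grep filtering on the consonant pattern afterwards, the empty field would be dropped anyway, which is why it does not matter in the script.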

I also used a Perl "trick" that can embed comments within the code: POD (Plain Old Documentation). Perl skips everything from a line that starts with =name up to the =cut line. The same markup can also be used to generate documentation in web format (pod2html). Here I just used it to put my output/comments into the compilable and runnable code. That way I don't have to send you 2 different files, one with code and one with output.

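Oh, using the -w switch on the #! line turns on warnings for the whole program, so for a single program like this a separate "use warnings;" is not necessary. This also works under Windows, because perl itself reads the switches from the #! line. Wow!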
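
One difference worth knowing: -w is global, while "use warnings;" is lexically scoped and can be turned off for a single block. A minimal sketch, assuming a handler that just collects the warnings (the variable names are illustrative):

```perl
use strict;
use warnings;

# Collect warnings in an array instead of printing them to STDERR.
my @caught;
$SIG{__WARN__} = sub { push @caught, $_[0] };

my $undef;

{
    # Warnings of this category are switched off inside this block only.
    no warnings 'uninitialized';
    my $quiet = "value: " . $undef;
}

# Outside the block the same expression warns again.
my $loud = "value: " . $undef;

print "warnings caught: ", scalar(@caught), "\n";
```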

I always "use strict;" and "use warnings;" (or -w). There is a small performance hit for this, but it is almost always worth it. Keep doing that!

#!/usr/bin/perl -w
use strict;

while (<DATA>)
{
   print "INPUT LINE: $_";

   my @four_consonants =
      grep{/([bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]{4,}?)/g}
      split; # {4,}? = a run of 4 or more consonants, non-greedy
   next unless @four_consonants;
   print "output: @four_consonants", "\n";
}

=EXAMPLE OUTPUT

INPUT LINE: xyy xyz
INPUT LINE: bBbB
output: bBbB
INPUT LINE: abc bacx
INPUT LINE: abca    xyzz
INPUT LINE: abCA    XXZZ
output: XXZZ
INPUT LINE: xxyyzzz
INPUT LINE: bckz  klmx
output: bckz klmx
INPUT LINE: BKZXXXXXXXXXXXX
output: BKZXXXXXXXXXXXX


=cut

__DATA__
xyy xyz
bBbB
abc bacx
abca    xyzz
abCA    XXZZ
xxyyzzz
bckz  klmx
BKZXXXXXXXXXXXX

In reply to Re: regular expressions by Marshall
in thread regular expressions by mbgbioinfo