Grep 1st word of strings in array

imtakinbioinformatic has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Grep 1st word of strings in array by NetWallah (Canon) on Mar 07, 2012 at 15:31 UTC
You have probably misunderstood where the results of a regular expression match end up. In your case, the contents of $line do NOT change after a match attempt. If the match succeeds, the captured data is placed in $1, $2, and so on. So, instead of pushing $line, try: `if ($line =~ /^>(\w+)\s(.+)$/) { push(@seqnames, $1); }`
Alternatively, you could do something like: `if (my ($firstword, $rest_of_it) = $line =~ /^>(\w+)\s(.+)$/) { push @seqnames, $firstword; }` which is more readable, IMHO. Note the use of "my" with parentheses, which provides LIST context for the match results.
Update: I just noticed from your commented-out 'grep' that you apparently expect the captures from the regex inside 'grep' to be placed in the results. This is not the case. To achieve that, you should use 'map', which TRANSFORMS data. 'grep' merely FILTERS data.
P.S. Welcome to the monastery.
All great truths begin as blasphemies. -- George Bernard Shaw, writer, Nobel laureate (1856-1950)
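A minimal sketch of the points above. The original poster's input was not shown, so the contents of @DNA here are made-up sample lines, and @via_list, @filtered and @mapped are names chosen only for illustration. It shows captures arriving in $1 or in a list assignment, and the difference between grep and map.

```perl
use strict;
use warnings;

# Hypothetical sample data; the original input was not posted.
my @DNA = ('>seq1 Homo sapiens', 'ACGTACGT', '>seq2 Mus musculus', 'TTGACA');

my (@seqnames, @via_list);
foreach my $line (@DNA) {
    # A successful match does not modify $line; the capture lands in $1.
    if ($line =~ /^>(\w+)\s(.+)$/) {
        push @seqnames, $1;
    }
    # Same match in list context: captures are assigned to named variables.
    if (my ($firstword, $rest_of_it) = $line =~ /^>(\w+)\s(.+)$/) {
        push @via_list, $firstword;
    }
}

# grep FILTERS: it returns the original elements that matched, unchanged.
my @filtered = grep { /^>(\w+)/ } @DNA;

# map TRANSFORMS: in list context the match returns its captures,
# and an empty list for elements that do not match.
my @mapped = map { /^>(\w+)/ } @DNA;

print "seqnames: @seqnames\n";
print "filtered: @filtered\n";
print "mapped:   @mapped\n";
```

With this sample data, @filtered still holds the full header lines, while @mapped holds only the captured first words.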
Re: Grep 1st word of strings in array by aaron_baugher (Curate) on Mar 07, 2012 at 22:27 UTC
In addition to the other comments, there are some problems with your regex. This will not match, for instance, because the # character isn't matched by either \w or \s: `use Modern::Perl; my $line = '>foo# bar'; if ( $line =~ /^>(\w+)\s(.+)$/ ) { say "Found word $1"; } else { say "No match."; }`
So it depends on what you mean by a "word." If you mean just characters matched by \w, then the opposite of that is \W. If you mean "non-whitespace," you should be using \S to match that. Also, the + quantifier is greedy, so there's no need to follow it with an opposite match in the first place. These will work just fine, and be much more legible:
`$line =~ /^>(\w+)/; # capture as many consecutive \w characters as possible following >`
`$line =~ /^>(\S+)/; # capture as many consecutive non-whitespace chars as possible following >`
Aaron B. My Woefully Neglected Blog, where I occasionally mention Perl.
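To make the \w versus \S distinction concrete, here is a small sketch run against the same sample line, '>foo# bar'. The variable names $strict, $word and $nonspace are only illustrative.

```perl
use strict;
use warnings;

my $line = '>foo# bar';   # sample line from the post above

# \w+ stops at '#', and \s cannot match '#', so the original pattern fails.
my $strict = $line =~ /^>(\w+)\s(.+)$/ ? $1 : undef;

# Greedy captures with nothing required afterwards.
my ($word)     = $line =~ /^>(\w+)/;   # word characters only
my ($nonspace) = $line =~ /^>(\S+)/;   # everything up to the first whitespace

print "original pattern: ", (defined $strict ? $strict : 'no match'), "\n";
print "\\w+ capture: $word\n";
print "\\S+ capture: $nonspace\n";
```

Here \w+ yields "foo" and \S+ yields "foo#", so the right choice depends on whether punctuation belongs to the name.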
Re^2: Grep 1st word of strings in array by Marshall (Canon) on Mar 08, 2012 at 00:03 UTC
We don't know for sure here, but I suspect it is highly likely that the FASTA format is being used. The OP should give us some input and output lines to make things clear. The BioPerl modules might also be appropriate.
Re: Grep 1st word of strings in array by kcott (Archbishop) on Mar 07, 2012 at 15:08 UTC
Try the `first()` function in List::Util. -- Ken
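For reference, a hedged sketch of how `first()` from List::Util behaves. The sample @DNA lines are invented, since the original input was not posted. Note that `first()` stops at the first matching element, so it finds one header, not all of them.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical sample data; the original input was not posted.
my @DNA = ('ACGT', '>seq1 Homo sapiens', 'TTGACA', '>seq2 Mus musculus');

# first() returns the first element for which the block is true,
# or undef if no element qualifies.
my $first_header = first { /^>/ } @DNA;

# Extract the first word from that header, if one was found.
my ($name) = defined $first_header ? $first_header =~ /^>(\w+)/ : ();

print "first header: $first_header\n" if defined $first_header;
print "first name:   $name\n"         if defined $name;
```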
Re^2: Grep 1st word of strings in array by Marshall (Canon) on Mar 07, 2012 at 17:32 UTC
I may have misunderstood the OP's intent. But it appears to me that a simple regex to get the first "word" after the ">" is all that is required.
Re^3: Grep 1st word of strings in array by kcott (Archbishop) on Mar 07, 2012 at 18:00 UTC
You could be right. Some sample input and expected output would be helpful. -- Ken
Re: Grep 1st word of strings in array by Marshall (Canon) on Mar 07, 2012 at 17:09 UTC
I looked at your code, deleted the comments, and with an adjustment of whitespace we have this: `my @seqnames; my @sequences; foreach $line(@DNA) { if ($line =~ /^>/) { push(@seqnames, $line); } else { push(@sequences, $line); } }`
It's hard for me to see how both arrays could wind up being "blank" (no entries). Perhaps this is what you want, but I'm not sure: `my @seqnames; foreach my $line (@DNA) { if ($line =~ /^>(\w+)\s/) { push(@seqnames, $1); } }`
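One detail worth checking with the second loop: the pattern `/^>(\w+)\s/` requires whitespace after the word, so a chomped header that consists of the name alone is skipped. A small sketch, assuming hypothetical, already-chomped input lines (the real data was not posted); @with_space and @without_space are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical, already-chomped lines; the original input was not posted.
my @DNA = ('>seq1 Homo sapiens', 'ACGT', '>seq3', 'GGCC');

my (@with_space, @without_space, @sequences);
foreach my $line (@DNA) {
    if ($line =~ /^>/) {
        # Requires whitespace after the word, so '>seq3' is not captured.
        push @with_space, $1 if $line =~ /^>(\w+)\s/;
        # The word alone is enough, so '>seq3' is captured too.
        push @without_space, $1 if $line =~ /^>(\w+)/;
    }
    else {
        push @sequences, $line;
    }
}

print "with \\s:    @with_space\n";
print "without \\s: @without_space\n";
print "sequences:  @sequences\n";
```

If the lines still carry their trailing newline, `\s` matches that newline and both patterns capture the same names.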
Re: Grep 1st word of strings in array by tobyink (Canon) on Mar 09, 2012 at 11:33 UTC
Something like: `my @firstwords = map { /^>(\w+)/ ? $1 : '#ERR#' } @seqnames;`
In the above, if the string doesn't match `/^>(\w+)/`, then the string "#ERR#" is used as the first word.
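A short sketch of what the placeholder buys you, using invented header strings in @seqnames: the output stays the same length as the input, so positions line up and bad entries are easy to count. The name @bad_positions is illustrative.

```perl
use strict;
use warnings;

# Hypothetical input; the second entry deliberately lacks the leading '>'.
my @seqnames = ('>seq1 Homo sapiens', 'no marker here', '>seq2 Mus musculus');

# One output element per input element; non-matching strings become '#ERR#'.
my @firstwords = map { /^>(\w+)/ ? $1 : '#ERR#' } @seqnames;

# Because positions are preserved, failures can be traced back by index.
my @bad_positions = grep { $firstwords[$_] eq '#ERR#' } 0 .. $#firstwords;

print "first words:   @firstwords\n";
print "bad positions: @bad_positions\n";
```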