imtakinbioinformatic has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a question that hopefully is pretty simple. Below I am creating an array of headers, but I want to make an array that only keeps the 1st word of each element. (Each element in the array starts with a > because they are different contigs.) I tried both ways that are commented out below, but I'm just getting an empty array with both of them. Tips? Thanks!

foreach $line(@DNA) {
    # if ($line =~/^>(\w+)\s(.+)$/) {
    if($line =~/^>/){
        push(@seqnames, $line);
        #@firstword = grep(/^>(\w+)\s(.+)/,@segnames);
    }
    else {push(@sequences, $line);}
}

Replies are listed 'Best First'.
Re: Grep 1st word of strings in array
by NetWallah (Canon) on Mar 07, 2012 at 15:31 UTC
    You have probably misunderstood where the results of a regular expression match end up.

    In your case, the contents of $line do NOT change after a RE match attempt.

    If the match succeeds, the captured data is placed in $1, $2, and so on.

    So, instead of pushing $line, try

    if ($line =~/^>(\w+)\s(.+)$/) { push(@seqnames, $1); }
    Alternatively, you could do something like:
    if (my ($firstword,$rest_of_it) = $line =~/^>(\w+)\s(.+)$/) { push @seqnames,$firstword; }
    which is more readable, IMHO. Note the use of "my" with (parens), to provide a LIST context for the match results.
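To make the point about context concrete, here is a minimal sketch. The header line is a made-up example, not the OP's data, and `$matched` is a name introduced only for illustration.

```perl
use strict;
use warnings;

my $line = '>contig1 length=120';   # hypothetical header line

# Scalar context: the match returns true/false, not the captured text.
my $matched = $line =~ /^>(\w+)\s(.+)$/;

# List context (note the parens around the variables): the match
# returns the list of captures, which is assigned to the variables.
my ($firstword, $rest_of_it) = $line =~ /^>(\w+)\s(.+)$/;

print "scalar: $matched\n";
print "list:   $firstword | $rest_of_it\n";
```

Without the parens, the variable only records whether the match succeeded.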

    Update: I just noticed from your commented-out 'grep' that you apparently expect the captures from the RE inside 'grep' to be placed in the results.
    This is not the case. To achieve that, you should use 'map' - which TRANSFORMS data. 'grep' merely FILTERS data.
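A small sketch of the difference, using invented sample lines rather than the OP's real data. It relies on the fact that a match with captures returns those captures in list context, and an empty list when it fails.

```perl
use strict;
use warnings;

# Hypothetical sample lines, made up for illustration.
my @seqnames = ('>contig1 length=120', '>contig2 length=98', 'ACGTACGT');

# grep FILTERS: it returns the original elements for which the match succeeds.
my @filtered = grep { /^>(\w+)\s(.+)/ } @seqnames;

# map TRANSFORMS: the block's return value replaces each element.
# Here a successful match returns its capture; a failed match returns
# an empty list, so the non-header line drops out.
my @firstword = map { /^>(\w+)/ } @seqnames;

print "grep: @filtered\n";
print "map:  @firstword\n";
```

Note also that the commented-out line in the question reads from @segnames, while the loop fills @seqnames. Without `use strict`, Perl treats the misspelled name as a separate, empty array.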

    P.S. Welcome to the monastery.

                 All great truths begin as blasphemies.
                       ― George Bernard Shaw, writer, Nobel laureate (1856-1950)

Re: Grep 1st word of strings in array
by aaron_baugher (Curate) on Mar 07, 2012 at 22:27 UTC

    In addition to the other comments, there are some problems with your regex. This will not match, for instance, because the # character isn't matched by either \w or \s:

    #!/usr/bin/env perl
    use Modern::Perl;

    my $line = '>foo# bar';
    if( $line =~ /^>(\w+)\s(.+)$/ ){
        say "Found word $1";
    } else {
        say "No match.";
    }

    So it depends on what you mean by a "word." If you mean just characters matched by \w, then the opposite of that is \W. If you mean "non-whitespace," you should be using \S to match that. Also, that match is greedy, so there's no need to follow it with an opposite match in the first place. These will work just fine, and be much more legible:

    $line =~ /^>(\w+)/; # capture as many consecutive \w characters as possible following >
    $line =~ /^>(\S+)/; # capture as many consecutive non-whitespace chars as possible following >
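To see how the two patterns differ in practice, here is a quick sketch. The header is invented for the example; real contig names may or may not contain punctuation like this.

```perl
use strict;
use warnings;

my $line = '>contig-1|v2 sample description';   # hypothetical header

# \w+ stops at the first character that is not a letter, digit, or underscore.
my ($w) = $line =~ /^>(\w+)/;

# \S+ stops only at whitespace.
my ($s) = $line =~ /^>(\S+)/;

print "\\w+ captured: $w\n";
print "\\S+ captured: $s\n";
```

Which one is right depends on whether identifiers in the data can contain characters such as `-` or `|`.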

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      We don't know for sure here, but I suspect that it is highly likely that the FASTA format is being used.

      The OP should give us some sample input and output lines to make things clear. I suspect that the BioPerl modules might also be appropriate.

Re: Grep 1st word of strings in array
by kcott (Archbishop) on Mar 07, 2012 at 15:08 UTC
      I may have misunderstood the OP's intent. But it appears to me that a simple regex to get the first "word" after the ">" is all that is required.

        You could be right. Some sample input and expected output would be helpful.

        -- Ken

Re: Grep 1st word of strings in array
by Marshall (Canon) on Mar 07, 2012 at 17:09 UTC
    I looked at your code, deleted the comments and with an adjustment of whitespace we have this:
    my @seqnames;
    my @sequences;
    foreach $line(@DNA) {
        if ($line =~ /^>/) {
            push(@seqnames, $line);
        }
        else {
            push(@sequences, $line);
        }
    }
    It's hard for me to see how both arrays could wind up being "blank" (no entries).

    Perhaps this is what you want, but I'm not sure:

    my @seqnames;
    foreach $line(@DNA) {
        if ($line =~ /^>(\w+)\s/) {
            push(@seqnames, $1);
        }
    }
Re: Grep 1st word of strings in array
by tobyink (Canon) on Mar 09, 2012 at 11:33 UTC

    Something like:

    my @firstwords = map { /^>(\w+)/ ? $1 : '#ERR#' } @seqnames;

    In the above, if the string doesn't match /^>(\w+)/ then the string "#ERR#" is used as the first word.
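A hedged sketch of that behaviour on made-up input, together with one possible variation that drops non-matching lines instead of marking them. The array names `@with_placeholder` and `@matches_only` are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical input: two well-formed headers and one malformed line.
my @seqnames = ('>contig1 length=120', 'no marker here', '>contig2 length=98');

# Placeholder version: the output has one entry per input line.
my @with_placeholder = map { /^>(\w+)/ ? $1 : '#ERR#' } @seqnames;

# Variation: return an empty list so non-matching lines are dropped.
my @matches_only = map { /^>(\w+)/ ? $1 : () } @seqnames;

print "@with_placeholder\n";
print "@matches_only\n";
```

The placeholder form keeps the result the same length as the input, which may matter if the names are later paired with other arrays by index.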