hgraf has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, first of all, I am a real beginner; this is the first time I have ever tried this. So, please, be gentle! :) I need to take some information out of these files, which contain data about many different proteins. Someone came up with this code:
#!/usr/bin/perl
use warnings;

while (defined($line = <>)) {
    chomp( $line );
    if ($line =~ /^>(\S+)\s*(.*)/ ) {
        $id = $1;
        $description = $2;
        print "id = $id\n";
        print "description = $description\n";
    }
}
Now, what I need it to do is add more strings. I need two extra strings. 1st: What exactly is ($line =~ /^>(\S+)\s*(.*)/ ) doing? 2nd: How can I add more $ to it, such as accession numbers? Thanks!!

Re: some bioinformatics
by anneli (Pilgrim) on Oct 01, 2011 at 05:02 UTC

    Hi hgraf;

    That's a regular expression, and it's probably the most important bit of your code. The bits in parens are the capture groups; they're $1, $2, ...

    \S+ means 1 or more non-whitespace character(s), and \s* is 0 or more whitespace. It depends on the format of your data, but repeated (\S+)\s*(\S+)\s* sections may be enough.
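    A minimal sketch of the capture-group idea, assuming the header holds an ID, one extra whitespace-delimited field, and free text. The helper name parse_header and the sample header line are made up for illustration; the real layout depends on your data. Note that it uses \s+ (at least one whitespace character) between the two (\S+) groups, so a single token cannot be split across both groups by backtracking.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a header line into an ID, a second
# whitespace-delimited field, and whatever text remains.
sub parse_header {
    my ($line) = @_;
    # ^>      a literal '>' at the start of the line
    # (\S+)   capture 1: one or more non-whitespace characters
    # \s+     one or more whitespace characters
    # (\S+)   capture 2: the next non-whitespace token
    # \s*     zero or more whitespace characters
    # (.*)    capture 3: the rest of the line (may be empty)
    if ( $line =~ /^>(\S+)\s+(\S+)\s*(.*)/ ) {
        return ( $1, $2, $3 );
    }
    return;    # empty list when the line does not match
}

# Made-up header line, for illustration only.
my ( $id, $second, $rest ) = parse_header('>seq1 ABC123 some example protein');
print "id = $id\n";
print "second = $second\n";
print "rest = $rest\n";
```

    Each pair of parentheses fills the next numbered variable in order, so adding a field means adding one more group in the right position.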

Re: some bioinformatics
by Marshall (Canon) on Oct 01, 2011 at 05:47 UTC
    It sounds like what you have is a FASTA file.

    Search CPAN for "fasta" and you will get pages of CPAN modules that have something to do with the FASTA format!

    I wrote one FASTA parser. Use it if you want. I did this to demo one particular type of parsing technique. There are many other ways to write code that parses a FASTA file.
    I don't recommend that you use my code because I think that there is more general purpose code in the way of a CPAN module or "library function" that you can use.
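
    For readers who want to see the general shape of such a parser, here is a minimal sketch in plain Perl. It is not the parser mentioned above; the function name read_fasta, the hash keys, and the sample data are all assumptions made for illustration, and it ignores edge cases that a CPAN module would handle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle and return a
# list of hash references with the keys id, description and sequence.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while ( defined( my $line = <$fh> ) ) {
        chomp $line;
        if ( $line =~ /^>(\S+)\s*(.*)/ ) {
            # Header line: start a new record.
            push @records, { id => $1, description => $2, sequence => '' };
        }
        elsif ( @records && $line =~ /\S/ ) {
            # Sequence line: strip whitespace, append to the current record.
            $line =~ s/\s+//g;
            $records[-1]{sequence} .= $line;
        }
    }
    return @records;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = ">seq1 first example protein\nMKT\nAYI\n>seq2\nGGA\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
for my $rec ( read_fasta($fh) ) {
    print "$rec->{id} | $rec->{description} | $rec->{sequence}\n";
}
close $fh;
```

    To read a real file instead, open it by name and pass that filehandle to the same function.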

    The basic formula for success here is:
    Explain what you want to do in terms of:
    a) the data that you have now and
    b) the information that you want to produce.
    And then show
    c) your Perl code so far.

    I highly recommend trying to understand how to use the BioPerl modules. But in any event, you have not shown either (a) or (b) above. So it is not possible to discuss (c).

Re: some bioinformatics
by Cristoforo (Curate) on Oct 01, 2011 at 20:04 UTC
    I commented on a similar problem here and here. If the module discussed, Bio::SeqIO, doesn't solve your problem, maybe this one would, Bio::Seq.
Re: some bioinformatics
by pvaldes (Chaplain) on Oct 01, 2011 at 08:55 UTC
    ($line =~ /^>(\S+)\s*(.*)/ ); # what is exactly doing?

    This line searches for: the beginning of a line, followed by a literal '>', followed by at least one non-whitespace character (captured for later reuse as the ID), optionally followed by one or more whitespace characters, optionally followed by anything else. That remainder is captured too, so you can assign it to a second variable later (the description). The whitespace and the description block are optional.

    i.e., this:

    >blablabla blobloblo #OK
    >blabla #OK
    >b #OK
       >blabla #NOT, note the whitespace before
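
    A quick way to check which lines the pattern accepts is to run it over a few samples. This is only a sketch: the helper name is_header and the sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the pattern treats the line as a header.
sub is_header {
    my ($line) = @_;
    return $line =~ /^>(\S+)\s*(.*)/ ? 1 : 0;
}

# Made-up sample lines: with and without the leading '>', with leading
# whitespace, and with a space right after the '>'.
for my $line ( '>blablabla blobloblo', '>b', '   >blabla', 'blabla', '> blabla' ) {
    printf "%-24s %s\n", "'$line'", is_header($line) ? 'matches' : 'no match';
}
```

    Only the lines that start with '>' followed immediately by a non-whitespace character are accepted.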

    Now, what I need it to do is, add more strings. I need two extra strings.

    my $first_extra_string = 'I am a cow';
    my $second_extra_string = 'moo too';

    How can I add more $ to it, such as accession numbers?

    my $adding_more_dollar_signs = "I am rocowfeller, I have a lot of \$\$\$\$\$!";

    Maybe you should explain better what you want to do exactly?

    something like this?

    ($line =~ /^>(\S+)\s*(.*)\s*(.*)\s*(.*)/ );
    my $daisy_cow = $3;
    my $shawn_the_sheep = $4;
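
    One caveat with stacking several (.*) groups: the first one is greedy and takes the rest of the line, so the later groups match only the empty string. The sketch below shows this and one possible alternative that uses (\S+) for single-token fields. The header line and the idea that the second token is an accession number are assumptions for illustration; the right pattern depends on the actual data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up header: ID, then a hypothetical accession field, then free text.
my $line = '>seq1 ABC123 example protein';

# With several (.*) groups, the first one swallows the rest of the line,
# so the third and fourth captures end up empty.
my @greedy = $line =~ /^>(\S+)\s*(.*)\s*(.*)\s*(.*)/;
print "greedy: [", join( '] [', @greedy ), "]\n";

# Using (\S+) for each single-token field leaves only the remainder
# of the line to the final (.*).
my @fields = $line =~ /^>(\S+)\s+(\S+)\s*(.*)/;
print "fields: [", join( '] [', @fields ), "]\n";
```

    In list context a successful match returns the captured groups, which is a convenient way to collect them without naming $1, $2, and so on one by one.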
Re: some bioinformatics
by Anonymous Monk on Oct 01, 2011 at 02:33 UTC

    Hi,

    As a start, tell us exactly what your starting data looks like, and what bits of it you wish to extract.

    J.C.

Re: some bioinformatics
by Anonymous Monk on Oct 01, 2011 at 02:06 UTC
    (he says without looking up) Have you read perlintro?