yuvraj_ghaly has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script that calculates the molecular weight of a protein. I just want to change the header that the script prints. I want only this much of the header:


>gi|226694487|sp|Q1DF98.2|ATPF_MYXXD

instead of


>gi|226694487|sp|Q1DF98.2|ATPF_MYXXD RecName: Full=ATP synthase subunit b; AltName: Full=ATP synthase F(0) sector subunit b; AltName: Full=ATPase subunit I; AltName: Full=F-type ATPase subunit b; Short=F-ATPase subunit b

This is the Perl script:

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

for my $file (@ARGV) {
    open my $fh, '<:encoding(UTF-8)', $file;
    my $input = join q{}, <$fh>;
    close $fh;
    while ( $input =~ /(^>.*?\w?)$([^>]*)/smxg ) {
        my $name = $1;
        my $seq  = $2;
        $seq =~ s/\n//smxg;
        my $mass = calc_mass($seq);
        print "$name, Molecular weight: $mass\n";
    }
}

sub calc_mass {
    my $a = shift;
    my @a = ();
    my $x = length $a;
    @a = split q{}, $a;
    my $b = 0;
    my %data = (
        A=>88,  R=>173, D=>132, N=>131, C=>120,
        E=>146, Q=>145, G=>74,  H=>154, I=>130,
        L=>130, K=>145, M=>198, F=>164, P=>114,
        S=>104, T=>118, W=>203, Y=>180, V=>116,
        X=>0,   U=>0,   Z=>0
    );
    for my $i (@a) {
        $b += $data{$i};
    }
    my $c = sprintf("%0.2f", $b - (18.01528 * ($x - 1)));
    return $c;
}

Replies are listed 'Best First'.
Re: Molecular weight of Protein
by BrowserUk (Patriarch) on Aug 07, 2013 at 05:07 UTC

    You want the bit before the word "RecName:"?

    ...
    my $header = <$fh>;
    $header =~ s[ RecName.+$][]; ## Delete everything starting with the space before RecName

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Molecular weight of Protein
by 2teez (Vicar) on Aug 07, 2013 at 05:40 UTC

    You could also use a positive lookahead:

    $_ = '>gi|226694487|sp|Q1DF98.2|ATPF_MYXXD RecName: Full=ATP synthase subunit b; AltName: Full=ATP synthase F(0) sector subunit b; AltName: Full=ATPase subunit I; AltName: Full=F-type ATPase subunit b; Short=F-ATPase subunit b';
    print $1 if /(.+?)(?= RecName)/;
    NOTE: I am only showing the use of positive lookahead, using the OP's header as the example.

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
Re: Molecular weight of Protein
by bioinformatics (Friar) on Aug 07, 2013 at 06:13 UTC
    You could always split on the whitespace as well...
    my ($short_name, $rest) = split(' ', $name, 2);
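    A minimal sketch of the split approach, in case it helps. The helper name split_header is hypothetical and the sample header is shortened from the OP's. When split assigns to a list and no LIMIT is given, Perl uses a limit one larger than the number of variables, so the second variable would hold only the next word ("RecName:"). An explicit LIMIT of 2 keeps the whole description in one piece.

```perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA header into its ID and its free-text description.
sub split_header {
    my ($header) = @_;

    # The ' ' pattern splits on runs of whitespace; a LIMIT of 2 leaves
    # everything after the first run of whitespace in the second field.
    my ( $id, $description ) = split ' ', $header, 2;
    return ( $id, $description );
}

# Sample header, shortened from the one in the question.
my $name = '>gi|226694487|sp|Q1DF98.2|ATPF_MYXXD RecName: Full=ATP synthase subunit b; AltName: Full=ATPase subunit I';

my ( $short_name, $rest ) = split_header($name);
print "$short_name\n";
```

    If only the short name is needed, the second variable can simply be dropped.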

    Bioinformatics
Re: Molecular weight of Protein
by Laurent_R (Canon) on Aug 07, 2013 at 06:51 UTC

    Or everything up to the first whitespace:

    my ($header) = /^(\S+)/;
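    For what it is worth, here is a rough sketch of how the "everything up to the first whitespace" idea might slot into the OP's loop. The helper name short_header and the tiny in-line FASTA record are assumptions made for illustration. The record-matching regex is copied from the original script, and the mass calculation is left out to keep the example short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the header up to (not including) the first
# whitespace, or the header unchanged if it contains nothing to capture.
sub short_header {
    my ($header) = @_;
    my ($id) = $header =~ /^(\S+)/;
    return defined $id ? $id : $header;
}

# Stand-in for the file contents that the original script reads (made-up sequence).
my $input = <<'FASTA';
>gi|226694487|sp|Q1DF98.2|ATPF_MYXXD RecName: Full=ATP synthase subunit b
MKLT
AAG
FASTA

# Same record-matching regex as in the original script.
while ( $input =~ /(^>.*?\w?)$([^>]*)/smxg ) {
    my $name = short_header($1);    # trim the header before printing it
    my $seq  = $2;
    $seq =~ s/\n//smxg;
    print "$name, sequence length: ", length($seq), "\n";
}
```

    In the original script the only change would be the line that assigns $name; the call to calc_mass and the print statement can stay as they are.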