Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a Swissprot flat file and I want to parse the DE rows so that the whole description comes out on one single line. I have written some code for that, but it does not seem to handle the entries whose DE text spans multiple lines. I am also a bit new to Perl. Here is the structure of the file I have to parse:

    AB AAAA_BBBBB
    DE AC2-(EC 2.7.00.1) (Adaptor-associated
    DE protein 1).
    //
    ID CCCCC_DDDDD
    DE Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine
    DE binding protein) (p35BP).
    PR AAAAAAAAAAAAAAAAAAAAAAA.
    //
    ID RRRRR_GGGGG
    AC Q6Q8;
    DE Serine/threonine-aaaaaa kinase (Tyrosine
    DE kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain
    DE binding protein) (p35BP).
    PR xxxxxxxxxxxxxx.
    CD zzzzzzzzzzzzzz.
    //

Thanks,
KK

Replies are listed 'Best First'.
Re: swissprot parsing DE
by Corion (Patriarch) on May 31, 2007 at 06:57 UTC

    You don't show us the code you have so it's a bit hard to give you concrete advice on how to change your code. I would approach the problem with something like the following code:

    #!/usr/bin/perl -w
    use strict;

    my $last_tag = '';
    my $buffer   = '';
    while (<DATA>) {
        if (! /^(..)\s+(.*)$/) {
            warn "Malformed line >$_<";
            next;
        };
        my ($tag, $content) = ($1, $2);
        if ($tag ne $last_tag) {
            # Tag changed: print what was collected for the previous tag
            print "Collected: $last_tag, $buffer\n" if $last_tag ne '';
            ($last_tag, $buffer) = ($tag, $content);
        } else {
            # Same tag again: append the continuation line
            $buffer .= " $content";
        };
    };
    # Print the final tag, which no later line flushes
    print "Collected: $last_tag, $buffer\n" if $last_tag ne '';
    __DATA__
    AB AAAA_BBBBB
    DE AC2-(EC 2.7.00.1) (Adaptor-associated
    DE protein 1).
    //
    ID CCCCC_DDDDD
    DE Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine
    DE binding protein) (p35BP).
    PR AAAAAAAAAAAAAAAAAAAAAAA.
    //
    ID RRRRR_GGGGG
    AC Q6Q8;
    DE Serine/threonine-aaaaaa kinase (Tyrosine
    DE kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain
    DE binding protein) (p35BP).
    PR xxxxxxxxxxxxxx.
    CD zzzzzzzzzzzzzz.
    //

    Please only put your code and data in between <code>..</code> tags.

Re: swissprot parsing DE
by monkey_boy (Priest) on May 31, 2007 at 08:47 UTC
    This is one of the areas where BioPerl can be useful; have a look at Bio::SeqIO.
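
    As a rough illustration of the Bio::SeqIO route, here is a minimal sketch. It assumes BioPerl is installed and that the input is a complete Swiss-Prot file; the abbreviated sample in the question would probably not parse, since it lacks a proper ID line and sequence. The sub name print_descriptions and the file name in the usage comment are placeholders, not anything from the original post. As far as I know, desc() returns the DE text of an entry as a single string.

```perl
use strict;
use warnings;

# Print "entry name <TAB> description" for every entry in a Swiss-Prot file.
# Bio::SeqIO is loaded lazily so that this script still compiles on a
# machine without BioPerl; it must be installed to actually parse a file.
sub print_descriptions {
    my ($path) = @_;
    require Bio::SeqIO;
    my $in = Bio::SeqIO->new( -file => $path, -format => 'swiss' );
    while ( my $seq = $in->next_seq ) {
        # desc() holds the description (DE) text of the entry.
        my $desc = $seq->desc;
        $desc = '' unless defined $desc;
        printf "%s\t%s\n", $seq->display_id, $desc;
    }
    return;
}

# Usage: perl de_lines.pl uniprot_sprot.dat   (file name is a placeholder)
print_descriptions( $ARGV[0] ) if @ARGV;
```

    The advantage over a hand-rolled regex is that the parser already knows about the other line types, so the same loop can be extended to accessions or features later.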



    This is not a Signature...
Re: swissprot parsing DE
by BrowserUk (Patriarch) on May 31, 2007 at 08:37 UTC

    Maybe?

    #! perl -slw
    use strict;

    $/ = '//';
    while( <DATA> ) {
        s[(?:\A|\n)\S{2}\s*][ ]smg;
        s[^\s+][];
        print;
    }
    __DATA__
    AB AAAA_BBBBB
    DE AC2-(EC 2.7.00.1) (Adaptor-associated
    DE protein 1).
    //
    ID CCCCC_DDDDD
    DE Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine
    DE binding protein) (p35BP).
    PR AAAAAAAAAAAAAAAAAAAAAAA.
    //
    ID RRRRR_GGGGG
    AC Q6Q8;
    DE Serine/threonine-aaaaaa kinase (Tyrosine
    DE kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain
    DE binding protein) (p35BP).
    PR xxxxxxxxxxxxxx.
    CD zzzzzzzzzzzzzz.
    //

    Gives

    C:\test>junk
    AAAA_BBBBB AC2-(EC 2.7.00.1) (Adaptor-associated protein 1).
    CCCCC_DDDDD Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine binding protein) (p35BP). AAAAAAAAAAAAAAAAAAAAAAA.
    RRRRR_GGGGG Q6Q8; Serine/threonine-aaaaaa kinase (Tyrosine kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain binding protein) (p35BP). xxxxxxxxxxxxxx. zzzzzzzzzzzzzz.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: swissprot parsing DE
by FunkyMonk (Bishop) on May 31, 2007 at 08:54 UTC
    $/ = '//'; # input record separator
    print join( " ", grep { s/^DE\s+// } split /\n/ ), "\n" while <DATA>;

    Output:

    AC2-(EC 2.7.00.1) (Adaptor-associated protein 1).
    Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine binding protein) (p35BP).
    Serine/threonine-aaaaaa kinase (Tyrosine kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain binding protein) (p35BP).

    is how I'd do it;)
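
    For anyone who finds the one-liner too dense, the same idea can be spelled out. This is only a sketch under a few assumptions: every record ends with "//" on its own line, the sample is read from an in-memory filehandle instead of DATA, and the sub name de_per_record is made up for the example.

```perl
use strict;
use warnings;

# Two records in the layout shown in the question.
my $sample = <<'END';
ID CCCCC_DDDDD
DE Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine
DE binding protein) (p35BP).
PR AAAAAAAAAAAAAAAAAAAAAAA.
//
ID RRRRR_GGGGG
AC Q6Q8;
DE Serine/threonine-aaaaaa kinase (Tyrosine
DE kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain
DE binding protein) (p35BP).
PR xxxxxxxxxxxxxx.
CD zzzzzzzzzzzzzz.
//
END

# Return one joined DE string per "//"-terminated record.
sub de_per_record {
    my ($fh) = @_;
    local $/ = "//\n";    # each read now returns a whole record
    my @descriptions;
    while ( my $record = <$fh> ) {
        chomp $record;    # removes the trailing "//\n"
        # Capture the text that follows each leading "DE" tag.
        my @parts = $record =~ /^DE[ \t]+(.*)$/mg;
        push @descriptions, join( ' ', @parts ) if @parts;
    }
    return @descriptions;
}

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @de = de_per_record($fh);
close $fh;

print "$_\n" for @de;
```

    Using local on $/ keeps the changed record separator confined to the sub, so other reads in a larger script are not affected.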

Re: swissprot parsing DE
by citromatik (Curate) on May 31, 2007 at 10:13 UTC

    If I understand your problem, you want to convert

    AB AAAA_BBBBB
    DE AC2-(EC 2.7.00.1) (Adaptor-associated
    DE protein 1).
    //

    into:

    AB AAAA_BBBBB
    DE AC2-(EC 2.7.00.1) (Adaptor-associated protein 1).

    (i.e., collapse all the lines beginning with the same "tag" within each "//"-separated block).

    If this is what you intend, try this script:

    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = "//\n";    # read one "//"-terminated block at a time

    while (my $block = <DATA>){
        chomp $block;
        my @tags;   ## keep the order of the tags
        my %vals;
        foreach my $line (split /\n/, $block){
            # Skip lines that do not look like "TAG value"
            next unless $line =~ /^(..)\s+(.+)$/;
            my ($tag,$val) = ($1,$2);
            if (!defined $vals{$tag}){
                push @tags,$tag;
                $vals{$tag}=$val;
            }else{
                $vals{$tag}.=" $val";
            }
        }
        for my $tag (@tags){
            print "$tag\t$vals{$tag}\n";
        }
        print "//\n";
    }
    __DATA__
    AB AAAA_BBBBB
    DE AC2-(EC 2.7.00.1) (Adaptor-associated
    DE protein 1).
    //
    ID CCCCC_DDDDD
    DE Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine
    DE binding protein) (p35BP).
    PR AAAAAAAAAAAAAAAAAAAAAAA.
    //
    ID RRRRR_GGGGG
    AC Q6Q8;
    DE Serine/threonine-aaaaaa kinase (Tyrosine
    DE kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain
    DE binding protein) (p35BP).
    PR xxxxxxxxxxxxxx.
    CD zzzzzzzzzzzzzz.
    //

    Outputs:

    AB AAAA_BBBBB
    DE AC2-(EC 2.7.00.1) (Adaptor-associated protein 1).
    //
    ID CCCCC_DDDDD
    DE Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine binding protein) (p35BP).
    PR AAAAAAAAAAAAAAAAAAAAAAA.
    //
    ID RRRRR_GGGGG
    AC Q6Q8;
    DE Serine/threonine-aaaaaa kinase (Tyrosine kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain binding protein) (p35BP).
    PR xxxxxxxxxxxxxx.
    CD zzzzzzzzzzzzzz.
    //

    Hope this helps!

    citromatik