in reply to swissprot parsing DE
If I understand your problem, you want to convert
AB AAAA_BBBBB
DE AC2-(EC 2.7.00.1) (Adaptor-associated
DE protein 1).
//
into:
AB AAAA_BBBBB
DE AC2-(EC 2.7.00.1) (Adaptor-associated protein 1).
(i.e., within each "//"-separated block, collapse all the lines that begin with the same "tag" into a single line).
If this is what you intend, try this script:
#!/usr/bin/perl
use strict;
use warnings;

$/ = "//\n";    # read one "//"-terminated block at a time

while (my $block = <DATA>){
    chomp $block;    # strips the trailing "//\n"
    my @tags;        ## keep the order of the tags
    my %vals;
    foreach my $line (split /\n/, $block){
        # skip lines that don't match, otherwise $1 and $2 keep stale values
        next unless $line =~ /(..)\s+(.+)$/;
        my ($tag,$val) = ($1,$2);
        if (!defined $vals{$tag}){
            push @tags,$tag;
            $vals{$tag}=$val;
        }else{
            $vals{$tag}.=" $val";
        }
    }
    for my $tag (@tags){
        print "$tag\t$vals{$tag}\n";
    }
    print "//\n";
}

__DATA__
AB AAAA_BBBBB
DE AC2-(EC 2.7.00.1) (Adaptor-associated
DE protein 1).
//
ID CCCCC_DDDDD
DE Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine
DE binding protein) (p35BP).
PR AAAAAAAAAAAAAAAAAAAAAAA.
//
ID RRRRR_GGGGG
AC Q6Q8;
DE Serine/threonine-aaaaaa kinase (Tyrosine
DE kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain
DE binding protein) (p35BP).
PR xxxxxxxxxxxxxx.
CD zzzzzzzzzzzzzz.
//
Outputs:
AB AAAA_BBBBB
DE AC2-(EC 2.7.00.1) (Adaptor-associated protein 1).
//
ID CCCCC_DDDDD
DE Serine/threonine-protein kinase (EC 2.7.99.1) (Tyrosine binding protein) (p35BP).
PR AAAAAAAAAAAAAAAAAAAAAAA.
//
ID RRRRR_GGGGG
AC Q6Q8;
DE Serine/threonine-aaaaaa kinase (Tyrosine kinase 1) (Apoptosis-associated tyrosine kinase) (AATYK) (Brain binding protein) (p35BP).
PR xxxxxxxxxxxxxx.
CD zzzzzzzzzzzzzz.
//
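If you would rather run this against a real file than the inline sample data, here is a rough sketch of the same idea wrapped in a subroutine. The name collapse_record is just something I made up for illustration (it is not from any module), and I am assuming your records always end with "//" on a line of its own. It only reads input when you pass file names on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# collapse_record() is a hypothetical helper, not part of any module.
# Takes one record (without the trailing "//") and returns one
# "TAG<tab>value" string per tag, in order of first appearance.
sub collapse_record {
    my ($block) = @_;
    my (@tags, %vals);
    for my $line (split /\n/, $block) {
        # Two-character tag, whitespace, value; skip anything else.
        next unless $line =~ /^(\S\S)\s+(.+?)\s*$/;
        my ($tag, $val) = ($1, $2);
        if (exists $vals{$tag}) {
            $vals{$tag} .= " $val";    # continuation of a tag already seen
        } else {
            push @tags, $tag;          # remember first-seen order
            $vals{$tag} = $val;
        }
    }
    return map { "$_\t$vals{$_}" } @tags;
}

# Only read input when file names were given on the command line.
if (@ARGV) {
    local $/ = "//\n";                 # one record per read
    while (my $block = <>) {
        chomp $block;                  # drop the "//\n" terminator
        my @lines = collapse_record($block);
        next unless @lines;            # ignore empty chunks
        print "$_\n" for @lines;
        print "//\n";
    }
}
```

You would call it along the lines of "perl collapse.pl input.dat > output.dat" (both file names are placeholders). Like the script above, it joins continuation lines with a single space and keeps the tags in the order they first appear.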
Hope this helps!
citromatik