Kenin has asked for the wisdom of the Perl Monks concerning the following question:
I am just a beginner in Perl scripting. Please help me write a script for this problem. My dataset is at the bottom of this post.
This data shows some gene sequences of bacteria. Normally our task is to analyse anywhere from about 1,000 to 10,000 sequences at a time. Problem: the dataset contains a number of duplicate records with the same ID.
To solve this, I need to write code that eliminates the redundancy and preserves only one copy of each gene sequence (for example, 49329899 should occur only once in the dataset).
The letters on the 2nd and 3rd lines of each record should also be preserved in the dataset. The example below has two records for "gi|49329899".
Thanks in advance
kenin

Data.txt
>gi|49329899|gb|AAT60545.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27
MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG
GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
>gi|49330053|gb|AAT60699.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27
MKSKKFTLLLLSLLLFLPLFLTNFITPNVVLADSQKQDQKIVGYFPSWGIYGRNYQVADIDASKLTHLNY
AFADICWNGKHGNPSTHPDNPNKQTWNCKESGVPLQNKEVPNGTLVLGEPWADVTKSYPGSGTTWEDCDK
>gi|49478343|ref|YP_037789.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27]
MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG
GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
>gi|49329899|gb|AAT60545.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27
MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG
GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
>gi|49478497|ref|YP_034712.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27]
MKSKKFTLLLLSLLLFLPLFLTNFITPNVVLADSQKQDQKIVGYFPSWGIYGRNYQVADIDASKLTHLNY
AFADICWNGKHGNPSTHPDNPNKQTWNCKESGVPLQNKEVPNGTLVLGEPWADVTKSYPGSGTTWEDCDK
YARCGNFGELKRLKAKYPHLKTIISVGGWTWSNRFSDMAADEKTRKVFADSTVDFLREYGFDGVDLDWEY
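
For illustration, here is a minimal sketch of one common way to do what the question asks: remember each ID in a hash and keep only the first record seen for it. It assumes that every record starts with a header line of the form ">gi|NUMBER|..." and that a record runs until the next ">" line. The sub name dedupe_fasta and the file names dedupe.pl and unique.txt are made up for this example; only Data.txt comes from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the input text with only the first record for each gi number kept.
# Assumption: each record begins with a header line like ">gi|NUMBER|...".
sub dedupe_fasta {
    my ($text) = @_;
    my %seen;        # gi number => count of records seen with that ID
    my $keep = 1;    # true while the current record is being kept;
                     # lines before the first header are passed through
    my $out  = '';

    for my $line (split /^/, $text) {    # split into lines, newlines kept
        if ($line =~ /^>/) {
            # Header line: key on the gi number, or on the whole header
            # if it does not match the expected pattern.
            my ($id) = $line =~ /^>gi\|(\d+)\|/;
            $id = $line unless defined $id;
            $keep = !$seen{$id}++;       # true only the first time
        }
        $out .= $line if $keep;          # header and its sequence lines
    }
    return $out;
}

# Runs only when a file name is given on the command line.
if (@ARGV) {
    print dedupe_fasta(join '', <>);
}
```

It could be run as "perl dedupe.pl Data.txt > unique.txt". Reading the whole file into memory should be fine for a few thousand records. Note that this compares only the ID: records with different IDs but identical sequences (such as 49329899 and 49478343 in the sample) are both kept.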

Replies are listed 'Best First'.

Re: How to eliminate redundancy in huge dataset (1,000 - 10,000)
by CountZero (Bishop) on May 07, 2008 at 19:07 UTC

Re: How to eliminate redundancy in huge dataset (1,000 - 10,000)
by pc88mxer (Vicar) on May 07, 2008 at 18:43 UTC

Re: How to eliminate redundancy in huge dataset (1,000 - 10,000)
by salva (Canon) on May 07, 2008 at 18:42 UTC

Re: How to eliminate redundancy in huge dataset (1,000 - 10,000)
by pileofrogs (Priest) on May 07, 2008 at 18:47 UTC