Kenin has asked for the wisdom of the Perl Monks concerning the following question:

Hello people,

I am just a beginner in Perl scripting. Please help me out by writing a script for this query; I have put the dataset at the bottom of this page.

The data shows some gene sequences of bacteria. Normally our task is to analyze sequences ranging from a few 1,000 to 10,000 at a time.

Problem: there are a number of duplicates in the dataset with the same ID.

To solve:

I need to write a script which can eliminate the redundancy and preserve only one copy of each gene sequence (for example, 49329899 should occur only once in the dataset).

The sequence letters present in the 2nd and 3rd lines of each record should also be preserved in the dataset.

The example below has two records for "gi|49329899".

Data.txt:

>gi|49329899|gb|AAT60545.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27
MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG
GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
>gi|49330053|gb|AAT60699.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27
MKSKKFTLLLLSLLLFLPLFLTNFITPNVVLADSQKQDQKIVGYFPSWGIYGRNYQVADIDASKLTHLNY
AFADICWNGKHGNPSTHPDNPNKQTWNCKESGVPLQNKEVPNGTLVLGEPWADVTKSYPGSGTTWEDCDK
>gi|49478343|ref|YP_037789.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27]
MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG
GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
>gi|49329899|gb|AAT60545.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27
MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG
GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
>gi|49478497|ref|YP_034712.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27]
MKSKKFTLLLLSLLLFLPLFLTNFITPNVVLADSQKQDQKIVGYFPSWGIYGRNYQVADIDASKLTHLNY
AFADICWNGKHGNPSTHPDNPNKQTWNCKESGVPLQNKEVPNGTLVLGEPWADVTKSYPGSGTTWEDCDK
YARCGNFGELKRLKAKYPHLKTIISVGGWTWSNRFSDMAADEKTRKVFADSTVDFLREYGFDGVDLDWEY
Thanks in advance, kenin

Replies are listed 'Best First'.
Re: How to eliminate redundancy in huge dataset (1,000 - 10,000)
by CountZero (Bishop) on May 07, 2008 at 19:07 UTC
    No need to check each and every record against a hash! Just save it directly into a hash. Duplicates get conveniently overwritten. To get the result, use the values operator on the hash and you get a nice list of all your unique records.
    use strict;
    my %database;
    while (my $record = <DATA>) {
        $database{(split /\|/, $record, 3)[1]} = $record;
    }
    print values %database;

    __DATA__
    >gi|49329899|gb|AAT60545.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27 MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
    >gi|49330053|gb|AAT60699.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27 MKSKKFTLLLLSLLLFLPLFLTNFITPNVVLADSQKQDQKIVGYFPSWGIYGRNYQVADIDASKLTHLNY AFADICWNGKHGNPSTHPDNPNKQTWNCKESGVPLQNKEVPNGTLVLGEPWADVTKSYPGSGTTWEDCDK
    >gi|49478343|ref|YP_037789.1| chitinase Bacillus thuringiensis serovar konkukian str. 97-27 MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
    >gi|49329899|gb|AAT60545.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27 MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
    >gi|49478497|ref|YP_034712.1| chitinase Bacillus thuringiensis serovar konkukian str. 97-27 MKSKKFTLLLLSLLLFLPLFLTNFITPNVVLADSQKQDQKIVGYFPSWGIYGRNYQVADIDASKLTHLNY AFADICWNGKHGNPSTHPDNPNKQTWNCKESGVPLQNKEVPNGTLVLGEPWADVTKSYPGSGTTWEDCDK YARCGNFGELKRLKAKYPHLKTIISVGGWTWSNRFSDMAADEKTRKVFADSTVDFLREYGFDGVDLDWEY
    Output:
    >gi|49478497|ref|YP_034712.1| chitinase Bacillus thuringiensis serovar konkukian str. 97-27 MKSKKFTLLLLSLLLFLPLFLTNFITPNVVLADSQKQDQKIVGYFPSWGIYGRNYQVADIDASKLTHLNY AFADICWNGKHGNPSTHPDNPNKQTWNCKESGVPLQNKEVPNGTLVLGEPWADVTKSYPGSGTTWEDCDK YARCGNFGELKRLKAKYPHLKTIISVGGWTWSNRFSDMAADEKTRKVFADSTVDFLREYGFDGVDLDWEY
    >gi|49330053|gb|AAT60699.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27 MKSKKFTLLLLSLLLFLPLFLTNFITPNVVLADSQKQDQKIVGYFPSWGIYGRNYQVADIDASKLTHLNY AFADICWNGKHGNPSTHPDNPNKQTWNCKESGVPLQNKEVPNGTLVLGEPWADVTKSYPGSGTTWEDCDK
    >gi|49478343|ref|YP_037789.1| chitinase Bacillus thuringiensis serovar konkukian str. 97-27 MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI
    >gi|49329899|gb|AAT60545.1| chitinase [Bacillus thuringiensis serovar konkukian str. 97-27 MLNKFKFICCTLVIFLLLPLAPFQAQAANNLGSKLLVGYWHNFDNGTGIIKLRDVSPKWDVINVSFGETG GDRSTVEFSPVYGTDAEFKSDISYLKSKGKKVVLSIGGQNGVVLLPDNAAKQRFINSIQSLIDKYGFDGI

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: How to eliminate redundancy in huge dataset (1,000 - 10,000)
by pc88mxer (Vicar) on May 07, 2008 at 18:43 UTC
    This is an easy job for perl. The general idea is:
    my %seen;
    while (<>) {
        chomp;
        my @fields = split('\|', $_);
        if (my $record = $seen{$fields[0]}) {
            ... append @fields to $record ...
        }
        else {
            $seen{$fields[0]} = ...some data structure...
        }
    }

    for my $k (keys %seen) {
        my $record = $seen{$k};
        # process $record
    }
    The reason for the ellipses is that I'm not exactly sure what data you want to parse out, and after looking at your data more carefully, I realize that I should ask some questions about its structure.

    Is each record one long line, or is it four lines consisting of a |-separated header line, two sequence lines, and a blank line?

    Also, do you know if your input is sorted? Even though Perl can handle reading 10K lines into memory, we can optimize the code if we know that the input is already sorted by id.
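
    For instance, if the input were one record per line and already sorted by id, you would not need a hash at all; comparing against the previous id suffices. A rough sketch (the one-record-per-line layout is an assumption, not confirmed):

    my $prev;
    while (<>) {
        my $id = ( split /\|/ )[1];           # gi number, second |-field
        next if defined $prev and defined $id and $id eq $prev;
        print;
        $prev = $id if defined $id;
    }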

    For the following I will assume that each record is 4 lines:
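    A minimal sketch under that assumption (taking the gi number, the second |-separated field of the header, as the dedup key; this is guesswork as far as your real layout goes):

    my %seen;
    while ( my $header = <> ) {
        my @rest;
        push @rest, scalar <> for 1 .. 3;     # two sequence lines + the blank line
        my $id = ( split /\|/, $header )[1];  # gi number from the header
        next if $seen{$id}++;                 # duplicate id: drop the whole record
        print $header, grep { defined } @rest;
    }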

Re: How to eliminate redundancy in huge dataset (1,000 - 10,000)
by salva (Canon) on May 07, 2008 at 18:42 UTC
    Nowadays, a dataset with 1,000 - 10,000 elements is not huge any more!

    The common way to eliminate duplicates in Perl is to use a hash:

    my %seen;
    while (<>) {
        my @parts = split /\|/;
        next if $seen{$parts[1]}++;
        print;
    }

    If the data were huge, then for your particular case, where the key is an integer, you could record the seen entries in a bit vector to reduce memory consumption.
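
    A sketch of that idea with vec() (assuming one record per line, keyed on the integer gi number; a gi around 49 million means the bit vector grows to roughly 6 MB, far smaller than a hash holding every id as a key):

    my $seen = '';                        # bit vector: one bit per possible gi number
    while (<>) {
        my $id = ( split /\|/ )[1];
        next if !defined $id or $id !~ /\A\d+\z/;   # not a |-delimited record
        next if vec( $seen, $id, 1 );     # bit already set: duplicate
        vec( $seen, $id, 1 ) = 1;
        print;
    }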

    And for really huge data sets, an external sort and postprocess algorithm would be a better approach.
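
    For example, with the system sort(1) doing the heavy lifting and a tiny Perl postprocess comparing adjacent keys (again assuming one record per line with the gi number in field 2; GNU sort options shown):

    sort -t'|' -k2,2 Data.txt | perl -ne '
        my $id = (split /\|/)[1];
        print unless defined $prev and $id eq $prev;
        $prev = $id;
    '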

Re: How to eliminate redundancy in huge dataset (1,000 - 10,000)
by pileofrogs (Priest) on May 07, 2008 at 18:47 UTC

    I think you just loop through the data set and keep a hash of the IDs. With each record, you check whether the ID is in the hash: if it is, skip to the next record; if it isn't, write it to the output file and then skip to the next record.
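
    Spelled out, that loop might look like this (the output file name is hypothetical, and as in the snippets above one record per line is assumed):

    use strict;
    use warnings;

    open my $in,  '<', 'Data.txt'   or die "Data.txt: $!";
    open my $out, '>', 'unique.txt' or die "unique.txt: $!";

    my %ids;
    while ( my $record = <$in> ) {
        my $id = ( split /\|/, $record )[1];
        next if defined $id and $ids{$id}++;   # ID already in the hash: skip it
        print {$out} $record;                  # new ID: write the record out
    }
    close $out or die "unique.txt: $!";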

    10,000 records isn't really that big. I used to bulls-eye womp rats with my t-16 back home, and they're not much smaller than 10,000 records. er... I mean, I process batches of 10,000 records all the time.

    --Pileofrogs