how to remove duplicate strings?

heidi has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: how to remove duplicate strings? by GrandFather (Saint) on Oct 30, 2006 at 04:35 UTC
Add strictures (`use strict; use warnings;`) to your code, clean up the issues that creates, then see if the problem remains. As it stands there are a large number of variables initialised (maybe) but unused, and a number of arrays are referenced, but their use is not clear. Your unique test looks fine. Your data reading looks like rubbish. Generate a sample script using __DATA__ to provide the data and show us what you get and what you expect.

DWIM is Perl's answer to Gödel
Re^2: how to remove duplicate strings? by heidi (Sexton) on Oct 30, 2006 at 05:42 UTC
OK, fine. To be very clear, I didn't want to confuse you all with my whole program. The variables you called unused are not unused; I will be using them later when printing the results. All I want to process now is the SEQUENCE, that is, the string (a continuous stretch of letters) that follows the ACCESSIONS line. When I grep it and store it in a separate array, and print the array inside the loop, I get output like this:

`MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR`

When I print the array outside the loop, it either prints the last string alone, or removes the repeated letters from each string and prints a result like this:

`MGDVEKIFCSQHTPNLRAYW
MGDVEKIFCSQHTPNLRAYW
MGDVEKIFCSQHTPNLRAYW
MGDVEKIFCSQHTPNLRAYW
MGDVEKIFCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLARYW
GDVEKIFMCSQHTPNLARYW
GDVEKIFMCSQHTPNLARYW
GDVEKIFMCSQHTPNLARYW
GDVEKIFMCSQHTPNLARYW
GDVFKRIMCSQHTEPNL`

But the result I need is:

`MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR`

And I need all of this as 4 elements in the same array. I hope you got it.
Re: how to remove duplicate strings? by graff (Chancellor) on Oct 30, 2006 at 05:24 UTC
Note that when you do this: `@array = $_;` you are only assigning a single scalar value to @array -- the array now holds only one element, which is the entire content of $_. You might need to look at using split.

Also, I'm a little puzzled about the use of spaces in these two lines: `elsif(/^TITLE/) {$title = (s/ /\n\t\t /g,$_);} elsif(/^ORGANISM/){$org = (s/ /\n\t\t /g,$_);}`

It would be easier and more reliable to do those like this: `elsif(/^(TITLE)\s+(\S.*)/) { $title = "$1\n\t\t $2\n" } elsif(/^(ORGANISM)\s+(\S.*)/) { $org = "$1\n\t\t $2\n" }`

As for handling the "ACCESSIONS" line, if that's where you would want the second code snippet to fit in, it could go like this: `elsif(/^ACCESSIONS\s+(\S.*)/) { my %seen = (); @accessions = grep { $seen{$_}++ == 0 } split /;\s+/, $1; }`

That use of grep with split does effectively the same thing as your second code snippet, but in one line instead of several. Still, as Gramps points out, you haven't really posed the question very well -- there don't seem to be any duplicate strings in your original data sample, I can only guess about how the second code snippet is supposed to fit in with the first one, and there's no way to tell what you're really trying to do with your array(s). Try posting a reply to him that follows his instructions.

(updated to fix code tags, and to make sure %seen was initialized in my last code snippet)
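A minimal, self-contained sketch of the two points above: assigning a scalar replaces the array while `push` appends to it, and `grep` over `split` keeps only the first occurrence of each item. The ACCESSIONS line and its identifiers are made up for illustration and are not taken from the original data.

```perl
use strict;
use warnings;

# Assigning a scalar replaces the whole array; push appends to it.
my @array = ('first');
@array = 'second';       # @array now holds one element: 'second'
push @array, 'third';    # @array now holds 'second', 'third'

# Hypothetical ACCESSIONS line (assumption, not from the original file).
my $line = 'ACCESSIONS   A00001; B00002; A00001; C00003';

my @accessions;
if ( $line =~ /^ACCESSIONS\s+(\S.*)/ ) {
    my %seen;
    # Keep an item only the first time it is encountered.
    @accessions = grep { $seen{$_}++ == 0 } split /;\s+/, $1;
}

print "@array\n";         # second third
print "@accessions\n";    # A00001 B00002 C00003
```

Because `%seen` is declared inside the block, it starts empty for every ACCESSIONS line, so duplicates are removed per line rather than across the whole file.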
Re^2: how to remove duplicate strings? by heidi (Sexton) on Oct 30, 2006 at 06:02 UTC
Hi graff, thanks for the reply. But all I want to process now is the SEQUENCE, that is, the string (a continuous stretch of letters) that follows the ACCESSIONS line. When I grep it and store it in a separate array, and print the array inside the loop, I get output like this:

`MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR`

When I print the array outside the loop, it either prints the last string alone, or removes the repeated letters from each string and prints a result like this:

`MGDVEKIFCSQHTPNLRAYW
MGDVEKIFCSQHTPNLRAYW
MGDVEKIFCSQHTPNLRAYW
MGDVEKIFCSQHTPNLRAYW
MGDVEKIFCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLRAYW
GDVEKIFMCSQHTPNLARYW
GDVEKIFMCSQHTPNLARYW
GDVEKIFMCSQHTPNLARYW
GDVEKIFMCSQHTPNLARYW
GDVEKIFMCSQHTPNLARYW
GDVFKRIMCSQHTEPNL`

But the result I need is:

`MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR`

And I need all of this as 4 elements in the same array. I hope you got it.
Re^3: how to remove duplicate strings? by graff (Chancellor) on Oct 30, 2006 at 06:39 UTC
Well no, I'm not sure that I got it. What is clear is that you did not satisfy GrandFather's request in the first reply, as I hoped you would. So let me make another guess at what you really want. How about this:

`my @arr = ();
while (<PIR>) {
    chomp;
    if ( /^ENTRY/ ) { $entry = $_ }
    elsif ( /^(TITLE)\s+(\S.*)/ ) { $title = "$1\n\t $2" }
    elsif ( /^(ORGANISM)\s+(\S.*)/ ) { $org = "$1\n\t $2" }
    elsif ( /^ACCESSIONS/ ) { $acc = $_ }
    else { push @arr, $_; }
}
print "@arr\n";`

Now, I would assume there should be more code than that, if you really need to do things with $acc, $entry, $org and $title. If you really just want to output an array with those long strings as the elements of the array, the code could be a lot simpler. If there's a chance that one of those long strings might appear more than once in the data file, use those long strings as hash keys instead of array values:

`# simplified version: ignore header stuff:
my %hash;
while (<PIR>) {
    chomp;
    $hash{$_} = undef unless /^(?:ENTRY|TITLE|ORGANISM|ACCESSIONS)\s/;
}
print join " ", keys %hash, "\n";`

Using a hash like that might be a good idea for other reasons: maybe you would want the header values to be associated with each long string. (Hint: some people refer to hashes as "associative arrays".) If so, assign the header strings as the hash value.
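The closing hint (store the header strings as the hash value) could look roughly like the sketch below. It assumes each record starts with an ENTRY line and that any line not beginning with one of the four header keywords is a sequence; the record names and sequences are invented, and the real PIR file may be laid out differently. An in-memory filehandle stands in for `PIR` so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical PIR-like sample; names and sequences are illustrative only.
my $sample = <<'END';
ENTRY      REC1
TITLE      example protein
ORGANISM   example organism
ACCESSIONS A00001
MGDVEKGKKIFIMKC
ENTRY      REC2
TITLE      example protein
ORGANISM   example organism
ACCESSIONS A00002
MGDVEKGKKIFIMKC
ENTRY      REC3
TITLE      example protein
ORGANISM   example organism
ACCESSIONS A00003
GDVEKGKKIFIMKCS
END

open my $pir, '<', \$sample or die "Cannot open in-memory sample: $!";

my %entry_for;    # sequence => ENTRY of the record where it first appeared
my @order;        # unique sequences in first-seen order
my $entry;

while ( my $line = <$pir> ) {
    chomp $line;
    if ( $line =~ /^ENTRY\s+(\S.*)/ ) {
        $entry = $1;
    }
    elsif ( $line =~ /^(?:TITLE|ORGANISM|ACCESSIONS)\s/ ) {
        next;    # header lines are not needed in this sketch
    }
    elsif ( !exists $entry_for{$line} ) {
        $entry_for{$line} = $entry;
        push @order, $line;
    }
}
close $pir;

print "$_ (first seen in $entry_for{$_})\n" for @order;
```

Keeping a separate `@order` array is a design choice: `keys %entry_for` alone would return the unique sequences, but in no guaranteed order.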
Re^4: how to remove duplicate strings? by heidi (Sexton) on Oct 30, 2006 at 09:08 UTC
Re: how to remove duplicate strings? by bobf (Monsignor) on Oct 30, 2006 at 06:04 UTC
If you're trying to get a unique set of protein sequences from a PIR file, I'd suggest using a hash of arrays. Use the sequence as the key and store the accession number (or a hash or array of all attribs, or an object representing the record) in the array.

`$hash{$sequence} = [ $accnum1, $accnum2, ... ];
$hash{$sequence} = [ { ENTRY => ... TITLE => ... ORGANISM => ... ACCESSIONS => ... }, ];`

If you don't need all of the other data in the record, you can use `$hash{$sequence} = $count` to track how many duplicates were observed. Calling keys on the hash returns the unique set of sequences. If there are a very large number of sequences in the input file, you might be better off using a database.
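A small sketch of the hash-of-arrays idea, assuming the records have already been parsed into (accession, sequence) pairs. The pairs and the name `%accessions_for` are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical (accession, sequence) pairs; not taken from the original data.
my @records = (
    [ 'A00001', 'MGDVEKGKK' ],
    [ 'A00002', 'GDVEKGKKI' ],
    [ 'A00003', 'MGDVEKGKK' ],
);

my %accessions_for;    # sequence => [ accessions that carry it ]
for my $record (@records) {
    my ( $accession, $sequence ) = @$record;
    # push autovivifies the array reference on first use of a key.
    push @{ $accessions_for{$sequence} }, $accession;
}

# keys returns each distinct sequence once; the array length is the count.
for my $sequence ( sort keys %accessions_for ) {
    my @accs = @{ $accessions_for{$sequence} };
    printf "%s seen %d time(s): %s\n", $sequence, scalar @accs,
        join( ', ', @accs );
}
```

The array length doubles as the duplicate count, so a separate `$count` is only needed when the accessions themselves are not kept.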
Re: how to remove duplicate strings? by holcapek (Sexton) on Oct 30, 2006 at 11:48 UTC
I like (and prefer and often do) this way: `my @str = qw(aaa bbb ccc ddd eee ddd aaa bbb ccc); my %uniq = (); my @uniq_str = grep { ! $uniq{$_}++ } @str;`
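The same idiom can be wrapped in a small helper so the `%seen` hash never leaks between calls. This is only a sketch; the sub name `unique_strings` is made up here.

```perl
use strict;
use warnings;

# Return the distinct strings from a list, keeping first-seen order.
sub unique_strings {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @str      = qw(aaa bbb ccc ddd eee ddd aaa bbb ccc);
my @uniq_str = unique_strings(@str);

print "@uniq_str\n";    # aaa bbb ccc ddd eee
```

The helper compares whole strings, so each array element must already be one complete sequence. Recent versions of the core List::Util module also export a `uniq` function that should behave the same way for plain strings.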