Re^2: how to remove duplicate strings?

hi graff, thanks for the reply. but,all i want to process now is the SEQUENCE. that is the string(a continous stretch of alphabets)which is next to the accessions line. so when i grep it and store it in a seperate array, and when i print the array (inside the loop) i am getting the output something like this

MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL
+ +MEYLENPKKYIP MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSY
+TAANKNKGIIWGEDTL +MEYLENPKKYIP MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNL
+HGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL +MEYLENPKKYIP MGDVEKGKKIFIMKCSQCHTVE
+MGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL +MEYLENPKKYIP MGDVEK
+GKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL +MEY
+LENPKKYIP GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKN
+KGIIWGED GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNK
+GIIWGED GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKG
+IIWGED GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGI
+IWGED GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGII
+WGED GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAAN
+KSSSSN +KGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE GDVEKGKKIFIMKCSQCHTVEKGG
+SSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSN +KGITWGEDTLMEYLENPKKYI
+PGTKMIFVGIKKKEE GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTG
+QAPGYSYTAANKSSSSN +KGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE GDVEKGKKIFIMK
+CSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSN +KGITWGEDTL
+MEYLENPKKYIPGTKMIFVGIKKKEE GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGA
+AAAAAAARKTGQAPGYSYTAANKSSSSN +KGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE GD
+VFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
[download]

when i print the array outside the loop, either its printing the last string alone, or, removes the repeating alphabets from the string and printin a result like this.

MGDVEKIFCSQHTPNLRAYW MGDVEKIFCSQHTPNLRAYW MGDVEKIFCSQHTPNLRAYW MGDVEKI
+FCSQHTPNLRAYW MGDVEKIFCSQHTPNLRAYW GDVEKIFMCSQHTPNLRAYW GDVEKIFMCSQHT
+PNLRAYW GDVEKIFMCSQHTPNLRAYW GDVEKIFMCSQHTPNLRAYW GDVEKIFMCSQHTPNLRAY
+W GDVEKIFMCSQHTPNLARYW GDVEKIFMCSQHTPNLARYW GDVEKIFMCSQHTPNLARYW GDVE
+KIFMCSQHTPNLARYW GDVEKIFMCSQHTPNLARYW GDVFKRIMCSQHTEPNL
[download]

but the result which i need is:

MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL
+ +MEYLENPKKYIP GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYT
+AANKNKGIIWGED GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQA
+PGYSYTAANKSSSSN +KGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE GDVFKGKRIFIMKCS
+QCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
[download]

and i need all this as 4 elements in the same array. i hope u got it.

Comment on Re^2: how to remove duplicate strings? Select or Download Code

Replies are listed 'Best First'.
Re^3: how to remove duplicate strings? by graff (Chancellor) on Oct 30, 2006 at 06:39 UTC
Well no, I'm not sure that I got it. What is clear is that you did not satisfy GrandFather's request in the first reply, as I hoped you would. So let me make another guess at what you really want. How about this: `my @arr = (); while (<PIR>) { chomp; if( /^ENTRY/ ) { $entry = $_ } elsif ( /^(TITLE)\s+(\S.)/ ) { $title = "$1\n\t $2" } elsif ( /^(ORGANISM)\s+(\S.)/ ) { $org = "$1\n\t $2" } elsif ( /^ACCESSIONS/ ) { $acc = $_ } else { push @arr, $_; } } print "@arr\n";` [download] Now, I would assume there should be more code than that, if you really need to do things with $acc, $entry, $org and $title. If you really just want to output an array with those long strings as the elements of the array, the code could be a lot simpler. If there's a chance that one of those long strings might appear more than once in the data file, use those long strings as hash keys instead of array values: `# simplified version: ignore header stuff: my %hash; while(<PIR>) { chomp; $hash{$_} = undef unless /^(?:ENTRY\|TITLE\|ORGANISM\|ACCESSIONS)\s/; } print join " ", keys %hash, "\n";` [download] Using a hash like that might be a good idea for other reasons: maybe you would want the header values to be associated with each long string. (Hint: some people refer to hashes as "associative arrays".) If so, assign the header strings as the hash value.	[reply] [d/l] [select]
Re^4: how to remove duplicate strings? by heidi (Sexton) on Oct 30, 2006 at 09:08 UTC
hey graff, thank ya, u got my problem rite.i tried writting the code the way u said, and i got the answer, but the problem which i am facing now is, i had to save each element of that array in to a new array and split the characters. to make it clear, the program is now like this. `open (PIR,'/home/guest/sampir.txt'); my @arr = (); while (<PIR>) { chomp; if( /^ENTRY/ ) { $entry = $_ } elsif ( /^(TITLE)\s+(\S.)/ ) { $title = "$1\n\t $2" } elsif ( /^(ORGANISM)\s+(\S.)/ ) { $org = "$1\n\t $2" } elsif ( /^ACCESSIONS/ ) { $acc = $_ } else { push @se, $_; } }` [download] and i tried splitting it up like this `foreach $r(@se) { @y=split(//,$r); }` [download] but am not getting the answer. how to go abt it.?	[reply] [d/l] [select]