in reply to how to remove duplicate strings?

Note that when you do this:
@array = $_;
you are only assigning a single scalar value to @array -- the array now holds only one element, which is the entire content of $_. You might need to look at using split.
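
To make the difference concrete, here is a minimal sketch. The sample string and the choice of whitespace as the delimiter are assumptions for illustration only; your real data may need a different split pattern.

```perl
use strict;
use warnings;

# Hypothetical input line, for illustration only.
$_ = "alpha beta gamma";

my @whole  = $_;              # one scalar assigned to an array: one element
my @pieces = split ' ', $_;   # split on whitespace: one element per word

print scalar(@whole),  "\n";  # prints 1
print scalar(@pieces), "\n";  # prints 3
```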

Also, I'm a little puzzled about the use of spaces in these two lines:

elsif(/^TITLE/) {$title = (s/ /\n\t\t /g,$_); }
elsif(/^ORGANISM/){$org = (s/ /\n\t\t /g,$_);}
It would be easier and more reliable to do those like this:
elsif(/^(TITLE)\s+(\S.*)/) { $title = "$1\n\t\t $2\n" }
elsif(/^(ORGANISM)\s+(\S.*)/) { $org = "$1\n\t\t $2\n" }

As for handling the "ACCESSIONS" line, if that's where you want the second code snippet to fit in, it could go like this:

elsif(/^ACCESSIONS\s+(\S.*)/) {
    my %seen = ();
    @accessions = grep { $seen{$_}++ == 0 } split /;\s+/, $1;
}
That use of grep with split does effectively the same thing as your second code snippet, but in one line instead of several.
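
As a standalone sketch of that grep-with-split idiom: the ACCESSIONS line below is made up for illustration (the IDs are hypothetical), and it assumes the items are separated by a semicolon followed by whitespace.

```perl
use strict;
use warnings;

# Hypothetical ACCESSIONS line; the IDs are invented for this example.
my $line = "ACCESSIONS  A00001; B00002; A00001; C00003";

my @accessions;
if ( $line =~ /^ACCESSIONS\s+(\S.*)/ ) {
    my %seen = ();
    # grep keeps an item only the first time it turns up
    @accessions = grep { $seen{$_}++ == 0 } split /;\s+/, $1;
}

print "@accessions\n";   # prints: A00001 B00002 C00003
```

The order of first appearance is preserved, which you would lose if you only collected the items as hash keys.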

Still, as Gramps points out, you haven't really posed the question very well -- there don't seem to be any duplicate strings in your original data sample, I can only guess about how the second code snippet is supposed to fit in with the first one, and there's no way to tell what you're really trying to do with your array(s). Try posting a reply to him that follows his instructions.

(updated to fix code tags, and to make sure %seen was initialized in my last code snippet)

Re^2: how to remove duplicate strings?
by heidi (Sexton) on Oct 30, 2006 at 06:02 UTC
    hi graff, thanks for the reply. but all i want to process now is the SEQUENCE, that is, the string (a continuous stretch of letters) which comes right after the accessions line. so when i grep it and store it in a separate array, and print the array (inside the loop), i am getting output something like this:
    MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL + +MEYLENPKKYIP MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSY +TAANKNKGIIWGEDTL +MEYLENPKKYIP MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNL +HGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL +MEYLENPKKYIP MGDVEKGKKIFIMKCSQCHTVE +MGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL +MEYLENPKKYIP MGDVEK +GKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTL +MEY +LENPKKYIP GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKN +KGIIWGED GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNK +GIIWGED GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKG +IIWGED GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGI +IWGED GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGII +WGED GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAAN +KSSSSN +KGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE GDVEKGKKIFIMKCSQCHTVEKGG +SSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSN +KGITWGEDTLMEYLENPKKYI +PGTKMIFVGIKKKEE GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTG +QAPGYSYTAANKSSSSN +KGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE GDVEKGKKIFIMK +CSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSN +KGITWGEDTL +MEYLENPKKYIPGTKMIFVGIKKKEE GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGA +AAAAAAARKTGQAPGYSYTAANKSSSSN +KGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE GD +VFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
    when i print the array outside the loop, it either prints the last string alone, or removes the repeated letters from each string and prints a result like this:
    MGDVEKIFCSQHTPNLRAYW MGDVEKIFCSQHTPNLRAYW MGDVEKIFCSQHTPNLRAYW MGDVEKIFCSQHTPNLRAYW MGDVEKIFCSQHTPNLRAYW GDVEKIFMCSQHTPNLRAYW GDVEKIFMCSQHTPNLRAYW GDVEKIFMCSQHTPNLRAYW GDVEKIFMCSQHTPNLRAYW GDVEKIFMCSQHTPNLRAYW GDVEKIFMCSQHTPNLARYW GDVEKIFMCSQHTPNLARYW GDVEKIFMCSQHTPNLARYW GDVEKIFMCSQHTPNLARYW GDVEKIFMCSQHTPNLARYW GDVFKRIMCSQHTEPNL
    but the result which i need is:
    MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
    and i need all this as 4 elements in the same array. i hope u got it.
      Well no, I'm not sure that I got it. What is clear is that you did not satisfy GrandFather's request in the first reply, as I hoped you would.

      So let me make another guess at what you really want. How about this:

      my @arr = ();
      while (<PIR>) {
          chomp;
          if ( /^ENTRY/ ) { $entry = $_ }
          elsif ( /^(TITLE)\s+(\S.*)/ ) { $title = "$1\n\t $2" }
          elsif ( /^(ORGANISM)\s+(\S.*)/ ) { $org = "$1\n\t $2" }
          elsif ( /^ACCESSIONS/ ) { $acc = $_ }
          else { push @arr, $_; }
      }
      print "@arr\n";
      Now, I would assume there should be more code than that, if you really need to do things with $acc, $entry, $org and $title. If you really just want to output an array with those long strings as the elements of the array, the code could be a lot simpler.

      If there's a chance that one of those long strings might appear more than once in the data file, use those long strings as hash keys instead of array values:

      # simplified version: ignore header stuff
      my %hash;
      while (<PIR>) {
          chomp;
          $hash{$_} = undef unless /^(?:ENTRY|TITLE|ORGANISM|ACCESSIONS)\s/;
      }
      print join( " ", keys %hash ), "\n";
      Using a hash like that might be a good idea for other reasons: maybe you would want the header values to be associated with each long string. (Hint: some people refer to hashes as "associative arrays".) If so, assign the header strings as the hash value.
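
      A rough sketch of what "header strings as the hash value" could look like. Everything here is an assumption for illustration: the records are invented, only ENTRY and TITLE are tracked, and the names %by_seq, $entry and $title are hypothetical. Your real file layout may call for something different.

```perl
use strict;
use warnings;

# Hypothetical records in the general shape discussed above.
my @lines = (
    "ENTRY      E1",
    "TITLE      first protein",
    "MGDVEKGKK",
    "ENTRY      E2",
    "TITLE      second protein",
    "GDVFKGKR",
    "ENTRY      E3",
    "TITLE      duplicate of the first",
    "MGDVEKGKK",
);

my ( %by_seq, $entry, $title );
for (@lines) {
    if    ( /^ENTRY\s+(\S.*)/ ) { $entry = $1 }
    elsif ( /^TITLE\s+(\S.*)/ ) { $title = $1 }
    else {
        # Sequence line: the sequence is the key, its headers are the value.
        # ||= keeps the headers from the first time the sequence was seen.
        $by_seq{$_} ||= { entry => $entry, title => $title };
    }
}

for my $seq ( sort keys %by_seq ) {
    print "$seq  $by_seq{$seq}{entry}  $by_seq{$seq}{title}\n";
}
```

      Whether a repeated sequence should keep the first set of headers, the last one, or a list of all of them is a design choice that depends on what you want to do with the data.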
        hey graff, thank ya, u got my problem right. i tried writing the code the way u said, and i got the answer, but the problem i am facing now is that i have to save each element of that array into a new array and split it into characters. to make it clear, the program now looks like this:
        open (PIR,'/home/guest/sampir.txt');
        my @arr = ();
        while (<PIR>) {
            chomp;
            if ( /^ENTRY/ ) { $entry = $_ }
            elsif ( /^(TITLE)\s+(\S.*)/ ) { $title = "$1\n\t $2" }
            elsif ( /^(ORGANISM)\s+(\S.*)/ ) { $org = "$1\n\t $2" }
            elsif ( /^ACCESSIONS/ ) { $acc = $_ }
            else { push @se, $_; }
        }
        and i tried splitting it up like this
        foreach $r(@se) { @y=split(//,$r); }
        but am not getting the answer. how to go abt it.?