heidi has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have written a program to separate strings from a data file. The problem is, I am getting duplicates of the original copy of the strings. I tried removing them, but in vain... Well, here is my program:
    open (PIR,'/home/sampir.txt');
    while (<PIR>) {
        if (/^ENTRY/) {$entry = $_;}
        elsif(/^TITLE/) {$title = (s/ /\n\t\t /g,$_);}
        elsif(/^ORGANISM/){$org = (s/ /\n\t\t /g,$_);}
        elsif(/^ACCESSIONS/){$acc = $_;}
        else {
            @arr = $_;
        }
        if (defined $array2[0]) {
            @array = split('',$arr[0]);
        }
    }
    print @array;
and this is the sample data file:
    ENTRY CCHU #type complete
    TITLE cytochrome c [validated] - human
    ORGANISM #formal_name Homo sapiens #common_name man
    ACCESSIONS A31764; A05676; I55192; A00001
    MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
    ENTRY CCCZ #type complete
    TITLE cytochrome c - chimpanzee (tentative sequence)
    ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
    ACCESSIONS A00002
    GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
    ENTRY CCMQR #type complete
    TITLE cytochrome c - rhesus macaque (tentative sequence)
    ORGANISM #formal_name Macaca mulatta #common_name rhesus macaque
    ACCESSIONS A00003
    GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
    ENTRY CCMKP #type complete
    TITLE cytochrome c - spider monkey
    ORGANISM #formal_name Ateles sp. #common_name spider monkey
    ACCESSIONS A00004
    GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
I referred to perldoc and tried using:
    my @unique = ();
    my %seen = ();
    foreach my $elem ( @array ) {
        next if $seen{ $elem }++;
        push @unique, $elem;
    }
But what it does is remove the letters that repeat within each string. I don't want that to happen; I want all 4 strings (the ones right after each ACCESSIONS line) in an array, without duplicate strings. Please help me out. Thanks.

Replies are listed 'Best First'.
Re: how to remove duplicate strings?
by GrandFather (Saint) on Oct 30, 2006 at 04:35 UTC

    Add strictures (use strict; use warnings;) to your code, clean up the issues that creates, then see if the problem remains.

    As it stands there are a large number of variables initialised (maybe) but unused and a number of arrays are referenced, but their use is not clear. Your unique test looks fine. Your data reading looks like rubbish.

    Generate a sample script using __DATA__ to provide the data and show us what you get and what you expect.


    DWIM is Perl's answer to Gödel
      OK, fine. To be very clear, I didn't want to confuse you all with my whole program. The ones you called unused variables are not unused; I will be using them when printing the results later. So all I want to process now is the SEQUENCE, that is, the string (a continuous stretch of letters) right after the ACCESSIONS line. When I grep it and store it in a separate array, and print the array (inside the loop), I get output something like this:
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
      When I print the array outside the loop, it either prints the last string alone, or removes the repeating letters from each string and prints a result like this:
          MGDVEKIFCSQHTPNLRAYW
          MGDVEKIFCSQHTPNLRAYW
          MGDVEKIFCSQHTPNLRAYW
          MGDVEKIFCSQHTPNLRAYW
          MGDVEKIFCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLARYW
          GDVEKIFMCSQHTPNLARYW
          GDVEKIFMCSQHTPNLARYW
          GDVEKIFMCSQHTPNLARYW
          GDVEKIFMCSQHTPNLARYW
          GDVFKRIMCSQHTEPNL
      But the result I need is:
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
      and I need all of these as 4 elements in the same array. I hope you got it.
Re: how to remove duplicate strings?
by graff (Chancellor) on Oct 30, 2006 at 05:24 UTC
    Note that when you do this:
        @arr = $_;
    you are only assigning a single scalar value to @arr -- the array now holds only one element, which is the entire content of $_. You might need to look at using split.
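
    A minimal sketch of that difference, using made-up sample lines rather than the real file. Assigning inside a loop overwrites the array on every pass, whereas push (one possible alternative, not necessarily what the original program needs) keeps every line:

```perl
use strict;
use warnings;

# Made-up input lines, standing in for lines read from a file.
my @lines = ( "first\n", "second\n", "third\n" );

# Assigning a scalar to an array replaces the whole array each time,
# so only the last value survives the loop.
my @assigned;
for my $line (@lines) {
    @assigned = $line;
}

# push appends to the array, so every value is kept.
my @pushed;
for my $line (@lines) {
    push @pushed, $line;
}

print "after assignment: ", scalar(@assigned), " element(s)\n";
print "after push:       ", scalar(@pushed),   " element(s)\n";
```

    This is consistent with the symptom of seeing only the last string when the array is printed after the loop.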

    Also, I'm a little puzzled about the use of spaces in these two lines:

        elsif(/^TITLE/) {$title = (s/ /\n\t\t /g,$_);}
        elsif(/^ORGANISM/){$org = (s/ /\n\t\t /g,$_);}
    It would be easier and more reliable to do those like this:
        elsif(/^(TITLE)\s+(\S.*)/) { $title = "$1\n\t\t $2\n" }
        elsif(/^(ORGANISM)\s+(\S.*)/) { $org = "$1\n\t\t $2\n" }

    As for handling the "ACCESSIONS" line, if that's where you want the second code snippet to fit in, it could go like this:

        elsif(/^ACCESSIONS\s+(\S.*)/) {
            my %seen = ();
            @accessions = grep { $seen{$_}++ == 0 } split /;\s+/, $1;
        }
    That use of grep with split does effectively the same thing as your second code snippet, but in one line instead of several.
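
    To see the grep/split idiom on its own, here is a small standalone sketch. The input line is hypothetical: it is the first ACCESSIONS line from the sample data with a repeated A05676 added so there is something to remove.

```perl
use strict;
use warnings;

# Hypothetical input: the second A05676 is added purely for illustration.
my $line = "ACCESSIONS A31764; A05676; I55192; A05676; A00001";

my @accessions;
if ( $line =~ /^ACCESSIONS\s+(\S.*)/ ) {
    my %seen;
    # split breaks the captured text on "; " separators;
    # grep keeps an item only the first time it is seen.
    @accessions = grep { $seen{$_}++ == 0 } split /;\s+/, $1;
}

print join( ", ", @accessions ), "\n";
```

    The test compares whole list elements, so it drops repeated accession numbers and never looks at individual characters.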

    Still, as Gramps points out, you haven't really posed the question very well -- there don't seem to be any duplicate strings in your original data sample, I can only guess about how the second code snippet is supposed to fit in with the first one, and there's no way to tell what you're really trying to do with your array(s). Try posting a reply to him that follows his instructions.

    (updated to fix code tags, and to make sure %seen was initialized in my last code snippet)

      Hi graff, thanks for the reply. But all I want to process now is the SEQUENCE, that is, the string (a continuous stretch of letters) right after the ACCESSIONS line. When I grep it and store it in a separate array, and print the array (inside the loop), I get output something like this:
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
      When I print the array outside the loop, it either prints the last string alone, or removes the repeating letters from each string and prints a result like this:
          MGDVEKIFCSQHTPNLRAYW
          MGDVEKIFCSQHTPNLRAYW
          MGDVEKIFCSQHTPNLRAYW
          MGDVEKIFCSQHTPNLRAYW
          MGDVEKIFCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLRAYW
          GDVEKIFMCSQHTPNLARYW
          GDVEKIFMCSQHTPNLARYW
          GDVEKIFMCSQHTPNLARYW
          GDVEKIFMCSQHTPNLARYW
          GDVEKIFMCSQHTPNLARYW
          GDVFKRIMCSQHTEPNL
      But the result I need is:
          MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
          GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
          GDVEKGKKIFIMKCSQCHTVEKGGSSSSKHKTGPNLHGLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
          GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
      and I need all of these as 4 elements in the same array. I hope you got it.
        Well no, I'm not sure that I got it. What is clear is that you did not satisfy GrandFather's request in the first reply, as I hoped you would.

        So let me make another guess at what you really want. How about this:

            my @arr = ();
            while (<PIR>) {
                chomp;
                if( /^ENTRY/ ) { $entry = $_ }
                elsif ( /^(TITLE)\s+(\S.*)/ ) { $title = "$1\n\t $2" }
                elsif ( /^(ORGANISM)\s+(\S.*)/ ) { $org = "$1\n\t $2" }
                elsif ( /^ACCESSIONS/ ) { $acc = $_ }
                else { push @arr, $_; }
            }
            print "@arr\n";
        Now, I would assume there should be more code than that, if you really need to do things with $acc, $entry, $org and $title. If you really just want to output an array with those long strings as the elements of the array, the code could be a lot simpler.

        If there's a chance that one of those long strings might appear more than once in the data file, use those long strings as hash keys instead of array values:

            # simplified version: ignore header stuff:
            my %hash;
            while(<PIR>) {
                chomp;
                $hash{$_} = undef unless /^(?:ENTRY|TITLE|ORGANISM|ACCESSIONS)\s/;
            }
            print join( " ", keys %hash ), "\n";
        Using a hash like that might be a good idea for other reasons: maybe you would want the header values to be associated with each long string. (Hint: some people refer to hashes as "associative arrays".) If so, assign the header strings as the hash value.
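
        A rough sketch of that idea, under a few assumptions: the records are two entries copied from the sample data in this thread (with the chimpanzee record repeated on purpose to create a duplicate), an in-memory filehandle stands in for the real file, and the names %header_for and @sequences are invented for this example. The extra @sequences array is only there because hash keys come back in no particular order.

```perl
use strict;
use warnings;

# Records copied from the sample data in this thread. The chimpanzee
# record is repeated on purpose so that there is a duplicate to collapse.
my $data = <<'END_DATA';
ENTRY CCCZ #type complete
TITLE cytochrome c - chimpanzee (tentative sequence)
ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS A00002
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
ENTRY CCMKP #type complete
TITLE cytochrome c - spider monkey
ORGANISM #formal_name Ateles sp. #common_name spider monkey
ACCESSIONS A00004
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLFGSSSSSSSSSSR
ENTRY CCCZ #type complete
TITLE cytochrome c - chimpanzee (tentative sequence)
ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS A00002
GDVEKGKKIFIMKCSQCHTVEKGSSSKHKSSSTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGED
END_DATA

# An in-memory filehandle stands in for the real PIR file.
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my %header_for;    # sequence => header text of the record it came from
my @sequences;     # unique sequences, in order of first appearance
my $header = '';

while ( my $line = <$fh> ) {
    chomp $line;
    next unless length $line;    # skip blank lines

    if ( $line =~ /^ENTRY\s/ ) {
        $header = $line;         # a new record starts here
    }
    elsif ( $line =~ /^(?:TITLE|ORGANISM|ACCESSIONS)\s/ ) {
        $header .= "\n$line";    # collect the remaining header lines
    }
    else {
        # Anything else is treated as a sequence line.
        push @sequences, $line unless exists $header_for{$line};
        $header_for{$line} = $header;
    }
}
close $fh;

print scalar(@sequences), " unique sequences\n";
print "$_\n" for @sequences;
```

        When the same sequence turns up in more than one record, this sketch keeps only the header of the last record seen; storing a list of headers per sequence would preserve all of them.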
Re: how to remove duplicate strings?
by bobf (Monsignor) on Oct 30, 2006 at 06:04 UTC

    If you're trying to get a unique set of protein sequences from a PIR file, I'd suggest using a hash of arrays. Use the sequence as the key and store the accession number (or a hash or array of all attribs, or an object representing the record) in the array.

        $hash{$sequence} = [ $accnum1, $accnum2, ... ];

        $hash{$sequence} = [
            { ENTRY      => ...
              TITLE      => ...
              ORGANISM   => ...
              ACCESSIONS => ...
            },
        ];
    If you don't need all of the other data in the record, you can use $hash{$sequence} = $count to track how many duplicates were observed.

    Calling keys on the hash returns the unique set of sequences.

    If there are a very large number of sequences in the input file, you might be better off using a database.
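
    A small sketch of the hash-of-arrays approach, assuming the records have already been parsed into (accession, sequence) pairs. The accession labels and shortened sequences below are placeholders invented for the example, not real PIR data:

```perl
use strict;
use warnings;

# Hypothetical (accession, sequence) pairs; the sequences are shortened
# placeholders, not real PIR data.
my @records = (
    [ 'ACC1', 'GDVEKGKK' ],
    [ 'ACC2', 'GDVFKGKR' ],
    [ 'ACC3', 'GDVEKGKK' ],    # same sequence as ACC1
);

my %accessions_for;            # sequence => [ accession numbers ]
for my $record (@records) {
    my ( $accnum, $sequence ) = @$record;
    # Autovivification creates the array reference on first use.
    push @{ $accessions_for{$sequence} }, $accnum;
}

# keys returns each distinct sequence once; the array length is the
# number of records that shared it.
for my $sequence ( sort keys %accessions_for ) {
    my @accs = @{ $accessions_for{$sequence} };
    printf "%s seen %d time(s): %s\n",
        $sequence, scalar @accs, join( ", ", @accs );
}
```

    The length of each array gives the duplicate count for free, so a separate counter is only needed when the accession numbers themselves are not.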

Re: how to remove duplicate strings?
by holcapek (Sexton) on Oct 30, 2006 at 11:48 UTC
    I like (and prefer and often do) this way:
        my @str = qw(aaa bbb ccc ddd eee ddd aaa bbb ccc);
        my %uniq = ();
        my @uniq_str = grep { ! $uniq{$_}++ } @str;