PerlMonks
finding and deleting repeated lines in a file

by Becky (Beadle)
on Jun 19, 2002 at 14:05 UTC

Becky has asked for the wisdom of the Perl Monks concerning the following question:

I have a file of peptide sequences, 1 per line, and need to make sure there are no repeated sequences. How do I compare each sequence to every other sequence to check if there is a repeat, and delete the other occurrence(s) if found?

Replies are listed 'Best First'.
•Re: finding and deleting repeated lines in a file
by merlyn (Sage) on Jun 19, 2002 at 14:22 UTC
Re: finding and deleting repeated lines in a file
by insensate (Hermit) on Jun 19, 2002 at 14:16 UTC
    Is the file too big to slurp into an array? If not...something like the following from the Cookbook should work:
    open(FILE, "peptides.txt") or die "Cannot open peptides.txt: $!";
    @peptides = <FILE>;
    %seen     = ();
    @unique   = grep { ! $seen{$_}++ } @peptides;
    Your unique peptides will now be in @unique
    -Jason
      Hi, thanks for that - it sort of works, but only seems to find the first repeat. For example if my file looks like: TRHF 0 KJKF 0 DFJE 0 DJFE 0 KSLR 0 SKJR 0 HGDF 0 TRHF 0 KJKF 0 it will remove the second 'TRHF 0' but stop there and leave all the other repeats. Any ideas? I'm quite new to perl so be gentle!
        Where are the newlines? Just out of curiosity...the example you posted above didn't have any...could you post a more accurate input example? Thanks,
        Jason
Re: finding and deleting repeated lines in a file
by bronto (Priest) on Jun 19, 2002 at 17:04 UTC

    Uhm... let me think... file is huge, so it is not advisable to keep all the non-repeated sequences inside a hash, or your memory will blow.

    I think it would be a good idea to use a message digest on the values to keep memory occupation low (at the expense of some CPU, of course); this also exposes you to the risk of two different sequences having the same digest. The probability should be low, but not zero...

    I have no data to test with and I have never used Digest::MD5 directly, so take the following code as a suggestion: it may suit your needs or be completely wrong. I'm looking at the documentation at http://search.cpan.org/doc/GAAS/Digest-MD5-2.20/MD5.pm

    use strict ;
    use warnings ;   # ...if you have Perl 5.6

    # read from stdin, spit data to stdout (just to keep it simple)
    use Digest::MD5 qw(md5_hex) ;   # or one of the other md5* functions

    my %digests ;
    while (my $line = <STDIN>) {
        my $dig = md5_hex($line) ;
        if (exists $digests{$dig}) {
            print STDERR "WARNING: duplicated checksum $dig for line $line" ;
            print STDERR "WARNING: skipping $line" ;
            $digests{$dig}++ ;   # you can use this to count repetitions
        } else {
            $digests{$dig} = 0 ;
            print $line ;
        }
    }

    If this not what you need, I hope that at least this can help you to reach the better solution.

    --bronto
      A suggestion to avoid possible collisions: if CPU time is not a concern, using several different algorithms to create multiple fingerprints decreases the probability of a collision to astronomically low figures. Even using the same algorithm on the original string and on a variant created by some transliteration rule will exponentially decrease the probability of a collision.

      Makeshifts last the longest.
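      That idea can be sketched in Perl; note the pairing of MD5 with SHA-1 (via Digest::SHA) is my own choice of second algorithm, not something from the thread:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Digest::SHA qw(sha1_hex);   # second, independent algorithm

# Two independent digests joined into one fingerprint: a false
# positive now requires MD5 *and* SHA-1 to collide on the same pair.
sub fingerprint {
    my ($line) = @_;
    return md5_hex($line) . ':' . sha1_hex($line);
}

# Filter usage:
#   my %seen;
#   while (my $line = <STDIN>) {
#       print $line unless $seen{ fingerprint($line) }++;
#   }
```

      The hash keys get longer (32 + 40 hex digits), but memory per distinct line stays bounded no matter how long the sequences are.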

Re: finding and deleting repeated lines in a file
by jmcnamara (Monsignor) on Jun 19, 2002 at 14:54 UTC

    If the file isn't huge then you could do this:

        perl -i.bak -ne 'print unless $h{$_}++' file

    The -i.bak gives you an effective in-place edit with a back-up in file.bak.

    --
    John.

Re: finding and deleting repeated lines in a file
by caedes (Pilgrim) on Jun 19, 2002 at 14:51 UTC
    If the file is really huge and you can't simply slurp the lines into a hash as the keys, then you have two choices, depending on whether or not the order of lines in the file is important.

    If order isn't important, you can sort the file and then make a script that just removes consecutive repeated lines. This will save memory if your sort can work within the memory limitations.
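    A minimal sketch of that approach, written as a sub for clarity; it assumes the file has already been sorted externally (e.g. with the system sort):

```perl
use strict;
use warnings;

# After an external sort, duplicates sit on adjacent lines, so only
# the previous line needs to be remembered: memory use is constant
# no matter how big the file is.
sub squeeze_consecutive {
    my @unique;
    my $prev;
    for my $line (@_) {
        push @unique, $line unless defined $prev && $line eq $prev;
        $prev = $line;
    }
    return @unique;
}
```

    Run it as a filter after the sort (`sort peptides.txt | perl squeeze.pl > unique.txt`); the trade-off is that the original line order is lost.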

    If order is important and the lines are quite large you can use Digest::MD5 to create a checksum of each line and then use the array of checksums to compare all the lines of the file. This will save some memory.

    I risk repeating what's already been said, but I think the previous posts were dancing around the issue.

Re: finding and deleting repeated lines in a file
by DamnDirtyApe (Curate) on Jun 19, 2002 at 14:23 UTC

      Which could be condensed to:

      perl -ne 'print unless $seen{$_}++' in.txt > out.txt

      and if you fancy replacing it directly, whilst making a backup file:

      perl -i.bak -ne 'print unless $seen{$_}++' in.txt

      --
      Steve Marvell

Re: finding and deleting repeated lines in a file
by hotshot (Prior) on Jun 19, 2002 at 14:23 UTC
    Read your file line by line into a hash, with the sequences as the hash keys. For each line you read, check whether the key already exists in the hash; if it does, continue with the next line. It should be something like this:
    my ($line, %hash);
    while ($line = <FILE>) {
        if ($line =~ /(your_sequence_format)/) {
            if (! exists($hash{$1})) {
                $hash{$1} = $line;
            }
        }
    }
    # now you can print the hash back to your file


    Thanks.

    Hotshot
      That will, of course, change the order.

      --
      Steve Marvell

        Not if you also save in the hash the row number where the sequence was first found, and sort by that number before printing.

        Thanks.

        Hotshot
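        That order-preserving scheme can be sketched like this (a minimal illustration; the sub name is mine, not from the thread):

```perl
use strict;
use warnings;

# Remember the first row number at which each sequence appears, then
# emit the sequences sorted by that number to restore the input order.
sub unique_preserving_order {
    my %first_seen;
    my $n = 0;
    for my $line (@_) {
        $first_seen{$line} = $n unless exists $first_seen{$line};
        $n++;
    }
    return sort { $first_seen{$a} <=> $first_seen{$b} } keys %first_seen;
}
```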
Re: finding and deleting repeated lines in a file
by marvell (Pilgrim) on Jun 19, 2002 at 14:34 UTC
    How long are the sequences and how long is the file? Can we have a sample line, for those of us who don't know what a peptide sequence looks like?

    It sounds like a quick hash check keyed on the sequence itself. If the data source is large, it'll need a tie and maybe some MD5 action.

    --
    Steve Marvell
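    A rough sketch of the tied-hash idea; the choice of DB_File and the scratch file name 'seen.db' are my assumptions, not something specified above:

```perl
use strict;
use warnings;
use DB_File;   # ties the hash to an on-disk Berkeley DB file

# %seen lives on disk instead of in RAM, so memory use stays flat
# no matter how many distinct sequences the file contains.
tie my %seen, 'DB_File', 'seen.db'
    or die "Cannot tie seen.db: $!";

while (my $line = <STDIN>) {
    print $line unless $seen{$line}++;
}

untie %seen;
unlink 'seen.db';   # drop the scratch file when done
```

    This keeps the original order, at the cost of one disk lookup per line; adding the MD5 step from the other replies would shrink the on-disk keys for very long sequences.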

Node Type: perlquestion [id://175686]
Approved by VSarkiss