
Hmm, let me think... the file is huge, so it is not advisable to keep every distinct sequence in a hash, or you will run out of memory.

I think it would be a good idea to store a message digest of each line instead of the line itself, which keeps memory use low (at the expense of some CPU, of course). This also exposes you to the risk of two different sequences having the same digest: the probability should be low, but it is not zero...
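
For a rough sense of the saving, here is a small sketch. The 1000-character line is made up purely for illustration; the point is only that the digest has a fixed size however long the line is, and that the three exported functions of Digest::MD5 differ in how many bytes they return.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex md5_base64);

# Made-up input line, only here to have something long to hash.
my $sequence = "ACGT" x 250;

# The digest size does not depend on the length of the input.
printf "line:       %4d bytes\n", length $sequence;
printf "md5_hex:    %4d bytes\n", length md5_hex($sequence);
printf "md5_base64: %4d bytes\n", length md5_base64($sequence);
printf "md5:        %4d bytes\n", length md5($sequence);
```

The hex form is 32 characters, the base64 form 22, and the raw binary form 16, so `md5` gives the smallest keys if you never need to print them. Each hash entry also carries its own overhead in Perl, so the real saving depends on how long your lines are.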

I have no data to test with and I have never used Digest::MD5 directly, so take the following code as a suggestion: it may suit your needs or be completely wrong. I'm looking at the documentation at http://search.cpan.org/doc/GAAS/Digest-MD5-2.20/MD5.pm

    use strict ;
    use warnings ; # ...if you have Perl 5.6

    # read from stdin, spit data to stdout
    # (just to keep it simple)
    use Digest::MD5 qw(md5_hex) ; # or one of md5*'s

    my %digests ;

    while (my $line = <STDIN>) {
      my $dig = md5_hex($line) ;
      if (exists $digests{$dig}) {
        print STDERR "WARNING: duplicated checksum $dig for line $line\n",
                     "WARNING: skipping $line\n" ;
        $digests{$dig}++ ; # you can use this to count repetitions
      } else {
        $digests{$dig} = 0 ;
        print $line ;
      }
    }
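
If the collision risk worries you, one possible variation is to remember where each line was first seen and, on a digest match, go back and compare the real text. This is only a sketch under a few assumptions: the input is a regular, seekable file rather than STDIN, and `dedup_file` is a name I made up for the example. It also stores an array of offsets per digest, which costs more memory than the plain counter in the loop above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: copies the distinct lines of $path to the filehandle
# $out and returns how many lines were kept. A digest match is confirmed
# by re-reading the earlier line from disk before the new line is dropped.
sub dedup_file {
    my ($path, $out) = @_;

    # One handle reads sequentially, the other jumps back to earlier lines.
    open my $in,   '<', $path or die "Cannot open $path: $!";
    open my $peek, '<', $path or die "Cannot open $path: $!";

    my %offsets;   # binary digest => offsets of distinct lines with that digest
    my $kept = 0;

    while (1) {
        my $pos  = tell($in);      # where the next line starts
        my $line = <$in>;
        last unless defined $line;

        my $dig    = md5($line);
        my $is_dup = 0;
        for my $off (@{ $offsets{$dig} || [] }) {
            seek($peek, $off, 0) or die "Cannot seek in $path: $!";
            my $earlier = <$peek>;
            if (defined $earlier && $earlier eq $line) {
                $is_dup = 1;
                last;
            }
        }
        next if $is_dup;           # a true repeat, not just a collision

        push @{ $offsets{$dig} }, $pos;
        print {$out} $line;
        $kept++;
    }

    close $in;
    close $peek;
    return $kept;
}
```

You would call it as `dedup_file('sequences.txt', \*STDOUT)`, where the file name is just a placeholder. The extra seek happens only when a digest has been seen before, so a file with few repeats pays very little for the check.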

If this is not what you need, I hope it can at least help you reach a better solution.

--bronto

In reply to Re: finding and deleting repeated lines in a file by bronto
in thread finding and deleting repeated lines in a file by Becky
