Sorry, I guess I should've used $records_seen{$key}=1; or mentioned why I put $line in there. So here goes: For your application, you really only care whether the record exists or not, so the value stored in the hash doesn't matter. If the key exists, then we've seen the record before; otherwise we haven't.
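Here's a minimal sketch of that "seen hash" idiom in isolation, using made-up sample data (not your records) just to show that the stored value is a throwaway placeholder and only the key's existence matters:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The value (1) is just a placeholder; exists() on the key is what counts.
my %records_seen;
for my $record ('apple', 'banana', 'apple') {   # illustrative sample data
    next if exists $records_seen{$record};      # skip duplicates
    $records_seen{$record} = 1;                 # any true value would do
    print "$record\n";
}
```

This prints each distinct item once, in first-seen order.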
I used $line because I was originally going to mention that you could use the hash to store the record contents for each record you kept. That way, you could print them in any order you wanted after reading the file, rather than writing them as you find them. However, that would only make things a bit more complex for no added value.
If you wanted to just store all the records and then print them in a particular order, you would change the program to something like this:
#!/usr/bin/perl
use strict;
use warnings;

# Records separated by blank line
$/ = "\n\n";

# Records we've seen before
my %records_seen;

while (my $line = <DATA>) {
    # Get list of key fields for record
    my @key_fields = (split /\s+/, $line)[ -2, -8, -14, -5, -11, -17 ];

    # Create composite key for record
    my $key = join("|", @key_fields);

    # Store record if we haven't seen it
    if (! exists $records_seen{$key}) {
        $records_seen{$key} = $line;
    }
}

# Print them in order
for my $key (sort keys %records_seen) {
    print $records_seen{$key};
}

__DATA__
A  83  GLU A  90  GLU
A 163  ARG A  83  ARG
A 222  ARG A   5  ARG

A 229  ALA A 115  ALA
A 257  ALA A 118  ALA
A 328  ASP A  95  ASP

A  83  GLU A  90  GLU
A 163  ARG A  83  ARG
A 222  ARG A   5  ARG

A  83  GLU B  90  GLU
A 163  ARG B  83  ARG
A 222  ARG B   5  ARG
Here, we just store the records in the loop without printing anything. Then, after reading the entire file, we sort the keys and print the corresponding records. Running this program generates the same output as the earlier version.
The advantage of this method is that you can sort the records and print them in any order you like. The disadvantage is that since all the records are held in memory, you can run out of memory (or degrade other programs' performance) for very large files. For my machine and usual workload, processing files less than about a gigabyte is just fine. Your mileage may vary...
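For example, if you wanted the records ordered numerically by one of the key fields rather than by plain string comparison of the composite key, you could supply a custom sort block. This is just a sketch with made-up keys and record values, not your actual data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative composite keys (first field is a residue number);
# sort numerically on that first field instead of lexically.
my %records_seen = (
    '90|5'  => "record A\n",
    '115|9' => "record B\n",
    '8|163' => "record C\n",
);

for my $key (sort { (split /\|/, $a)[0] <=> (split /\|/, $b)[0] }
             keys %records_seen) {
    print $records_seen{$key};
}
```

A plain `sort keys` would put '115|9' before '8|163' (string order), while the numeric comparison above prints the 8, 90 and 115 records in that order.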
...roboticus
In reply to Re^3: delete redundant data
by roboticus
in thread delete redundant data
by nurulnad