in reply to delete redundant data

nurulnad:

Whenever you want to detect duplicates and/or unique values, one thing that should come to mind is a hash, since each key in a hash is unique. So in your case, you can just build a key for each incoming record. If the key does not exist in the hash, you process the record; otherwise you ignore it. Then be sure to enter the key into the hash.
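
Here's a minimal, self-contained sketch of that idiom (deduplicating on the first whitespace-separated field, just as an illustration; the field choice and sample data are made up for this sketch):

#!/usr/bin/perl
use strict;
use warnings;

my %seen;
while (my $record = <DATA>) {
    my ($key) = split /\s+/, $record;   # key = first field (pick whatever fields matter to you)
    next if exists $seen{$key};         # already seen this key; ignore the record
    print $record;                      # first time we've seen it: process (here, just print) it
    $seen{$key} = 1;                    # remember the key for later records
}

__DATA__
apple 1
banana 2
apple 3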

Here's a quick[1] modification to your program to use a hash to detect and eliminate duplicate values:

#!/usr/bin/perl
use strict;
use warnings;

# Records separated by blank line
$/ = "\r\n\r\n";

# Records we've seen before
my %records_seen;

while (my $line = <DATA>) {
    # Get list of key fields for record
    my @key_fields = (split /\s+/, $line)[ -2, -8, -14, -5, -11, -17 ];

    # Create composite key for record
    my $key = join("|", @key_fields);

    # Process the record if we haven't seen it before
    if (! exists $records_seen{$key}) {
        print $line;
    }

    # Remember that we've processed the record
    $records_seen{$key} = $line;
}

__DATA__
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?

A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?

A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?

A 83 GLU B 90 GLU^?
A 163 ARG B 83 ARG^?
A 222 ARG B 5 ARG^?

Running this gives us:

Roboticus@Roboticus-PC /robo/Desktop
$ perl 856427.pl
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?

A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?

Roboticus@Roboticus-PC /robo/Desktop
$
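
In case the negative-index slice looks cryptic, here's a small one-off check (not part of the program above) showing which fields it picks out of the first record; note that only the numbers end up in the composite key:

my $record = "A 83 GLU A 90 GLU^?\nA 163 ARG A 83 ARG^?\nA 222 ARG A 5 ARG^?\n";
my @fields     = split /\s+/, $record;                  # 18 whitespace-separated fields
my @key_fields = @fields[ -2, -8, -14, -5, -11, -17 ];  # same slice as the program uses
print join("|", @key_fields), "\n";                     # prints: 5|83|90|222|163|83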

Here are some assorted notes on your code, to explain why there are so many differences between your code and mine:

...roboticus

Update: Added this footnote:

[1] Apparently not *very* quick, as baxy77bax posted a similar reply some 30 minutes earlier as I was composing this node...

Update: Corrected "If the key exists in the hash" to "If the key does not exist in the hash" in the first paragraph.

Re^2: delete redundant data
by nurulnad (Acolyte) on Aug 21, 2010 at 17:01 UTC
    Thank you so much. Your explanation is so simple and clear. Thanks for your time and effort.
Re^2: delete redundant data
by nurulnad (Acolyte) on Aug 22, 2010 at 00:20 UTC
    Hey, sorry, I thought I got it, until I tried to write it again without referring to your script and found that I am confused about this line:
    # Remember that we've processed the record
    $records_seen{$key} = $line;
    From your explanation I guess this is to "enter the key into the hash". But from what I understand, "$records_seen{$key} = $line" puts $line into the hash so that hash remembers when it sees that line next. However, I only want to compare the numericals and not the whole $line. I'm not understanding it right, right? I hope you can explain.

      nurulnad:

      Sorry, I guess I should've used $records_seen{$key} = 1; or mentioned why I put $line in there. So here goes: for your application, you really only care whether the record exists or not, so the value stored in the hash doesn't matter. If the key exists, then we've seen the record before; otherwise we haven't.
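
      In other words, any value will do; only the exists test on the key matters. A minimal sketch of the same loop body with 1 as the stored value:

      if (! exists $records_seen{$key}) {
          print $line;               # first time we've seen this key
      }
      $records_seen{$key} = 1;       # only the key matters, not the value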

      I used $line because I was originally going to mention that you could use the hash to store the record contents for each record you kept. That way, you could print them in any order you wanted after reading the file, rather than writing them as you find them. However, that would only make things a bit more complex for no added value.

      If you wanted to just store all the records and then print them in a particular order, you would change the program to something like this:

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Records separated by blank line
      $/ = "\n\n";

      # Records we've seen before
      my %records_seen;

      while (my $line = <DATA>) {
          # Get list of key fields for record
          my @key_fields = (split /\s+/, $line)[ -2, -8, -14, -5, -11, -17 ];

          # Create composite key for record
          my $key = join("|", @key_fields);

          # Store record if we haven't seen it
          if (! exists $records_seen{$key}) {
              $records_seen{$key} = $line;
          }
      }

      # Print them in order
      for my $key (sort keys %records_seen) {
          print $records_seen{$key};
      }

      __DATA__
      A 83 GLU A 90 GLU^?
      A 163 ARG A 83 ARG^?
      A 222 ARG A 5 ARG^?

      A 229 ALA A 115 ALA~?
      A 257 ALA A 118 ALA~?
      A 328 ASP A 95 ASP~?

      A 83 GLU A 90 GLU^?
      A 163 ARG A 83 ARG^?
      A 222 ARG A 5 ARG^?

      A 83 GLU B 90 GLU^?
      A 163 ARG B 83 ARG^?
      A 222 ARG B 5 ARG^?

      Here, we just store the records in the loop without printing anything. Then, after reading the entire file, we sort the keys and print the corresponding records. Running this program generates the same output as the earlier version.

      The advantage of this method is that you can sort the records and print them in any order you like. The disadvantage is that since all the records are stored in memory, you can run out of memory (or degrade other programs' performance) for very large files. For my machine and usual workload, processing files less than about a gigabyte is just fine. Your mileage may vary...
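
      For example, if you wanted the output ordered numerically by the first number in the composite key rather than by the default string sort (just one hypothetical ordering), you could swap the final loop for something like this:

      # Sort numerically by the first field of the "|"-joined key
      for my $key (sort { (split /\|/, $a)[0] <=> (split /\|/, $b)[0] } keys %records_seen) {
          print $records_seen{$key};
      }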

      ...roboticus