nurulnad:
Whenever you want to detect duplicates and/or unique values, one thing that should come to mind is the hash, since each key can appear in it only once. So in your case, you can just build a key for each incoming record: if the key does not exist in the hash, you process the record, and otherwise you ignore it. Then be sure to enter the key into the hash.
Here's a quick[1] modification to your program to use a hash to detect and eliminate duplicate values:
#!/usr/bin/perl
use strict;
use warnings;

# Records separated by blank line
$/ = "\r\n\r\n";

# Records we've seen before
my %records_seen;

while (my $line = <DATA>) {
    # Get list of key fields for record
    my @key_fields = (split /\s+/, $line)[ -2, -8, -14, -5, -11, -17 ];

    # Create composite key for record
    my $key = join("|", @key_fields);

    # Process the record if we haven't seen it before
    if (! exists $records_seen{$key}) {
        print $line;
    }

    # Remember that we've processed the record
    $records_seen{$key} = $line;
}
__DATA__
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 83 GLU B 90 GLU^?
A 163 ARG B 83 ARG^?
A 222 ARG B 5 ARG^?
Running this gives us:
Roboticus@Roboticus-PC /robo/Desktop
$ perl 856427.pl
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?
Roboticus@Roboticus-PC /robo/Desktop
$
Here are some assorted notes on your code, to explain why there are so many differences between your code and mine:
- Indentation matters! Even in a program as simple as this one, using proper indentation makes the code easier to read and understand. If your editor doesn't let you easily handle indentation, you need to find and use a better editor! (Insert vim, emacs, etc. plug here.)
- Variable names matter! A variable name should make your program more readable. Names like $i and $j are fine for loop counters because that's a popular convention, but here something like $key1, $key2, $key3, etc. would have been more useful than $i, $j, $k, etc. Having said that, though...
- Use a collection for a group of related data items instead of a set of parallel scalar values. This gives you the opportunity to use a better variable name *and* lets you keep related values together. So I renamed your $i, $j, ... $n to an array named @key_fields. If you had kept your program as it was except for this change, you could have had a single assignment at the end of your while loop:
@prev_key_fields = @key_fields; That alone would make your code shorter and easier to read (see the sketch after this list).
- Parsing can be costly, so I avoid splitting the same text multiple times. You could have merged your assignment statements into a single split like this:
my ($i_new, $j_new, $k_new, $l_new, $m_new, $n_new)
    = (split /\s+/, $line)[ -2, -8, -14, -5, -11, -17 ];
This would still take two lines, is no harder to understand, and splits the line a single time.
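For illustration only (not from the original post), here is a minimal sketch of the loop with both of those changes applied, assuming the original program only compared each record against the previous one; it therefore only removes adjacent duplicates:

my @prev_key_fields;
while (my $line = <DATA>) {
    my @key_fields = (split /\s+/, $line)[ -2, -8, -14, -5, -11, -17 ];
    print $line unless "@key_fields" eq "@prev_key_fields";
    @prev_key_fields = @key_fields;   # one assignment instead of six
}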
...roboticus
Update: Added this footnote:
[1] Apparently not *very* quick, as baxy77bax posted a similar reply some 30 minutes earlier as I was composing this node...
Update: Corrected "If the key exists in the hash" to "If the key does not exist in the hash" in the first paragraph.
Thank you so much. Your explanation is so simple and clear. Thanks for your time and effort.
Hey, sorry, I thought I got it, until I tried to write it again without referring to your script and found that I am confused about this line:
# Remember that we've processed the record
$records_seen{$key} = $line;
From your explanation I guess this is to "enter the key into the hash". But from what I understand, "$records_seen{$key} = $line" puts $line into the hash so that the hash remembers that line the next time it sees it. However, I only want to compare the numbers and not the whole $line. I'm not understanding it right, am I? I hope you can explain.
nurulnad:
Sorry, I guess I should've used $records_seen{$key}=1; or mentioned why I put $line in there. So here goes: For your application, you really only care whether the record exists or not, so the value stored in the hash doesn't matter. If the key exists, then we've seen the record before; otherwise we haven't.
I used $line because I was originally going to mention that you could use the hash to store the record contents for each record you kept. That way, you could print them in any order you wanted after reading the file, rather than writing them as you find them. However, that would only make things a bit more complex for no added value.
If you wanted to just store all the records and then print them in a particular order, you would change the program to something like this:
#!/usr/bin/perl
use strict;
use warnings;

# Records separated by blank line
$/ = "\n\n";

# Records we've seen before
my %records_seen;

while (my $line = <DATA>) {
    # Get list of key fields for record
    my @key_fields = (split /\s+/, $line)[ -2, -8, -14, -5, -11, -17 ];

    # Create composite key for record
    my $key = join("|", @key_fields);

    # Store record if we haven't seen it
    if (! exists $records_seen{$key}) {
        $records_seen{$key} = $line;
    }
}

# Print them in order
for my $key (sort keys %records_seen) {
    print $records_seen{$key};
}
__DATA__
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 83 GLU B 90 GLU^?
A 163 ARG B 83 ARG^?
A 222 ARG B 5 ARG^?
Here, we just store the records in the loop and print nothing. Then, after reading the entire file, we sort the records and then print them. Running this program generates the same output as the earlier version.
The advantage of this method is that you can sort the records and print them in any order you like. The disadvantage is that, since all the records are stored in memory, you can run out of memory (or degrade other programs' performance) on very large files. For my machine and usual workload, processing files smaller than about a gigabyte is just fine. Your mileage may vary...
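As an aside not in the original reply: if the input ever grows too large for the hash to stay in memory, one common workaround is to tie the hash to a disk file and go back to printing records as they arrive. A rough sketch using the DB_File module, assuming it is installed (the filename seen.db is made up):

#!/usr/bin/perl
use strict;
use warnings;
use DB_File;    # ties the hash to an on-disk Berkeley DB file

# Records separated by blank line, as in the scripts above
$/ = "\n\n";

# Start fresh each run, otherwise keys from a previous run linger
unlink 'seen.db';

# %records_seen now lives on disk rather than in RAM
tie my %records_seen, 'DB_File', 'seen.db'
    or die "Cannot tie seen.db: $!";

while (my $line = <>) {
    my $key = join '|', (split /\s+/, $line)[ -2, -8, -14, -5, -11, -17 ];
    print $line unless exists $records_seen{$key};
    $records_seen{$key} = 1;
}

untie %records_seen;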
...roboticus
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);

# uniq() removes exact duplicate lines, keeping the first occurrence of each
my @lines = uniq(<DATA>);
print @lines, "\n";
__DATA__
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 83 GLU B 90 GLU^?
A 163 ARG B 83 ARG^?
A 222 ARG B 5 ARG^?
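A note not in the original reply: uniq compares whole lines, so it only removes exact duplicates. If, as discussed above, only the numeric fields should decide uniqueness, something keyed on the numbers is needed instead. A rough sketch of that variation:

#!/usr/bin/perl
use strict;
use warnings;

# Key the uniqueness test on the numeric fields only, so that lines which
# differ elsewhere (for example the chain letter) still collapse to one record.
my %seen;
my @lines = grep {
    my $key = join ':', /(\d+)/g;   # e.g. "83:90" for the first line
    !$seen{$key}++;
} <DATA>;
print @lines;

__DATA__
A 83 GLU A 90 GLU
A 163 ARG A 83 ARG
A 83 GLU B 90 GLU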
Another way to do it would be to read the data in paragraph mode so that each readline gets a whole record rather than a single line. You could then extract all the numeric data from the record using a global regex, join the numbers with another character (to avoid false positives such as 9 and 87 versus 98 and 7), then use the resulting string as a hash key to eliminate duplicates.
use strict;
use warnings;

my %seen = ();

{
    local $/ = q{};    # Paragraph mode

    while ( <DATA> )
    {
        my $key = join q{:}, m{(\d+)}g;
        print unless $seen{ $key }++;
    }
}
__DATA__
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 83 GLU B 90 GLU^?
A 163 ARG B 83 ARG^?
A 222 ARG B 5 ARG^?
The output.
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?
I hope this is helpful.
Thank you. I already read my data as paragraphs by putting
$/ = "
";
which is sort of dumb, but it works.
Thanks for pointing out the regex and for presenting a different way than roboticus's to use a hash.
I don't think I understand exactly what you wish to delete, but if I understood you correctly, this is what you wish to achieve:
convert this:
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 229 ALA A 115 ALA~?
A 257 ALA A 118 ALA~?
A 328 ASP A 95 ASP~?
A 83 GLU A 90 GLU^?
A 163 ARG A 83 ARG^?
A 222 ARG A 5 ARG^?
A 83 GLU B 90 GLU^?
A 163 ARG B 83 ARG^?
A 222 ARG B 5 ARG^?
into this:
A 83 GLU B 90 GLU
A 163 ARG B 83 ARG
A 222 ARG B 5 ARG
A 229 ALA A 115 ALA
A 257 ALA A 118 ALA
A 328 ASP A 95 ASP
right?
Code:
#!/usr/bin/perl
use strict;

my (%hash, %hash_key);
my $x = 0;

while (<DATA>) {
    my @array = split(' ', $_);
    $x++;
    $hash{"$array[1]-$array[4]"} = $_;
    $hash_key{$x} = "$array[1]-$array[4]";
}

foreach my $i (sort { $a <=> $b } keys %hash_key) {
    (exists $hash{$hash_key{$i}})
        ? (print "$hash{$hash_key{$i}}")
        : (print "deleted\n");
    delete($hash{$hash_key{$i}}) if (exists $hash{$hash_key{$i}});
}
__DATA__
A 83 GLU A 90 GLU
A 163 ARG A 83 ARG
A 222 ARG A 5 ARG
A 229 ALA A 115 ALA
A 257 ALA A 118 ALA
A 328 ASP A 95 ASP
A 83 GLU A 90 GLU
A 163 ARG A 83 ARG
A 222 ARG A 5 ARG
A 83 GLU B 90 GLU
A 163 ARG B 83 ARG
A 222 ARG B 5 ARG
baxy
UPDATE:
Sorry, I had to go as soon as I posted the reply (reason: girlfriend).
Here is a more elegant solution; the first one has some bugs and limitations due to me being in a hurry ;)
Code:
#!/usr/bin/perl
use strict;

my (%hash, %hash_key);   # hashes
my $x = 0;               # counter

while (<DATA>) {                       # read the data line by line
    my @array = split(' ', $_);        # split the line on whitespace
    $x++;                              # global counter
    $hash{$array[1]}->{$array[4]} = $_;        # primary database
    $hash_key{$x} = [ $array[1], $array[4] ];  # key database (preserves order)
}

foreach my $i (sort { $a <=> $b } keys %hash_key) {
    # if the record still exists in the database (hash), print it,
    # otherwise print 'deleted'
    (exists $hash{$hash_key{$i}->[0]}->{$hash_key{$i}->[1]})
        ? (print "$hash{$hash_key{$i}->[0]}->{$hash_key{$i}->[1]}")
        : (print "deleted\n");

    # you need the empty lines, so if you reached an empty line,
    # skip the deleting part
    next if ($hash_key{$i}->[0] eq '');

    # once an entry has been printed, delete it so duplicates are not
    # printed again; this also covers an 80/90 versus 90/80 situation
    delete($hash{$hash_key{$i}->[0]}->{$hash_key{$i}->[1]})
        if (exists $hash{$hash_key{$i}->[0]}->{$hash_key{$i}->[1]}
            || $hash{$hash_key{$i}->[1]}->{$hash_key{$i}->[0]});
}
__DATA__
A 83 GLU A 90 GLU
A 163 ARG A 83 ARG
A 222 ARG A 5 ARG
A 229 ALA A 115 ALA
A 257 ALA A 118 ALA
A 328 ASP A 95 ASP
A 83 GLU A 90 GLU
A 163 ARG A 83 ARG
A 222 ARG A 5 ARG
A 83 GLU B 90 GLU
A 163 ARG B 83 ARG
A 222 ARG B 5 ARG
So what happens: when you think about removing duplicates, think about hashes. The first hash is the actual database that holds all the data, and the second one preserves the order in which the records arrived. Once you have hashed your data, all you have to do is print it in the order in which you saved it, using the second hash (%hash_key). The deletion that follows is there so you don't print duplicates, except for the blank lines. You can remove the 'deleted' note if you don't need it.
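For comparison only (not from the original post), here is a stripped-down sketch of the same two-structure idea: one hash remembers which pairs of numbers have been seen, and an array keeps the original order. Unlike the code above, it simply keeps the first line seen for each pair and prints no 'deleted' markers:

#!/usr/bin/perl
use strict;
use warnings;

my %seen;       # the "database": which pairs of numbers have we met?
my @in_order;   # the kept records, in their original order

while (my $line = <DATA>) {
    # the two residue numbers are the 2nd and 5th whitespace-separated fields
    my ($first, $second) = (split ' ', $line)[1, 4];
    push @in_order, $line unless $seen{"$first-$second"}++;
}
print @in_order;

__DATA__
A 83 GLU A 90 GLU
A 163 ARG A 83 ARG
A 83 GLU A 90 GLU
A 83 GLU B 90 GLU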
baxy
PS: If you have any questions about the code, just shoot; for example, you may not be familiar with the ternary form ($a == 1) ? (print "yes") : (print "no"), and you did state that you are new to Perl and all...
Yes. To be honest, your code seems a little intimidating to me. If you have free time, would you care to explain it a little? Of course I'll google all of this too. Anyway, thanks a lot for your time.
open (my $INPUT,  '<', 'bodo.txt');
open (my $OUTPUT, '>', 'output.txt');

my %seen;
while (<$INPUT>) {
    my $line = $_;
    $line =~ /\A\D+(\d+)\D+(\d+)/;
    print $OUTPUT $line unless (defined($seen{"$1\t$2"}) and $line ne "\n");
    $seen{"$1\t$2"}++;
}
Not really clever, but pretty clean and hopefully easy to follow. I used "$1\t$2" as the key for %seen, as I didn't think you wanted Blah 83 Blah 90 to prevent Blah 8 Blah 390 from printing. I also wasn't sure whether you wanted to keep the blank lines, but I thought I'd keep them. The regex could be optimized, but that would probably need better information about the dataset, etc.
Update: I also wanted to point out that you are assuming each line of data will be unique with respect to the numerical values in it. You could instead store the line for each unique value set and then check later lines for variations, to see whether a different line with the same numbers has already been added. If that is how you need to go, let us know, because we'll have to tweak our solutions for you.
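In case that is the situation, here is a rough sketch of what such a check might look like (my own illustration, not a drop-in fix; the name %first_line_for is made up): it keeps the first line seen for each pair of numbers, and warns whenever a later line carries the same numbers but different text, so collisions don't pass silently.

use strict;
use warnings;

my %first_line_for;
while (my $line = <DATA>) {
    next unless $line =~ /\A\D+(\d+)\D+(\d+)/;
    my $key = "$1\t$2";
    if (!exists $first_line_for{$key}) {
        # first time we meet this pair of numbers: remember and print it
        $first_line_for{$key} = $line;
        print $line;
    }
    elsif ($first_line_for{$key} ne $line) {
        # same numbers, different text: flag it rather than drop it silently
        warn "Variant record for $key: $line";
    }
}

__DATA__
A 83 GLU A 90 GLU
A 83 GLU B 90 GLU
A 163 ARG A 83 ARG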