Re: Removing partially duplicated lines from a file

Try something like this:

#!/usr/bin/perl
use warnings;
use strict;

open(my $in_fh, '<', 'input.txt') or die $!;
open(my $out_fh, '>', 'output.txt') or die $!;

my %seen_lines;
while (<$in_fh>)
{
    chomp;
    my @columns = split;
    
    if ($columns[1] and $columns[1] =~ /^HLA-A/)
    {
        my $HLA_Peptide = $columns[1] . $columns[2];
        print $out_fh "$_\n" if (!exists $seen_lines{$HLA_Peptide});
        $seen_lines{$HLA_Peptide} = 1;
    }
    else
    {
        print $out_fh "$_\n";
    }
}
close $out_fh;
close $in_fh;

I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

Re^2: Removing partially duplicated lines from a file by Sandy_Bio_Perl (Beadle) on Jul 26, 2016 at 21:08 UTC
Thank you. This works well, but I don't understand all of your code. For example, why do we need to say `if ($columns[1] and $columns[1] =~ /^HLA-A/)`, i.e. with the same element referenced twice? Also, I would like to send the output to a variable rather than print it to a file. I know this should be a minor change to your great code, but I can't seem to make it work. Could you help, please? (My novice-level skills are showing.)
Re^3: Removing partially duplicated lines from a file by perldigious (Priest) on Jul 26, 2016 at 21:28 UTC
The line of code you asked about basically says: if `$columns[1]` is true (has any value Perl evaluates as true) and contains a string that begins with "HLA-A", then take the following actions. I included the first "does it have a true value" check because I assumed `use warnings;` would complain about any line that didn't have an element at index 1 in `@columns`. I didn't actually try it without the check, but I assumed that would happen for at least the all "---" lines.

As for the code changes you requested:

#!/usr/bin/perl
use warnings;
use strict;

open(my $in_fh, '<', 'input.txt') or die $!;

my $output = '';    # start with an empty string so printing never sees undef
my %seen_lines;
while (<$in_fh>)
{
    chomp;
    my @columns = split;

    if ($columns[1] and $columns[1] =~ /^HLA-A/)
    {
        my $HLA_Peptide = $columns[1] . $columns[2];
        $output .= "$_\n" if (!exists $seen_lines{$HLA_Peptide});
        $seen_lines{$HLA_Peptide} = 1;
    }
    else
    {
        $output .= "$_\n";
    }
}
close $in_fh;

print $output;

EDIT: I did just try it without that first check and I was correct: it does throw warnings without it. There may be a better way to avoid that warning (it does occur to me that false values like "0" or an empty string would also be treated as false by that check), but I use this trick a lot to appease `use warnings;` or "-w". I wonder if there is something like `exists`, which I use a lot for hashes, that is meant for checking whether an array element exists?
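For anyone who wants the "output in a variable" version without touching the filesystem, here is a minimal sketch of the same idea wrapped in a subroutine. The name `filter_hla_duplicates` and the sample lines are hypothetical, made up purely for illustration, and the column layout (allele in the second column, peptide in the third) is only assumed from the code above. One deliberate difference: the key is built with a tab between the two fields instead of plain concatenation, so that two different allele/peptide pairs cannot run together into the same key.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the first line seen for each (allele, peptide)
# pair among HLA-A lines and passes every other line through unchanged.
sub filter_hla_duplicates {
    my @lines = @_;
    my %seen;
    my $output = '';
    for my $line (@lines) {
        my @columns = split ' ', $line;
        if ($columns[1] and $columns[1] =~ /^HLA-A/) {
            # split leaves no whitespace inside a field, so a tab is a
            # safe separator; a missing third column becomes ''.
            my $key = join "\t", $columns[1], $columns[2] // '';
            next if $seen{$key}++;    # skip second and later occurrences
        }
        $output .= "$line\n";
    }
    return $output;
}

# Made-up sample data, only to show the behaviour.
my @sample = (
    '-----',
    '1 HLA-A*01:01 AAAWYLWEV 0.5',
    '2 HLA-A*01:01 AAAWYLWEV 0.7',
    '3 HLA-A*02:01 AAAWYLWEV 0.9',
    'header line',
);
print filter_hla_duplicates(@sample);
```

With this sample, the line starting with `2` is dropped because its allele and peptide repeat those of line `1`; the other four lines come back in their original order.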
Re^4: Removing partially duplicated lines from a file by AnomalousMonk (Archbishop) on Jul 27, 2016 at 00:55 UTC
"... that first check ... a better way to avoid that warning ... something like `exists` ..."

`defined` is the way I would typically finesse this problem:

if (defined($columns[1]) && $columns[1] =~ /^HLA-A/) {
    ...
}

In the case of your posted code, the empty string and `'0'` will not, as you say, be tested against the regex, and in this particular case it will not matter because they cannot match anyway. In the general case, I think it's better to use `defined` because you can better avoid the "It'll never happen... Oh, it does happen..." situations that lead to those wonderful 3 AM debug sessions.

Give a man a fish: `<%-{-{-{-<`
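To make the difference between the two checks concrete, here is a small sketch that runs both against a few made-up values. The helper names `check_truthy` and `check_defined` are hypothetical and exist only for this illustration. On the earlier question about `exists`: it does accept array elements, but the perlfunc documentation strongly discourages that use, so `defined` (or comparing the index against the array length) is the usual choice.

```perl
use strict;
use warnings;

# Hypothetical helpers that mirror the two guard styles discussed above.
sub check_truthy  { my ($value) = @_; return $value ? 1 : 0 }
sub check_defined { my ($value) = @_; return defined $value ? 1 : 0 }

# Made-up candidate values for a second column.
for my $value (undef, '', '0', 'HLA-A*02:01') {
    my $label = defined $value ? "'$value'" : 'undef';
    printf "%-14s truthy=%d defined=%d\n",
        $label, check_truthy($value), check_defined($value);
}
```

Only `undef` fails both checks. The empty string and `'0'` are defined but false, which is exactly the case where the two guards disagree.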
Re^5: Removing partially duplicated lines from a file by perldigious (Priest) on Jul 27, 2016 at 13:03 UTC
Re^6: Removing partially duplicated lines from a file by AnomalousMonk (Archbishop) on Jul 27, 2016 at 15:36 UTC
Some notes below your chosen depth have not been shown here
Re^4: Removing partially duplicated lines from a file by Sandy_Bio_Perl (Beadle) on Jul 26, 2016 at 21:35 UTC
Thank you, perldigious. I am very, very grateful.
Re^3: Removing partially duplicated lines from a file by harangzsolt33 (Deacon) on Jul 26, 2016 at 21:44 UTC
Okay. I am commenting here just because I thought of another way to solve this problem. What if you sort the lines before you try to eliminate the duplicates? That way the same lines will fall right next to each other, and you can just skip them by comparing this line to the previous line. If the two are the same, then you can skip that because it's a duplicate. This is a good idea if you don't expect to have a lot of duplicate lines and you plan to sort the output later on. Might as well sort it now and eliminate the duplicates in one step. ;-)

use strict;
use warnings;

my $ff = 'robots.txt';
my $fh;
my @lines;

# Read the entire file and
# store lines in an array
open $fh, "<", $ff or die "Sorry, can't open file - $ff\n";
{ local $/; @lines = split("\n", <$fh>); }
close $fh;

# Get rid of duplicate lines
@lines = sort(@lines);
my $L;
my $prev = '';
foreach $L (@lines)
{
    print($L . "\n") if ($prev ne $L);
    $prev = $L;
}
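The sort-then-compare idea can also be sketched as a subroutine that works on a list in memory, which makes it easy to try on a few values. The name `sorted_unique` and the sample input are hypothetical. Two caveats, as far as I can tell: this removes only lines that are identical in full (not lines that share just some columns, as in the original question), and the output comes back sorted rather than in input order. The sketch starts with an undefined "previous" value instead of an empty string so that a blank line is kept once rather than dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: sort the lines, then drop every line that is
# equal to the one immediately before it.
sub sorted_unique {
    my @sorted = sort @_;
    my @unique;
    my $prev;    # undef until the first line has been seen
    for my $line (@sorted) {
        next if defined $prev and $prev eq $line;
        push @unique, $line;
        $prev = $line;
    }
    return @unique;
}

# Made-up input, including one empty line.
my @result = sorted_unique('b', 'a', 'b', '', 'a');
print "[$_]\n" for @result;
```

For this input the result is the empty string, then `a`, then `b`, each exactly once.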