Sandy_Bio_Perl has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks. I am trying to remove all lines where parts of the line are duplicated. In the file below, I would like to remove all 'duplicate' lines (leaving the first one) where column 2 (HLA) e.g. HLA-A*11:01 and column 3 (Peptide) e.g. YVNVNMGLK are the same. I would like to leave the other lines intact. I am a bit flummoxed, having tried and failed with regex!

-----------------------------------------------------------------------------------
Pos HLA         Peptide   Core      Of Gp Gl Ip Il Icore     Identity        Score   Aff(nM) %Rank BindLevel
-----------------------------------------------------------------------------------
117 HLA-A*11:01 YVNVNMGLK YVNVNMGLK 0  0  0  0  0  YVNVNMGLK GQ924620_HBe_C_ 0.62268 59.3    0.40  <= SB
28  HLA-A*11:01 WGMDIDPYK WGMDIDPYK 0  0  0  0  0  WGMDIDPYK GQ924620_HBe_C_ 0.44617 400.4   1.60  <= WB
133 HLA-A*11:01 HISCLTFGR HISCLTFGR 0  0  0  0  0  HISCLTFGR GQ924620_HBe_C_ 0.43660 444.0   1.70  <= WB
-----------------------------------------------------------------------------------
Pos HLA         Peptide   Core      Of Gp Gl Ip Il Icore     Identity        Score   Aff(nM) %Rank BindLevel
-----------------------------------------------------------------------------------
47  HLA-A*02:05 YVNVNMGLK FLPSDFFPS 0  0  0  0  0  FLPSDFFPS X02763_HBe_A_po 0.77090 11.9    0.08  <= SB
40  HLA-A*02:05 ATVELLSFL ATVELLSFL 0  0  0  0  0  ATVELLSFL X02763_HBe_A_po 0.75279 14.5    0.10  <= SB
1   HLA-A*02:05 MQLFHLCLI MQLFHLCLI 0  0  0  0  0  MQLFHLCLI X02763_HBe_A_po 0.66669 36.8    0.30  <= SB
9   HLA-A*02:05 IISCTCPTV IISCTCPTV 0  0  0  0  0  IISCTCPTV X02763_HBe_A_po 0.52206 176.1   1.40  <= WB
147 HLA-A*02:05 YLVSFGVWI YLVSFGVWI 0  0  0  0  0  YLVSFGVWI X02763_HBe_A_po 0.51724 185.5   1.40  <= WB
55  HLA-A*02:05 SVRDLLDTA SVRDLLDTA 0  0  0  0  0  SVRDLLDTA X02763_HBe_A_po 0.49966 224.4   1.70  <= WB
114 HLA-A*02:05 VVNYVNTNV VVNYVNTNV 0  0  0  0  0  VVNYVNTNV X02763_HBe_A_po 0.48729 256.6   1.80  <= WB
93  HLA-A*02:05 ELMTLATWV ELMTLATWV 0  0  0  0  0  ELMTLATWV X02763_HBe_A_po 0.46686 320.0   2.50
8   HLA-A*02:05 LIISCTCPT LIISCTCPT 0  0  0  0  0  LIISCTCPT X02763_HBe_A_po 0.45053 381.9   2.50
-----------------------------------------------------------------------------------
Pos HLA         Peptide   Core      Of Gp Gl Ip Il Icore     Identity        Score   Aff(nM) %Rank BindLevel
-----------------------------------------------------------------------------------
117 HLA-A*11:01 IISCTCPTV YVNVNMGLK 0  0  0  0  0  YVNVNMGLK AB219428_HBe_B_ 0.62268 59.3    0.40  <= SB
28  HLA-A*11:01 WGMDIDPYK WGMDIDPYK 0  0  0  0  0  WGMDIDPYK AB219428_HBe_B_ 0.44617 400.4   1.60  <= WB
133 HLA-A*11:01 HISCLTFGR HISCLTFGR 0  0  0  0  0  HISCLTFGR AB219428_HBe_B_ 0.43660 444.0   1.70  <= WB

Many thanks for any help you are able to give

Re: Removing partially duplicated lines from a file
by perldigious (Priest) on Jul 26, 2016 at 16:23 UTC

    Hi Sandy_Bio_Perl,

    Try something like this:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(my $in_fh, '<', 'input.txt') or die $!;
    open(my $out_fh, '>', 'output.txt') or die $!;

    my %seen_lines;
    while (<$in_fh>) {
        chomp;
        my @columns = split;
        if ($columns[1] and $columns[1] =~ /^HLA-A/) {
            my $HLA_Peptide = $columns[1] . $columns[2];
            print $out_fh "$_\n" if (!exists $seen_lines{$HLA_Peptide});
            $seen_lines{$HLA_Peptide} = 1;
        }
        else {
            print $out_fh "$_\n";
        }
    }

    close $out_fh;
    close $in_fh;

    I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
    I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

      Thank you. This works well, but I don't understand all your code. For example, why do we need to say

      if ($columns[1] and $columns[1] =~ /^HLA-A/)

      e.g. with the same reference used twice? Also, I would like to send the output to a variable and not print to a file. I know this should seem like a minor change to your great code, but I can't seem to make it work. Could you help please? (My novice level skills are showing)

        The line of code you asked about basically says: if $columns[1] is true (has any value Perl evaluates as true) and contains a string that begins with "HLA-A", then take the following actions. I included the first "does it have a true value" check because I assumed use warnings; would complain about any line that has no element at index 1 in @columns. I didn't actually try it without the check, but I assumed that would happen at least for the all-"---" lines.

        As for the code changes you requested:

        #!/usr/bin/perl
        use warnings;
        use strict;

        open(my $in_fh, '<', 'input.txt') or die $!;

        my $output;
        my %seen_lines;
        while (<$in_fh>) {
            chomp;
            my @columns = split;
            if ($columns[1] and $columns[1] =~ /^HLA-A/) {
                my $HLA_Peptide = $columns[1] . $columns[2];
                $output .= "$_\n" if (!exists $seen_lines{$HLA_Peptide});
                $seen_lines{$HLA_Peptide} = 1;
            }
            else {
                $output .= "$_\n";
            }
        }

        close $in_fh;
        print $output;

        EDIT: I did just try it without that first check and I was correct: it does throw warnings without it. There may be a better way to avoid that warning (it does occur to me that false values like "0" or an empty string would be evaluated as false too), but I use this trick a lot in an attempt to appease use warnings; or "-w". I wonder whether there is something like exists, which I use a lot for hashes, that is meant for checking whether an array element exists?
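One possible answer to the question just raised, offered as a sketch and not as the method used in this thread: `defined` tests whether an array element holds a value at all, so it silences the "uninitialized value" warning without rejecting a field that happens to be "0". The helper name `is_hla_row` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true when the second
# whitespace-separated field exists and starts with "HLA-A".
sub is_hla_row {
    my ($line) = @_;
    my @columns = split ' ', $line;

    # defined() guards against short lines (e.g. the "---" separators)
    # where $columns[1] was never set, and, unlike a plain truth test,
    # it does not reject a field whose value is "0".
    return ( defined $columns[1] && $columns[1] =~ /^HLA-A/ ) ? 1 : 0;
}

print is_hla_row('117 HLA-A*11:01 YVNVNMGLK'), "\n";
print is_hla_row('----------------'), "\n";
```

Comparing the element count, as in `@columns > 1`, is another way to check that index 1 is populated before using it.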

        Okay. I am commenting here just because I thought of another way to solve this problem. What if you sort the lines before you try to eliminate the duplicates? That way the same lines will fall right next to each other, and you can just skip them by comparing this line to the previous line. If the two are the same, then you can skip that because it's a duplicate. This is a good idea if you don't expect to have a lot of duplicate lines and you plan to sort the output later on. Might as well sort it now and eliminate the duplicates in one step. ;-)
        use strict;
        use warnings;

        my $ff = 'robots.txt';
        my $fh;
        my @lines;

        # Read the entire file and
        # store lines in an array
        open $fh, "<", $ff or die "Sorry, can't open file - $ff\n";
        { local $/; @lines = split("\n", <$fh>); }
        close $fh;

        # Get rid of duplicate lines
        @lines = sort(@lines);
        my $L;
        my $prev = '';
        foreach $L (@lines) {
            print($L . "\n") if ($prev ne $L);
            $prev = $L;
        }
Re: Removing partially duplicated lines from a file
by NetWallah (Canon) on Jul 26, 2016 at 16:39 UTC
    Would this one-liner suffice?
    >perl -ane "next if $seen{$F[1].$F[2]}++;print" InputFile.txt
    -----------------------------------------------------------------------------------
    Pos HLA         Peptide   Core      Of Gp Gl Ip Il Icore     Identity        Score   Aff(nM) %Rank BindLevel
    117 HLA-A*11:01 YVNVNMGLK YVNVNMGLK 0  0  0  0  0  YVNVNMGLK GQ924620_HBe_C_ 0.62268 59.3    0.40  <= SB
    28  HLA-A*11:01 WGMDIDPYK WGMDIDPYK 0  0  0  0  0  WGMDIDPYK GQ924620_HBe_C_ 0.44617 400.4   1.60  <= WB
    133 HLA-A*11:01 HISCLTFGR HISCLTFGR 0  0  0  0  0  HISCLTFGR GQ924620_HBe_C_ 0.43660 444.0   1.70  <= WB
    47  HLA-A*02:05 YVNVNMGLK FLPSDFFPS 0  0  0  0  0  FLPSDFFPS X02763_HBe_A_po 0.77090 11.9    0.08  <= SB
    40  HLA-A*02:05 ATVELLSFL ATVELLSFL 0  0  0  0  0  ATVELLSFL X02763_HBe_A_po 0.75279 14.5    0.10  <= SB
    1   HLA-A*02:05 MQLFHLCLI MQLFHLCLI 0  0  0  0  0  MQLFHLCLI X02763_HBe_A_po 0.66669 36.8    0.30  <= SB
    9   HLA-A*02:05 IISCTCPTV IISCTCPTV 0  0  0  0  0  IISCTCPTV X02763_HBe_A_po 0.52206 176.1   1.40  <= WB
    147 HLA-A*02:05 YLVSFGVWI YLVSFGVWI 0  0  0  0  0  YLVSFGVWI X02763_HBe_A_po 0.51724 185.5   1.40  <= WB
    55  HLA-A*02:05 SVRDLLDTA SVRDLLDTA 0  0  0  0  0  SVRDLLDTA X02763_HBe_A_po 0.49966 224.4   1.70  <= WB
    114 HLA-A*02:05 VVNYVNTNV VVNYVNTNV 0  0  0  0  0  VVNYVNTNV X02763_HBe_A_po 0.48729 256.6   1.80  <= WB
    93  HLA-A*02:05 ELMTLATWV ELMTLATWV 0  0  0  0  0  ELMTLATWV X02763_HBe_A_po 0.46686 320.0   2.50
    8   HLA-A*02:05 LIISCTCPT LIISCTCPT 0  0  0  0  0  LIISCTCPT X02763_HBe_A_po 0.45053 381.9   2.50
    117 HLA-A*11:01 IISCTCPTV YVNVNMGLK 0  0  0  0  0  YVNVNMGLK AB219428_HBe_B_ 0.62268 59.3    0.40  <= SB
    UPDATE: If you want to maintain the headers, try this:
    perl -ane "next unless not $seen{$F[1].$F[2]}++ or m/^\s*\D/;print" InputFile.txt
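For readers who find the one-liner dense, the header-preserving idea can be spelled out roughly as below. This is a sketch under stated assumptions, not NetWallah's code: the sub name `dedupe_lines` is made up, a tab is placed between the two fields in the hash key as a precaution against accidental key collisions, and the header test is written as "does not start with a digit" instead of `m/^\s*\D/`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the lines to
# keep, in their original order.
sub dedupe_lines {
    my @lines = @_;
    my ( %seen, @kept );

    for my $line (@lines) {
        # Headers, separators and blank lines do not start with a
        # digit (after optional whitespace); always keep them.
        if ( $line !~ /^\s*\d/ ) {
            push @kept, $line;
            next;
        }

        my @F = split ' ', $line;    # same splitting as perl -a

        # Key on column 2 (HLA) and column 3 (Peptide); missing
        # fields become empty strings so warnings stay quiet.
        my $key = join "\t", map { defined $_ ? $_ : '' } @F[ 1, 2 ];

        # Keep only the first line seen for each key.
        push @kept, $line unless $seen{$key}++;
    }
    return @kept;
}

my @demo = (
    'Pos HLA Peptide',
    '117 HLA-A*11:01 YVNVNMGLK',
    '28 HLA-A*11:01 WGMDIDPYK',
    'Pos HLA Peptide',
    '117 HLA-A*11:01 YVNVNMGLK',
);
print "$_\n" for dedupe_lines(@demo);
```

The negated digit test is a deliberate choice: if the data rows in the real file are indented, `\s*\D` can backtrack and match the leading space itself, which would let every row through as if it were a header.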

            "Software interprets lawyers as damage, and routes around them" - Larry Wall

Re: Removing partially duplicated lines from a file
by Mandrake (Chaplain) on Jul 27, 2016 at 09:19 UTC
    If I understood the question, something like the below should work.
    #!/bin/perl -w
    use strict;

    # 'or' rather than '||', so that die actually fires when open fails
    open TMP, "input.txt" or die ("could not open the input file \n");
    my (%hash, @columns);
    for (<TMP>) {
        chomp;
        @columns = split;
        next unless ($columns[1] && $columns[2]);
        if (not exists $hash{$columns[1].$columns[2]}) {
            $hash{$columns[1].$columns[2]} = 1;
            print $_."\n";
        }
    }
    close TMP;
Re: Removing partially duplicated lines from a file
by Anonymous Monk on Jul 26, 2016 at 15:53 UTC

    I am a bit flummoxed, having tried and failed with regex!

    What did you try?

      I am still working on it.... Trying to adapt some code I used earlier. It doesn't vaguely work!

      sub removeDuplicatesFromOutputText {
          my $origfile = $_[0]; # eg. HLA-A_0_HBe_for_8_sids.txt;
          my %hTmp;
          my $outfile;
          my $tempout;
          open (IN, $origfile);
          while (my $line = <IN>) {
              if ($line =~ /^\s+\d+/){
                  next if $line =~ m/^\s*$/;
              }
              $line=~s/^\s+//;
              $line=~s/\s+$//;
              $tempout = qq{$line\n} unless ($hTmp{$line}++);
              $outfile .= $tempout;
          }
          return $outfile;
      }