Removing partially duplicated lines from a file

Sandy_Bio_Perl has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks. I am trying to remove all lines where parts of the line are duplicated. In the file below, I would like to remove all 'duplicate' lines (leaving the first one) where column 2 (HLA) e.g. HLA-A*11:01 and column 3 (Peptide) e.g. YVNVNMGLK are the same. I would like to leave the other lines intact. I am a bit flummoxed, having tried and failed with regex!

----------------------------------------------------------------------
+-------------
  Pos          HLA         Peptide       Core Of Gp Gl Ip Il        Ic
+ore        Identity   Score Aff(nM)   %Rank  BindLevel
----------------------------------------------------------------------
+-------------
  117  HLA-A*11:01       YVNVNMGLK  YVNVNMGLK  0  0  0  0  0    YVNVNM
+GLK GQ924620_HBe_C_ 0.62268    59.3    0.40 <= SB
   28  HLA-A*11:01       WGMDIDPYK  WGMDIDPYK  0  0  0  0  0    WGMDID
+PYK GQ924620_HBe_C_ 0.44617   400.4    1.60 <= WB
  133  HLA-A*11:01       HISCLTFGR  HISCLTFGR  0  0  0  0  0    HISCLT
+FGR GQ924620_HBe_C_ 0.43660   444.0    1.70 <= WB
  
----------------------------------------------------------------------
+-------------
  Pos          HLA         Peptide       Core Of Gp Gl Ip Il        Ic
+ore        Identity   Score Aff(nM)   %Rank  BindLevel
----------------------------------------------------------------------
+-------------
   47  HLA-A*02:05       YVNVNMGLK  FLPSDFFPS  0  0  0  0  0    FLPSDF
+FPS X02763_HBe_A_po 0.77090    11.9    0.08 <= SB
   40  HLA-A*02:05       ATVELLSFL  ATVELLSFL  0  0  0  0  0    ATVELL
+SFL X02763_HBe_A_po 0.75279    14.5    0.10 <= SB
    1  HLA-A*02:05       MQLFHLCLI  MQLFHLCLI  0  0  0  0  0    MQLFHL
+CLI X02763_HBe_A_po 0.66669    36.8    0.30 <= SB
    9  HLA-A*02:05       IISCTCPTV  IISCTCPTV  0  0  0  0  0    IISCTC
+PTV X02763_HBe_A_po 0.52206   176.1    1.40 <= WB
  147  HLA-A*02:05       YLVSFGVWI  YLVSFGVWI  0  0  0  0  0    YLVSFG
+VWI X02763_HBe_A_po 0.51724   185.5    1.40 <= WB
   55  HLA-A*02:05       SVRDLLDTA  SVRDLLDTA  0  0  0  0  0    SVRDLL
+DTA X02763_HBe_A_po 0.49966   224.4    1.70 <= WB
  114  HLA-A*02:05       VVNYVNTNV  VVNYVNTNV  0  0  0  0  0    VVNYVN
+TNV X02763_HBe_A_po 0.48729   256.6    1.80 <= WB
   93  HLA-A*02:05       ELMTLATWV  ELMTLATWV  0  0  0  0  0    ELMTLA
+TWV X02763_HBe_A_po 0.46686   320.0    2.50
    8  HLA-A*02:05       LIISCTCPT  LIISCTCPT  0  0  0  0  0    LIISCT
+CPT X02763_HBe_A_po 0.45053   381.9    2.50

----------------------------------------------------------------------
+-------------
  Pos          HLA         Peptide       Core Of Gp Gl Ip Il        Ic
+ore        Identity   Score Aff(nM)   %Rank  BindLevel
----------------------------------------------------------------------
+-------------
  117  HLA-A*11:01       IISCTCPTV  YVNVNMGLK  0  0  0  0  0    YVNVNM
+GLK AB219428_HBe_B_ 0.62268    59.3    0.40 <= SB
   28  HLA-A*11:01       WGMDIDPYK  WGMDIDPYK  0  0  0  0  0    WGMDID
+PYK AB219428_HBe_B_ 0.44617   400.4    1.60 <= WB
  133  HLA-A*11:01       HISCLTFGR  HISCLTFGR  0  0  0  0  0    HISCLT
+FGR AB219428_HBe_B_ 0.43660   444.0    1.70 <= WB

Many thanks for any help you are able to give


Replies are listed 'Best First'.
Re: Removing partially duplicated lines from a file by perldigious (Priest) on Jul 26, 2016 at 16:23 UTC
Hi Sandy_Bio_Perl, Try something like this:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(my $in_fh, '<', 'input.txt') or die $!;
    open(my $out_fh, '>', 'output.txt') or die $!;
    my %seen_lines;

    while (<$in_fh>) {
        chomp;
        my @columns = split;
        if ($columns[1] and $columns[1] =~ /^HLA-A/) {
            my $HLA_Peptide = $columns[1] . $columns[2];
            print $out_fh "$_\n" if (!exists $seen_lines{$HLA_Peptide});
            $seen_lines{$HLA_Peptide} = 1;
        }
        else {
            print $out_fh "$_\n";
        }
    }

    close $out_fh;
    close $in_fh;

I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
Re^2: Removing partially duplicated lines from a file by Sandy_Bio_Perl (Beadle) on Jul 26, 2016 at 21:08 UTC
Thank you. This works well, but I don't understand all your code. For example, why do we need to say `if ($columns[1] and $columns[1] =~ /^HLA-A/)`, with `$columns[1]` tested twice? Also, I would like to send the output to a variable rather than print it to a file. I know this should be a minor change to your great code, but I can't seem to make it work. Could you help please? (My novice level skills are showing)
Re^3: Removing partially duplicated lines from a file by perldigious (Priest) on Jul 26, 2016 at 21:28 UTC
The line of code you asked about basically says: if `$columns[1]` is true (has any value Perl evaluates as true) and contains a string that begins with "HLA-A", then take the following actions. I included the first "does it have a true value" check because I assumed `use warnings;` would complain for any line that didn't have an element at index 1 in `@columns`. I didn't actually try it without it, but I just assumed that would happen for at least the all "---" lines. As for the code changes you requested:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(my $in_fh, '<', 'input.txt') or die $!;
    my $output;
    my %seen_lines;

    while (<$in_fh>) {
        chomp;
        my @columns = split;
        if ($columns[1] and $columns[1] =~ /^HLA-A/) {
            my $HLA_Peptide = $columns[1] . $columns[2];
            $output .= "$_\n" if (!exists $seen_lines{$HLA_Peptide});
            $seen_lines{$HLA_Peptide} = 1;
        }
        else {
            $output .= "$_\n";
        }
    }

    close $in_fh;
    print $output;

EDIT: I did just try it without that first check and I was correct, it does throw warnings without it. There may be a better way to avoid that warning (it does occur to me that false values like "0" or an empty string would also fail the check), but I use this trick a lot in an attempt to appease `use warnings;` or "-w". I wonder if there is something like `exists`, which I use a lot for hashes, that is meant for checking whether an array element exists?
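On the `exists` question, here is a minimal sketch of one alternative: count the fields (or use `defined`) instead of relying on truthiness, so a legitimate "0" or empty field is not mistaken for a missing one. Perl's `exists` does accept array elements, but perldoc discourages that use. The helper name `dedupe_by_hla_peptide`, the tab-joined key, and the sample lines are assumptions made for illustration, not part of the replies above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the first line for each (HLA, Peptide) pair.
sub dedupe_by_hla_peptide {
    my @lines = @_;
    my (%seen, @kept);
    for my $line (@lines) {
        my @columns = split ' ', $line;
        # Testing the field count first means $columns[1] is never read
        # when it is undefined, so "use warnings" stays quiet.
        if (@columns >= 3 and $columns[1] =~ /^HLA-A/) {
            # Join with a separator so "AB"."C" and "A"."BC" cannot collide.
            my $key = join "\t", @columns[1, 2];
            next if $seen{$key}++;
        }
        push @kept, $line;
    }
    return @kept;
}

my @sample = (
    '  Pos          HLA         Peptide',
    '  117  HLA-A*11:01       YVNVNMGLK  first',
    '   28  HLA-A*11:01       WGMDIDPYK  first',
    '  117  HLA-A*11:01       YVNVNMGLK  repeat',
    '----------',
);
print "$_\n" for dedupe_by_hla_peptide(@sample);
```

Header and rule lines fall through to the `push` untouched, because they either have too few fields or no `HLA-A` prefix in the second field.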
Re^4: Removing partially duplicated lines from a file by AnomalousMonk (Archbishop) on Jul 27, 2016 at 00:55 UTC
Re^5: Removing partially duplicated lines from a file by perldigious (Priest) on Jul 27, 2016 at 13:03 UTC
Re^4: Removing partially duplicated lines from a file by Sandy_Bio_Perl (Beadle) on Jul 26, 2016 at 21:35 UTC
Re^3: Removing partially duplicated lines from a file by harangzsolt33 (Deacon) on Jul 26, 2016 at 21:44 UTC
Okay. I am commenting here just because I thought of another way to solve this problem. What if you sort the lines before you try to eliminate the duplicates? That way the same lines will fall right next to each other, and you can just skip them by comparing this line to the previous line. If the two are the same, then you can skip that because it's a duplicate. This is a good idea if you don't expect to have a lot of duplicate lines and you plan to sort the output later on. Might as well sort it now and eliminate the duplicates in one step. ;-)

    use strict;
    use warnings;

    my $ff = 'robots.txt';
    my $fh;
    my @lines;

    # Read the entire file and
    # store lines in an array
    open $fh, "<", $ff or die "Sorry, can't open file - $ff\n";
    { local $/; @lines = split("\n", <$fh>); }
    close $fh;

    # Get rid of duplicate lines
    @lines = sort(@lines);
    my $L;
    my $prev = '';
    foreach $L (@lines) {
        print($L . "\n") if ($prev ne $L);
        $prev = $L;
    }
Re: Removing partially duplicated lines from a file by NetWallah (Canon) on Jul 26, 2016 at 16:39 UTC
Would this one-liner suffice?

    >perl -ane "next if $seen{$F[1].$F[2]}++;print" InputFile.txt
    ----------------------------------------------------------------------
    +-------------
      Pos          HLA         Peptide       Core Of Gp Gl Ip Il        Ic
    +ore        Identity   Score Aff(nM)   %Rank  BindLevel
      117  HLA-A*11:01       YVNVNMGLK  YVNVNMGLK  0  0  0  0  0    YVNVNM
    +GLK GQ924620_HBe_C_ 0.62268    59.3    0.40 <= SB
       28  HLA-A*11:01       WGMDIDPYK  WGMDIDPYK  0  0  0  0  0    WGMDID
    +PYK GQ924620_HBe_C_ 0.44617   400.4    1.60 <= WB
      133  HLA-A*11:01       HISCLTFGR  HISCLTFGR  0  0  0  0  0    HISCLT
    +FGR GQ924620_HBe_C_ 0.43660   444.0    1.70 <= WB
       47  HLA-A*02:05       YVNVNMGLK  FLPSDFFPS  0  0  0  0  0    FLPSDF
    +FPS X02763_HBe_A_po 0.77090    11.9    0.08 <= SB
       40  HLA-A*02:05       ATVELLSFL  ATVELLSFL  0  0  0  0  0    ATVELL
    +SFL X02763_HBe_A_po 0.75279    14.5    0.10 <= SB
        1  HLA-A*02:05       MQLFHLCLI  MQLFHLCLI  0  0  0  0  0    MQLFHL
    +CLI X02763_HBe_A_po 0.66669    36.8    0.30 <= SB
        9  HLA-A*02:05       IISCTCPTV  IISCTCPTV  0  0  0  0  0    IISCTC
    +PTV X02763_HBe_A_po 0.52206   176.1    1.40 <= WB
      147  HLA-A*02:05       YLVSFGVWI  YLVSFGVWI  0  0  0  0  0    YLVSFG
    +VWI X02763_HBe_A_po 0.51724   185.5    1.40 <= WB
       55  HLA-A*02:05       SVRDLLDTA  SVRDLLDTA  0  0  0  0  0    SVRDLL
    +DTA X02763_HBe_A_po 0.49966   224.4    1.70 <= WB
      114  HLA-A*02:05       VVNYVNTNV  VVNYVNTNV  0  0  0  0  0    VVNYVN
    +TNV X02763_HBe_A_po 0.48729   256.6    1.80 <= WB
       93  HLA-A*02:05       ELMTLATWV  ELMTLATWV  0  0  0  0  0    ELMTLA
    +TWV X02763_HBe_A_po 0.46686   320.0    2.50
        8  HLA-A*02:05       LIISCTCPT  LIISCTCPT  0  0  0  0  0    LIISCT
    +CPT X02763_HBe_A_po 0.45053   381.9    2.50
      117  HLA-A*11:01       IISCTCPTV  YVNVNMGLK  0  0  0  0  0    YVNVNM
    +GLK AB219428_HBe_B_ 0.62268    59.3    0.40 <= SB

UPDATE: If you want to maintain the headers, try this:

    perl -ane "next unless not $seen{$F[1].$F[2]}++ or m/^\s\D/;print" InputFile.txt

"Software interprets lawyers as damage, and routes around them" - Larry Wall
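For readers new to the switches: `-n` wraps the code in a loop over the input lines and `-a` splits each line on whitespace into `@F`. The first one-liner can be spelled out roughly as below. This is a sketch, not NetWallah's code: the sub name `filter_first_seen`, the in-memory sample, and the tab-joined key are assumptions, and unlike the one-liner it always keeps rule and blank lines. One caution about the UPDATE, assuming the rows are indented with two or more spaces as in the sample: `\D` also matches a space, so `m/^\s\D/` would match those data rows as well as the headers. The sketch tests the field count instead.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the one-liner; $fh is any readable filehandle.
sub filter_first_seen {
    my ($fh) = @_;
    my (%seen, @kept);
    while (my $line = <$fh>) {          # the loop that -n supplies
        my @F = split ' ', $line;       # the split that -a supplies
        if (@F >= 3) {
            # Skip every repeat of the (column 2, column 3) pair.
            next if $seen{"$F[1]\t$F[2]"}++;
        }
        push @kept, $line;
    }
    return @kept;
}

my $text = <<'END';
  Pos          HLA         Peptide
  117  HLA-A*11:01       YVNVNMGLK
  Pos          HLA         Peptide
  117  HLA-A*11:01       YVNVNMGLK
   47  HLA-A*02:05       YVNVNMGLK
END

# An in-memory filehandle stands in for InputFile.txt.
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
print filter_first_seen($fh);
close $fh;
```

Because the header row has three or more fields, its repeats are dropped just like repeated data rows, which matches the single header seen in the output above.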
Re: Removing partially duplicated lines from a file by Mandrake (Chaplain) on Jul 27, 2016 at 09:19 UTC
If I understood the question, something like the below should work.

    #!/bin/perl -w
    use strict;

    open TMP, "input.txt" or die ("could not open the input file \n");
    my (%hash, @columns);
    for (<TMP>) {
        chomp;
        @columns = split;
        next unless ($columns[1] && $columns[2]);
        if (not exists $hash{$columns[1].$columns[2]}) {
            $hash{$columns[1].$columns[2]} = 1;
            print $_."\n";
        }
    }
    close TMP;
Re: Removing partially duplicated lines from a file by Anonymous Monk on Jul 26, 2016 at 15:53 UTC
"I am a bit flummoxed, having tried and failed with regex!"

What did you try?
Re^2: Removing partially duplicated lines from a file by Sandy_Bio_Perl (Beadle) on Jul 26, 2016 at 16:26 UTC
I am still working on it.... Trying to adapt some code I used earlier. It doesn't vaguely work!

    sub removeDuplicatesFromOutputText {
        my $origfile = $_[0];  # eg. HLA-A_0_HBe_for_8_sids.txt;
        my %hTmp;
        my $outfile;
        my $tempout;
        open (IN, $origfile);
        while (my $line = <IN>) {
            if ($line =~ /^\s+\d+/){
                next if $line =~ m/^\s*$/;
            }
            $line=~s/^\s+//;
            $line=~s/\s+$//;
            $tempout = qq{$line\n} unless ($hTmp{$line}++);
            $outfile .= $tempout;
        }
        return $outfile;
    }
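As far as I can tell, two things keep this sub from working. The hash is keyed on the whole line, so rows that differ only in Pos or Identity never count as duplicates. And `$tempout` keeps its previous value when a line has been seen before, so the previous line is appended a second time. A hedged sketch of a repaired version follows; the snake_case name, the tab-joined key, and the "first field is a number" test for data rows are my own assumptions.

```perl
use strict;
use warnings;

# Sketch only: returns the de-duplicated text as one string.
sub remove_duplicates_from_output_text {
    my ($origfile) = @_;
    open my $in, '<', $origfile or die "Cannot open $origfile: $!";
    my %seen;
    my $output = '';
    while (my $line = <$in>) {
        chomp $line;
        my @columns = split ' ', $line;
        # Only data rows (leading position number) take part in de-duplication.
        if (@columns >= 3 and $columns[0] =~ /^\d+$/) {
            next if $seen{ join "\t", @columns[1, 2] }++;
        }
        $output .= "$line\n";
    }
    close $in;
    return $output;
}

# Usage: perl dedupe.pl HLA-A_0_HBe_for_8_sids.txt
print remove_duplicates_from_output_text($ARGV[0]) if @ARGV;
```

Keying on the number in the first field leaves header, rule, and blank lines untouched, so each block keeps its own header.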