Hello Monks. I am trying to remove lines from a file where part of the line is duplicated. In the file below, I would like to remove every 'duplicate' line (keeping the first occurrence) where both column 2 (HLA, e.g. HLA-A*11:01) and column 3 (Peptide, e.g. YVNVNMGLK) match an earlier line. All other lines should be left intact. I am a bit flummoxed, having tried and failed with a regex!
-----------------------------------------------------------------------------------
Pos  HLA          Peptide    Core       Of Gp Gl Ip Il  Icore      Identity         Score    Aff(nM)  %Rank  BindLevel
-----------------------------------------------------------------------------------
117  HLA-A*11:01  YVNVNMGLK  YVNVNMGLK  0  0  0  0  0   YVNVNMGLK  GQ924620_HBe_C_  0.62268  59.3     0.40   <= SB
28   HLA-A*11:01  WGMDIDPYK  WGMDIDPYK  0  0  0  0  0   WGMDIDPYK  GQ924620_HBe_C_  0.44617  400.4    1.60   <= WB
133  HLA-A*11:01  HISCLTFGR  HISCLTFGR  0  0  0  0  0   HISCLTFGR  GQ924620_HBe_C_  0.43660  444.0    1.70   <= WB
-----------------------------------------------------------------------------------
Pos  HLA          Peptide    Core       Of Gp Gl Ip Il  Icore      Identity         Score    Aff(nM)  %Rank  BindLevel
-----------------------------------------------------------------------------------
47   HLA-A*02:05  YVNVNMGLK  FLPSDFFPS  0  0  0  0  0   FLPSDFFPS  X02763_HBe_A_po  0.77090  11.9     0.08   <= SB
40   HLA-A*02:05  ATVELLSFL  ATVELLSFL  0  0  0  0  0   ATVELLSFL  X02763_HBe_A_po  0.75279  14.5     0.10   <= SB
1    HLA-A*02:05  MQLFHLCLI  MQLFHLCLI  0  0  0  0  0   MQLFHLCLI  X02763_HBe_A_po  0.66669  36.8     0.30   <= SB
9    HLA-A*02:05  IISCTCPTV  IISCTCPTV  0  0  0  0  0   IISCTCPTV  X02763_HBe_A_po  0.52206  176.1    1.40   <= WB
147  HLA-A*02:05  YLVSFGVWI  YLVSFGVWI  0  0  0  0  0   YLVSFGVWI  X02763_HBe_A_po  0.51724  185.5    1.40   <= WB
55   HLA-A*02:05  SVRDLLDTA  SVRDLLDTA  0  0  0  0  0   SVRDLLDTA  X02763_HBe_A_po  0.49966  224.4    1.70   <= WB
114  HLA-A*02:05  VVNYVNTNV  VVNYVNTNV  0  0  0  0  0   VVNYVNTNV  X02763_HBe_A_po  0.48729  256.6    1.80   <= WB
93   HLA-A*02:05  ELMTLATWV  ELMTLATWV  0  0  0  0  0   ELMTLATWV  X02763_HBe_A_po  0.46686  320.0    2.50
8    HLA-A*02:05  LIISCTCPT  LIISCTCPT  0  0  0  0  0   LIISCTCPT  X02763_HBe_A_po  0.45053  381.9    2.50
-----------------------------------------------------------------------------------
Pos  HLA          Peptide    Core       Of Gp Gl Ip Il  Icore      Identity         Score    Aff(nM)  %Rank  BindLevel
-----------------------------------------------------------------------------------
117  HLA-A*11:01  IISCTCPTV  YVNVNMGLK  0  0  0  0  0   YVNVNMGLK  AB219428_HBe_B_  0.62268  59.3     0.40   <= SB
28   HLA-A*11:01  WGMDIDPYK  WGMDIDPYK  0  0  0  0  0   WGMDIDPYK  AB219428_HBe_B_  0.44617  400.4    1.60   <= WB
133  HLA-A*11:01  HISCLTFGR  HISCLTFGR  0  0  0  0  0   HISCLTFGR  AB219428_HBe_B_  0.43660  444.0    1.70   <= WB
Many thanks for any help you are able to give.
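For what it is worth, one common way to approach this kind of task is a "seen" hash keyed on the two columns rather than a single regex over the whole line. The following is only a minimal sketch, assuming the fields are whitespace-separated and that data rows (unlike the header and ruler lines) begin with a numeric Pos column. The name `dedupe_lines` and the shortened sample rows are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines to keep. Header and ruler lines pass through unchanged;
# a data line is dropped when its HLA + Peptide pair was already seen.
sub dedupe_lines {
    my @lines = @_;
    my %seen;
    my @kept;
    for my $line (@lines) {
        # Assumption: a data row starts with a numeric Pos column, followed
        # by whitespace-separated HLA (column 2) and Peptide (column 3).
        if ( $line =~ /^\s*\d+\s+(\S+)\s+(\S+)/ ) {
            # Post-increment: false the first time a pair is met, true after.
            next if $seen{"$1\t$2"}++;
        }
        push @kept, $line;
    }
    return @kept;
}

# Hypothetical sample: a few rows from the data, trailing columns omitted.
my @sample = (
    "Pos HLA Peptide Core",
    "117 HLA-A*11:01 YVNVNMGLK YVNVNMGLK",
    "28 HLA-A*11:01 WGMDIDPYK WGMDIDPYK",
    "47 HLA-A*02:05 YVNVNMGLK FLPSDFFPS",
    "28 HLA-A*11:01 WGMDIDPYK WGMDIDPYK",
);

print "$_\n" for dedupe_lines(@sample);
```

With this sample, the second `28 HLA-A*11:01 WGMDIDPYK` row is dropped, while the `47 HLA-A*02:05 YVNVNMGLK` row is kept because only its peptide, not its HLA, matches an earlier row. For a real file the same loop could read from a filehandle instead of an array; if the header repeats should also be collapsed, that would need a separate rule.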
In reply to Removing partially duplicated lines from a file by Sandy_Bio_Perl