Hello Monks. I am trying to remove all lines where parts of the line are duplicated. In the file below, I would like to remove all 'duplicate' lines (leaving the first one) where column 2 (HLA) e.g. HLA-A*11:01 and column 3 (Peptide) e.g. YVNVNMGLK are the same. I would like to leave the other lines intact. I am a bit flummoxed, having tried and failed with regex!

---------------------------------------------------------------------- +------------- Pos HLA Peptide Core Of Gp Gl Ip Il Ic +ore Identity Score Aff(nM) %Rank BindLevel ---------------------------------------------------------------------- +------------- 117 HLA-A*11:01 YVNVNMGLK YVNVNMGLK 0 0 0 0 0 YVNVNM +GLK GQ924620_HBe_C_ 0.62268 59.3 0.40 <= SB 28 HLA-A*11:01 WGMDIDPYK WGMDIDPYK 0 0 0 0 0 WGMDID +PYK GQ924620_HBe_C_ 0.44617 400.4 1.60 <= WB 133 HLA-A*11:01 HISCLTFGR HISCLTFGR 0 0 0 0 0 HISCLT +FGR GQ924620_HBe_C_ 0.43660 444.0 1.70 <= WB ---------------------------------------------------------------------- +------------- Pos HLA Peptide Core Of Gp Gl Ip Il Ic +ore Identity Score Aff(nM) %Rank BindLevel ---------------------------------------------------------------------- +------------- 47 HLA-A*02:05 YVNVNMGLK FLPSDFFPS 0 0 0 0 0 FLPSDF +FPS X02763_HBe_A_po 0.77090 11.9 0.08 <= SB 40 HLA-A*02:05 ATVELLSFL ATVELLSFL 0 0 0 0 0 ATVELL +SFL X02763_HBe_A_po 0.75279 14.5 0.10 <= SB 1 HLA-A*02:05 MQLFHLCLI MQLFHLCLI 0 0 0 0 0 MQLFHL +CLI X02763_HBe_A_po 0.66669 36.8 0.30 <= SB 9 HLA-A*02:05 IISCTCPTV IISCTCPTV 0 0 0 0 0 IISCTC +PTV X02763_HBe_A_po 0.52206 176.1 1.40 <= WB 147 HLA-A*02:05 YLVSFGVWI YLVSFGVWI 0 0 0 0 0 YLVSFG +VWI X02763_HBe_A_po 0.51724 185.5 1.40 <= WB 55 HLA-A*02:05 SVRDLLDTA SVRDLLDTA 0 0 0 0 0 SVRDLL +DTA X02763_HBe_A_po 0.49966 224.4 1.70 <= WB 114 HLA-A*02:05 VVNYVNTNV VVNYVNTNV 0 0 0 0 0 VVNYVN +TNV X02763_HBe_A_po 0.48729 256.6 1.80 <= WB 93 HLA-A*02:05 ELMTLATWV ELMTLATWV 0 0 0 0 0 ELMTLA +TWV X02763_HBe_A_po 0.46686 320.0 2.50 8 HLA-A*02:05 LIISCTCPT LIISCTCPT 0 0 0 0 0 LIISCT +CPT X02763_HBe_A_po 0.45053 381.9 2.50 ---------------------------------------------------------------------- +------------- Pos HLA Peptide Core Of Gp Gl Ip Il Ic +ore Identity Score Aff(nM) %Rank BindLevel ---------------------------------------------------------------------- +------------- 117 HLA-A*11:01 IISCTCPTV YVNVNMGLK 0 0 0 0 0 YVNVNM +GLK AB219428_HBe_B_ 0.62268 59.3 0.40 <= SB 28 HLA-A*11:01 WGMDIDPYK WGMDIDPYK 0 0 0 0 0 WGMDID +PYK AB219428_HBe_B_ 0.44617 400.4 1.60 <= WB 133 HLA-A*11:01 HISCLTFGR HISCLTFGR 0 0 0 0 0 HISCLT +FGR AB219428_HBe_B_ 0.43660 444.0 1.70 <= WB

Many thanks for any help you are able to give


In reply to Removing partially duplicated lines from a file by Sandy_Bio_Perl

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.