pattern search then remove duplicacy

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow wisdom seekers. Can you help me?

My file contains a duplicated pattern many times. I wish to remove all but one occurrence of the pattern in my output.

Example file:

LOC_Os01g01010.1 : PS00022 EGF_1 EGF-like domain signature 1.
      20 - 31  CtCtaAgaGAaC                                           
+      L=(-1)
    392 - 403  CtCccTtcGTtC                                           
+      L=(-1)
    740 - 751  CaCtaTtcGAgC                                           
+      L=(-1)
    905 - 916  CgCtgTtgGAtC                                           
+      L=(-1)
  1034 - 1045  CcCcgGtgGTgC                                           
+      L=(-1)
  2169 - 2180  CaCcgGgtGAaC                                           
+      L=(-1)
LOC_Os01g01010.1 : PS00099 THIOLASE_3 Thiolases active site.
      26 - 39  GAGAACGAgAgAaG                                         
+      L=(-1)
    221 - 234  GACTACCGaAtAaG                                         
+      L=(-1)
  2732 - 2745  GAAAACAAgAgAcG                                         
+      L=(-1)
LOC_Os01g01010.1 : PS00197 2FE2S_FER_1 2Fe-2S ferredoxin-type iron-sul
+fur binding region signature.
     98 - 106  CGAGACGAC                                              
+      L=(-1)
    480 - 488  CAAGACAAC                                              
+      L=(-1)
    771 - 779  CTTGGCTGC                                              
+      L=(-1)
    976 - 984  CAAGTCAAC                                              
+      L=(-1)
  2314 - 2322  CAAGACATC                                              
+      L=(-1)
  2390 - 2398  CGTAGCAGC                                              
+      L=(-1)
LOC_Os01g01010.1 : PS00227 TUBULIN Tubulin subunits alpha, beta, and g
+amma signature.
    890 - 896  AGGTGAG                                                
+      L=(-1)
LOC_Os01g01010.1 : PS01177 ANAPHYLATOXIN_1 Anaphylatoxin domain signat
+ure.
    226 - 257  CCgaAtaagagaaGCAggc......AggCagacaaaCC                 
+      L=(-1)
    264 - 296  CCaaGgagtcctcGCTgagg.....AagCtttggatCC                 
+      L=(-1)
    362 - 396  CCtaGgtcgcat.GCAtcatcaga.TttCaatctc.CC                 
+      L=(-1)
LOC_Os01g01010.1 : PS01185 CTCK_1 C-terminal cystine knot signature.
    536 - 572  CCgtgcgggcggcgcCatGgccaacctccagCgCgg..C                
+      L=(-1)
LOC_Os01g01010.1 : PS01208 VWFC_1 VWFC domain signature.
    557 - 614  CaacCTCcagcgcggcgttggCtcc.CtcgtccgtgaCattggcgacccctg..C
+CtcaaC L=(-1)
    578 - 623  CtccCTCgtccgtgacattggCgaccCctgc......Ctcaacccat.......C
+Ccc..C L=(-1)
LOC_Os01g01010.1 : PS50842 EXPANSIN_EG45 Expansin, family-45 endogluca
+nase-like domain profile.
  1624 - 1711  GGACACTGcaccgAATTGTGGTTGATGTGGTTAGAACGGATAGTCAtcttgATTT
+CTATg L=-1
LOC_Os01g01010.2 : PS00022 EGF_1 EGF-like domain signature 1.
    298 - 309  CtCccTtcGTtC                                           
+      L=(-1)
    646 - 657  CaCtaTtcGAgC                                           
+      L=(-1)
    811 - 822  CgCtgTtgGAtC                                           
+      L=(-1)
    940 - 951  CcCcgGtgGTgC                                           
+      L=(-1)
LOC_Os01g01010.2 : PS00099 THIOLASE_3 Thiolases active site.
    140 - 153  GACTACCGaAtAaG                                         
+      L=(-1)
  2188 - 2201  GAAAACAAgAgAcG                                         
+      L=(-1)
LOC_Os01g01010.2 : PS00197 2FE2S_FER_1 2Fe-2S ferredoxin-type iron-sul
+fur binding region signature.
      17 - 25  CGAGACGAC                                              
+      L=(-1)
    386 - 394  CAAGACAAC                                              
+      L=(-1)
    677 - 685  CTTGGCTGC                                              
+      L=(-1)
    882 - 890  CAAGTCAAC                                              
+      L=(-1)
LOC_Os01g01010.2 : PS00227 TUBULIN Tubulin subunits alpha, beta, and g
+amma signature.
    796 - 802  AGGTGAG                                                
+      L=(-1)
LOC_Os01g01010.2 : PS01177 ANAPHYLATOXIN_1 Anaphylatoxin domain signat
+ure.
    145 - 176  CCgaAtaagagaaGCAggc......AggCagacaaaCC                 
+      L=(-1)
    183 - 215  CCaaGgagtcctcGCTgagg.....AagCtttggatCC                 
+      L=(-1)
LOC_Os01g01010.2 : PS01185 CTCK_1 C-terminal cystine knot signature.
    442 - 478  CCgtgcgggcggcgcCatGgccaacctccagCgCgg..C                
+      L=(-1)
LOC_Os01g01010.2 : PS01208 VWFC_1 VWFC domain signature.
    463 - 520  CaacCTCcagcgcggcgttggCtcc.CtcgtccgtgaCattggcgacccctg..C
+CtcaaC L=(-1)
    484 - 529  CtccCTCgtccgtgacattggCgaccCctgc......Ctcaacccat.......C
+Ccc..C L=(-1)

This is the script I am using, but it is not working as expected. Another script I tried just gives me my input file back as output.

#!/usr/local/bin/perl
use strict;
use warnings;

open (FILE,  "<:utf8", "outputps_scan_chr1_.out");
my @lines = <FILE>;
my @uniq = ();
my @waste = ();
my %seen = ();

foreach my $line (@lines) 
{
    my $pat = $line =~ m/^LOC_Os0[1-7]g[0-9]*.[0-9]\s/;
    if (!$seen{$pat}++)
    {
        push (@uniq, $line);
           my $new_uniq++;
        }
    else
    { 
        push (@wastee, $line);
        }
open (MYFILE, ">:utf8", "data.txt");
print MYFILE @uniq;
open (WASTE, ">:utf8", "waste.txt");
print WASTE @waste;
}
close (MYFILE);
close (WASTE);
close (FILE);

Please help.
Thank you all for your valuable time.


Replies are listed 'Best First'.
Re: pattern search then remove duplicacy by hexcoder (Curate) on Jun 21, 2014 at 10:03 UTC
The problem is that you use the match operator in scalar context instead of list context. In scalar context it only tells you whether the match succeeded (1 or the empty string), not what was matched.

`my $pat = $line =~ m/^LOC_Os0[1-7]g[0-9]*.[0-9]\s/;`

should be

`my ($pat) = $line =~ m/^(LOC_Os0[1-7]g[0-9]*.[0-9])\s/;`

Note the capture group: in list context the match returns the captured substrings, so without the parentheses `$pat` would still just be 1.

I would write the script more like this, in order to have error handling and to avoid rewriting the whole output file for each new entry.

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use autodie;

    open (FILE, "<:utf8", "outputps_scan_chr1_.out");
    my %seen = ();
    open (MYFILE, ">:utf8", "data.txt");
    open (WASTE, ">:utf8", "waste.txt");

    while (defined(my $line = <FILE>)) {
        my $pat;
        next if ($line !~ m/^(LOC_Os0[1-7]g[0-9]*.[0-9])\s/);
        $pat = $1;
        if (!$seen{$pat}++) {
            print MYFILE $line;
        } else {
            print WASTE $line;
        }
    }
    close (MYFILE);
    close (WASTE);
    close (FILE);

Update: I forgot to mention that I also changed the pattern matching. First I check whether there was a match and skip the line if there wasn't one. Instead of matching a second time to get the pattern, I used a capture `(...)` in the pattern. Then I can retrieve the matched string from `$1` and assign it to `$pat`.
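To make the scalar versus list context point concrete, here is a minimal sketch using one header line from the posted input. The variable names (`$flag`, `$no_capture`, `$id`) are only illustrative, and the dot in the pattern is escaped here, which is an assumption about what the original poster intended rather than what the posted pattern says.

```perl
use strict;
use warnings;

# One header line copied from the example input in the question.
my $line = "LOC_Os01g01010.1 : PS00022 EGF_1 EGF-like domain signature 1.\n";

# Scalar context: the result only says whether the match succeeded.
my $flag = $line =~ m/^LOC_Os0[1-7]g[0-9]*\.[0-9]\s/;

# List context without a capture group: a successful match returns the list (1).
my ($no_capture) = $line =~ m/^LOC_Os0[1-7]g[0-9]*\.[0-9]\s/;

# List context with a capture group: the captured text is returned.
my ($id) = $line =~ m/^(LOC_Os0[1-7]g[0-9]*\.[0-9])\s/;

print "scalar context:           $flag\n";
print "list context, no capture: $no_capture\n";
print "list context, capture:    $id\n";
```

With the scalar-context version every matching line produces the same hash key, so `%seen` cannot tell one identifier from another; the captured identifier is what makes it usable as a key.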
Re: pattern search then remove duplicacy by RichardK (Parson) on Jun 21, 2014 at 10:41 UTC
BTW, a single period in a regex matches any single character, so if you want to match just a period you'll need to escape it, '\.', or you might not get the results you're expecting.
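A small sketch of the difference, assuming the identifiers are meant to look like `LOC_Os01g01010.1`. Only the first sample string comes from the posted data; the other two are made up for illustration, as are the names `$loose`, `$exact`, and `@samples`.

```perl
use strict;
use warnings;

# The pattern as posted (unescaped dot) and a variant with the dot escaped.
my $loose = qr/^LOC_Os0[1-7]g[0-9]*.[0-9]\s/;
my $exact = qr/^LOC_Os0[1-7]g[0-9]*\.[0-9]\s/;

my @samples = (
    "LOC_Os01g01010.1 ",   # real identifier from the example file
    "LOC_Os01g01010X1 ",   # hypothetical: 'X' where the dot should be
    "LOC_Os01g010101 ",    # hypothetical: no separator at all
);

for my $s (@samples) {
    printf "%-20s loose=%-3s exact=%s\n",
        "'$s'",
        ($s =~ $loose ? "yes" : "no"),
        ($s =~ $exact ? "yes" : "no");
}
```

The unescaped pattern accepts all three strings, because `.` can stand in for `X` or for a digit; the escaped one accepts only the first.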
Re: pattern search then remove duplicacy by AppleFritter (Vicar) on Jun 21, 2014 at 09:46 UTC
Well, it's not surprising your script isn't working, seeing as how you've got a typo here: `push (@wastee, $line);` In fact, under `use strict`, this won't even compile. That aside, could you post an example of what the output (`data.txt`, anyway) of your script (with the typo fixed) is supposed to be, given the example file you posted?
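The compile failure can be reproduced without touching any files. In this sketch the offending statement is kept in a string and compiled with a string `eval`, purely so that the demo itself still compiles; the names `$snippet`, `$ok`, and `$error` are illustrative.

```perl
use strict;
use warnings;

# The snippet declares @waste but pushes onto the misspelled @wastee,
# mirroring the typo in the posted script.
my $snippet = <<'END_SNIPPET';
use strict;
my @waste;
push(@wastee, "some line");
1;
END_SNIPPET

# String eval compiles the snippet at run time; on a compile error it
# returns undef and leaves the message in $@.
my $ok    = eval $snippet;
my $error = $@;

if ($ok) {
    print "snippet compiled\n";
} else {
    print "snippet rejected: $error";
}
```

Running `perl -c` on the script file is the more usual way to get the same check.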
Re: pattern search then remove duplicacy by Anonymous Monk on Jun 21, 2014 at 16:51 UTC
I have learned that the FAQ is not very supportive of thank-yous. Even so, I thank you all for your help and for giving my problem your precious time. Feeling very blessed and thankful. God bless you all.