Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all fellow wisdom seekers,

Can you help me?

My file contains the same pattern many times. I wish to remove all but one occurrence of each pattern in my output.
File example:

LOC_Os01g01010.1 : PS00022 EGF_1 EGF-like domain signature 1. 20 - 31 CtCtaAgaGAaC + L=(-1) 392 - 403 CtCccTtcGTtC + L=(-1) 740 - 751 CaCtaTtcGAgC + L=(-1) 905 - 916 CgCtgTtgGAtC + L=(-1) 1034 - 1045 CcCcgGtgGTgC + L=(-1) 2169 - 2180 CaCcgGgtGAaC + L=(-1) LOC_Os01g01010.1 : PS00099 THIOLASE_3 Thiolases active site. 26 - 39 GAGAACGAgAgAaG + L=(-1) 221 - 234 GACTACCGaAtAaG + L=(-1) 2732 - 2745 GAAAACAAgAgAcG + L=(-1) LOC_Os01g01010.1 : PS00197 2FE2S_FER_1 2Fe-2S ferredoxin-type iron-sul +fur binding region signature. 98 - 106 CGAGACGAC + L=(-1) 480 - 488 CAAGACAAC + L=(-1) 771 - 779 CTTGGCTGC + L=(-1) 976 - 984 CAAGTCAAC + L=(-1) 2314 - 2322 CAAGACATC + L=(-1) 2390 - 2398 CGTAGCAGC + L=(-1) LOC_Os01g01010.1 : PS00227 TUBULIN Tubulin subunits alpha, beta, and g +amma signature. 890 - 896 AGGTGAG + L=(-1) LOC_Os01g01010.1 : PS01177 ANAPHYLATOXIN_1 Anaphylatoxin domain signat +ure. 226 - 257 CCgaAtaagagaaGCAggc......AggCagacaaaCC + L=(-1) 264 - 296 CCaaGgagtcctcGCTgagg.....AagCtttggatCC + L=(-1) 362 - 396 CCtaGgtcgcat.GCAtcatcaga.TttCaatctc.CC + L=(-1) LOC_Os01g01010.1 : PS01185 CTCK_1 C-terminal cystine knot signature. 536 - 572 CCgtgcgggcggcgcCatGgccaacctccagCgCgg..C + L=(-1) LOC_Os01g01010.1 : PS01208 VWFC_1 VWFC domain signature. 557 - 614 CaacCTCcagcgcggcgttggCtcc.CtcgtccgtgaCattggcgacccctg..C +CtcaaC L=(-1) 578 - 623 CtccCTCgtccgtgacattggCgaccCctgc......Ctcaacccat.......C +Ccc..C L=(-1) LOC_Os01g01010.1 : PS50842 EXPANSIN_EG45 Expansin, family-45 endogluca +nase-like domain profile. 1624 - 1711 GGACACTGcaccgAATTGTGGTTGATGTGGTTAGAACGGATAGTCAtcttgATTT +CTATg L=-1 LOC_Os01g01010.2 : PS00022 EGF_1 EGF-like domain signature 1. 298 - 309 CtCccTtcGTtC + L=(-1) 646 - 657 CaCtaTtcGAgC + L=(-1) 811 - 822 CgCtgTtgGAtC + L=(-1) 940 - 951 CcCcgGtgGTgC + L=(-1) LOC_Os01g01010.2 : PS00099 THIOLASE_3 Thiolases active site. 
140 - 153 GACTACCGaAtAaG + L=(-1) 2188 - 2201 GAAAACAAgAgAcG + L=(-1) LOC_Os01g01010.2 : PS00197 2FE2S_FER_1 2Fe-2S ferredoxin-type iron-sul +fur binding region signature. 17 - 25 CGAGACGAC + L=(-1) 386 - 394 CAAGACAAC + L=(-1) 677 - 685 CTTGGCTGC + L=(-1) 882 - 890 CAAGTCAAC + L=(-1) LOC_Os01g01010.2 : PS00227 TUBULIN Tubulin subunits alpha, beta, and g +amma signature. 796 - 802 AGGTGAG + L=(-1) LOC_Os01g01010.2 : PS01177 ANAPHYLATOXIN_1 Anaphylatoxin domain signat +ure. 145 - 176 CCgaAtaagagaaGCAggc......AggCagacaaaCC + L=(-1) 183 - 215 CCaaGgagtcctcGCTgagg.....AagCtttggatCC + L=(-1) LOC_Os01g01010.2 : PS01185 CTCK_1 C-terminal cystine knot signature. 442 - 478 CCgtgcgggcggcgcCatGgccaacctccagCgCgg..C + L=(-1) LOC_Os01g01010.2 : PS01208 VWFC_1 VWFC domain signature. 463 - 520 CaacCTCcagcgcggcgttggCtcc.CtcgtccgtgaCattggcgacccctg..C +CtcaaC L=(-1) 484 - 529 CtccCTCgtccgtgacattggCgaccCctgc......Ctcaacccat.......C +Ccc..C L=(-1)

This is the script I am using, but it is not working as expected. Another script I tried just gives me my input file back as output.

#!/usr/local/bin/perl
use strict;
use warnings;

open (FILE, "<:utf8", "outputps_scan_chr1_.out");
my @lines = <FILE>;
my @uniq = ();
my @waste = ();
my %seen = ();
foreach my $line (@lines) {
    my $pat = $line =~ m/^LOC_Os0[1-7]g[0-9]*.[0-9]\s/;
    if (!$seen{$pat}++) {
        push (@uniq, $line);
        my $new_uniq++;
    }
    else {
        push (@wastee, $line);
    }
    open (MYFILE, ">:utf8", "data.txt");
    print MYFILE @uniq;
    open (WASTE, ">:utf8", "waste.txt");
    print WASTE @waste;
}
close (MYFILE);
close (WASTE);
close (FILE);
Please help.
Thank you all for your valuable time.

Replies are listed 'Best First'.
Re: pattern search then remove duplicacy
by hexcoder (Curate) on Jun 21, 2014 at 10:03 UTC
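    The problem is that you use the matching operator in scalar context instead of list context. In scalar context it returns only whether the match succeeded (1, or an empty string on failure), not the matched text.
    my $pat = $line =~ m/^LOC_Os0[1-7]g[0-9]*.[0-9]\s/;
    should be
    my ($pat) = $line =~ m/^(LOC_Os0[1-7]g[0-9]*.[0-9])\s/;
    Note the added capture parentheses: in list context a match without any capture group still returns just (1) on success.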
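    To make the context difference concrete, here is a small sketch. The sample line is taken from the posted data; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $line = "LOC_Os01g01010.1 : PS00022 EGF_1 EGF-like domain signature 1.";

# Scalar context: the result is only a success flag (1 here), not the ID.
my $flag = $line =~ m/^LOC_Os0[1-7]g[0-9]*.[0-9]\s/;

# List context without a capture group: still just (1) on success.
my ($no_capture) = $line =~ m/^LOC_Os0[1-7]g[0-9]*.[0-9]\s/;

# List context with a capture group: the captured text is returned.
my ($id) = $line =~ m/^(LOC_Os0[1-7]g[0-9]*.[0-9])\s/;

print "flag=$flag no_capture=$no_capture id=$id\n";
# prints: flag=1 no_capture=1 id=LOC_Os01g01010.1
```

    With the flag as the hash key, every matching line lands on the same key "1", so only the first matching line would ever count as unseen.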
    I would write the script more like this in order to have error handling and avoid rewriting the whole file for each new entry.
    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use autodie;

    open (FILE, "<:utf8", "outputps_scan_chr1_.out");
    my %seen = ();
    open (MYFILE, ">:utf8", "data.txt");
    open (WASTE, ">:utf8", "waste.txt");
    while (defined(my $line = <FILE>)) {
        my $pat;
        next if ($line !~ m/^(LOC_Os0[1-7]g[0-9]*.[0-9])\s/);
        $pat = $1;
        if (!$seen{$pat}++) {
            print MYFILE $line;
        }
        else {
            print WASTE $line;
        }
    }
    close (MYFILE);
    close (WASTE);
    close (FILE);
    Update: I forgot to mention that I also changed the pattern matching.

    First I check whether the line matches at all, and skip it if it doesn't.
    Instead of matching a second time to extract the ID, I put a capture group (...) in the pattern. The matched string is then available in $1, which I assign to $pat.
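    The same capture-and-%seen idea can be tried without any files. This is only a sketch: split_by_first_id is a hypothetical helper name, the input lines are shortened versions of the posted data, and the period in the pattern is escaped here so that it matches only a literal dot.

```perl
use strict;
use warnings;

# Hypothetical helper: split lines into first occurrences and repeats,
# keyed on the captured LOC identifier.
sub split_by_first_id {
    my @lines = @_;
    my (%seen, @uniq, @waste);
    for my $line (@lines) {
        # Skip lines that do not start with an identifier at all.
        next unless $line =~ m/^(LOC_Os0[1-7]g[0-9]*\.[0-9])\s/;
        my $id = $1;
        if (!$seen{$id}++) { push @uniq, $line }
        else               { push @waste, $line }
    }
    return (\@uniq, \@waste);
}

my ($uniq, $waste) = split_by_first_id(
    "LOC_Os01g01010.1 : PS00022 EGF_1\n",
    "20 - 31 CtCtaAgaGAaC\n",
    "LOC_Os01g01010.1 : PS00099 THIOLASE_3\n",
    "LOC_Os01g01010.2 : PS00022 EGF_1\n",
);
print "kept: ", scalar(@$uniq), ", dropped: ", scalar(@$waste), "\n";
# prints: kept: 2, dropped: 1
```

    Note that lines without an identifier (the hit coordinates) go to neither list, just as the `next` in the script above skips them.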

Re: pattern search then remove duplicacy
by RichardK (Parson) on Jun 21, 2014 at 10:41 UTC

    BTW, a single period in a regex matches any single character (except a newline, by default), so if you want to match only a literal period you need to escape it as '\.', or you might not get the results you're expecting.
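    A quick way to see the difference, as a sketch: the second test line is made up (an 'x' where the period should be) and does not come from the posted file.

```perl
use strict;
use warnings;

my $good = "LOC_Os01g01010.1 : PS00022";
my $odd  = "LOC_Os01g01010x1 : PS00022";   # made-up line with 'x' instead of '.'

my $loose = qr/^LOC_Os0[1-7]g[0-9]*.[0-9]\s/;    # '.' matches any character
my $exact = qr/^LOC_Os0[1-7]g[0-9]*\.[0-9]\s/;   # '\.' matches only a period

for my $line ($good, $odd) {
    printf "%-28s loose=%s exact=%s\n", $line,
        ($line =~ $loose ? "yes" : "no"),
        ($line =~ $exact ? "yes" : "no");
}
```

    The unescaped pattern accepts both lines, while the escaped one accepts only the line with a real period.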

Re: pattern search then remove duplicacy
by AppleFritter (Vicar) on Jun 21, 2014 at 09:46 UTC

    Well, it's not surprising your script isn't working, seeing as how you've got a typo here:

    push (@wastee, $line);

    In fact, using strict, this won't even compile.
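    If you want to see that error for yourself without breaking a real script, a string eval can compile a small snippet and hand back the message. This is only an illustrative sketch; the snippet inside the eval is made up to mimic the typo.

```perl
use strict;
use warnings;

# Compile a tiny snippet at run time so the compile error can be inspected
# instead of aborting this script.
my $ok = eval q{
    use strict;
    my @waste;
    push(@wastee, "line");   # misspelled on purpose
    1;
};

print "compiled: ", ($ok ? "yes" : "no"), "\n";
print "error: $@" if !$ok;
```

    The reported error should be the usual strict complaint that the global symbol "@wastee" requires an explicit package name.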

    That aside, could you post an example of what the output (data.txt, anyway) of your script (with the typo fixed) is supposed to be given the example file you posted?

Re: pattern search then remove duplicacy
by Anonymous Monk on Jun 21, 2014 at 16:51 UTC
    I have come to know that the FAQ is not very supportive of thank-yous. Even so, I thank you all for your help and for giving my problem your precious time. Feeling very blessed and thankful. God bless you all.