shabird has asked for the wisdom of the Perl Monks concerning the following question:

I have a file given below:

AAF88103.1 zinc finger protein 226 [Homo sapiens]MNMFKEAVTFKDVAVAFTEEE +LGLLGPAXRKLYRD VMVENFRNLLSVGHPPFKQDVSPIERNEQLWIMTTATRRQGNLGEKNQSKLITVQDRESEEELSCWQIWQ +QIANDLT RCQDSMINNSQCHKQGDFPYQVGTELSIQISEDENYIVNKADGPNNTGNPEFPILRTQDSWRKTFLTESQ +RLNRDQQI SIKNKLCQCKKGVDPIGWISHHDGHRVHKSEKSYRPNDYEKDNMKILTFDHNSMIHTGQKSYQCNECKKP +FSDLSSF DLHQQLQSGEKSLTCVERGKGFCYSPVLPVHQKVHVGEKLKCDECGKEFSQGAHLQTHQKVHVIEKPYKC +KQCGKGF SRRSALNVHCKVHTAEKPYNCEECGRAFSQASHLQDHQRLHTGEKPFKCDACGKSFSRNSHLQSHQRVHT +GEKPYK CEECGKGFICSSNLYIHQRVHTGEKPYKCEECGKGFSRPSSLQAHQGVHTGEKSYICTVCGKGFTLSSNL +QAHQRVHTG EKPYKCNECGKSFRRNSHYQVHLVVHTGEKPYKCEICGKGFSQSSYLQIHQKAHSIEKPFKCEECGQGFN +QSSRLQIHQL IHTGEKPYKCEECGKGFSRRADLKIHCRIHTGEKPYNCEECGKVFRQASNLLAHQRVHSGEKPFKCEECG +KSFGRSAHLQ AHQKVHTGDKPYKCDECGKGFKWSLNLDMHQRVHTGEKPYKCGECGKYFSQASSLQLHQSVHTGEKPYKC +DVCGKVF SRSSQLQSHQRVHTGEKPYKCEICGKSFSWRSNLTVHHRIHVGDKSYKSNRGGKNIRESTQEKKSIK

I want to look for a pattern i.e: (CDECGKEFSQGAHLQTHQKVH) which should be hardcoded in the program like the code i have written below:

open FILE1, "file,fasta" or die; while (my $line=<FILE1>) { chomp $line; } if ($line =~ /CDECGKEFSQGAHLQTHQKVH/) { print "The protein contains the domain"; }else{ print "The protein doesn't contain the domain"; }

Is this the correct way of doing it?

Replies are listed 'Best First'.
Re: Finding pattern in a file (updated x2)
by AnomalousMonk (Archbishop) on Apr 14, 2020 at 07:08 UTC

    Update 2: As revealed in the crosspost linked here, the true format of the OPed example file seems to be standard FASTA, so the correct approach is along the lines used by choroba here.


    Is this the correct way of doing it?

    No.

    You should always have the statements
        use strict;
        use warnings;
    at the start of your code (see strict and warnings). If you place these statements at the beginning of the OPed code, what happens? Some questions:

    • In the  while (my $line=<FILE1>) { ... } loop, what is the scope of the  $line variable?
    • In the  if ($line =~ /CDECGKEFSQGAHLQTHQKVH/) { ... } conditional, where was  $line ever defined/declared? Is this expression valid under strict?
    Please see Basic debugging checklist.

    Maybe try something like:

    use strict; use warnings; use autodie; use constant START_FASTA_REC => qr{ \A [^\]]+ \] }xms; use constant SUB_SEQUENCE => qr{ CDECGKEFSQGAHLQTHQKVH }xms; # use constant SUB_SEQUENCE => qr{ NOT_PRESENT }xms; # for debug MAIN { my $filename = 'file,fasta'; open my $fh_fasta, '<', $filename; my $fasta_record; LINE: while (my $line = <$fh_fasta>) { chomp $line; if ($line =~ s{ ${ \START_FASTA_REC }}''xms) { process_fasta_record($fasta_record) if defined $fasta_record and length $fasta_record; $fasta_record = $line; next LINE; } $fasta_record .= $line; } close $fh_fasta; # process final fasta record. process_fasta_record($fasta_record) if defined $fasta_record and length $fasta_record; exit; # normal exit from MAIN block } # end MAIN block die "unexpected exit from MAIN block"; # subroutines ###################################################### sub process_fasta_record { my ($fasta_record, ) = @_; print "'$fasta_record' \n"; # for debug if ($fasta_record =~ SUB_SEQUENCE) { print "The protein contains the domain"; } else { print "The protein doesn't contain the domain"; } }
    Note: This has not been tested with multiple FASTA records, only with the single record example given in the OP.

    Update 1: (Update: Nevermind. See Update 2 above.) I have a slightly simpler version of this script that I've tested (minimally) with a two-record FASTA file. Please let me know if you're interested.


    Give a man a fish:  <%-{-{-{-<

Re: Finding pattern in a file
by swl (Prior) on Apr 14, 2020 at 06:29 UTC

    It's close, although your code does not specify the mode to open the file with, and has the if condition outside the while loop, so won't work.

    It's also better to use lexical file handles.

    Untested modified code is below. I've also done some minor formatting changes, but avoided too many as it can become a matter of aesthetics.

    my $fname = "file,fasta"; open my $fh, '<', $fname or die "Cannot open $fname, $!"; while (my $line = <$fh>) { chomp $line; if ($line =~ /CDECGKEFSQGAHLQTHQKVH/) { print "The protein contains the domain"; } else { print "The protein doesn't contain the domain"; } }

    See also open and perlvar.

Re: Finding pattern in a file
by choroba (Cardinal) on Apr 14, 2020 at 14:36 UTC
    If you are searching for an exact substring, using index might be faster than using a regex.

    Also, your sample input here doesn't show the > character at the beginning of the first line, but the post at StackOveflow does (It also seems you concatenated the first and second line here). You can use it to separate the records as it doesn't appear anywhere else:

    #! /usr/bin/perl use strict; use warnings; use feature qw{ say }; open my $in, '<', shift or die $!; local $/ = '>'; while (my $record = <$in>) { next if '>' eq $record; # Skip the "0th empty record". my ($header, $data) = split /\n/, $record, 2; $data =~ s/\n//g; say "Found in $header" if -1 != index $data, 'CDECGKEFSQGAHLQTHQKV +H'; }

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Finding pattern in a file
by leszekdubiel (Scribe) on Apr 14, 2020 at 18:15 UTC
    #!/usr/bin/perl -CSDA use utf8; use Modern::Perl; no warnings qw{uninitialized}; use Data::Dumper; use Path::Tiny; print /CDECGKEFSQGAHLQTHQKVH/ ? "The protein contains the domain\n" : "The protein doesn't contain the domain\n" for path('file,fasta')->lines_utf8();

      But IIUC, the pattern to be searched for may be broken across multiple lines. How would your code handle, e.g., the record:

      >AAF88103.1 zinc finger protein 226 [Homo sapiens] AAAAAAAAAAAAAACDECGKEFSQ GAHLQTHQKVHZZZZZZZZZZZ
      (assuming we're now dealing with kosher FASTA files)?.


      Give a man a fish:  <%-{-{-{-<

        Here's one way (snicker).
        It shows the location of each occurrence, even if they are overlapping.

        #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11115501 use warnings; $_ = do { local $/; <DATA> }; # or however you want to read the file my $pattern = 'CDECGKEFSQGAHLQTHQKVH' =~ s/\B/\n?/gr; print lc($`) . $1 . lc(substr $', length $1) . "\n" while /(?=($patter +n))/g; __DATA__ >AAF88103.1 zinc finger protein 226 [Homo sapiens] MNMFKEAVTFKDVAVAFTEEELGLLGPAXRKLYRDVMVENFRNLLSVGHPPFKQDVSPIERNEQLWIMTT ATRRQGNLGEKNQSKLITVQDRESEEELSCWQIWQQIANDLTRCQDSMINNSQCHKQGDFPYQVGTELSI QISEDENYIVNKADGPNNTGNPEFPILRTQDSWRKTFLTESQRLNRDQQISIKNKLCQCKKGVDPIGWIS HHDGHRVHKSEKSYRPNDYEKDNMKILTFDHNSMIHTGQKSYQCNECKKPFSDLSSFDLHQQLQSGEKSL TCVERGKGFCYSPVLPVHQKVHVGEKLKCDECGKEFSQGAHLQTHQKVHVIEKPYKCKQCGKGFSRRSAL NVHCKVHTAEKPYNCEECGRAFSQASHLQDHQRLHTGEKPFKCDACGKSFSRNSHLQSHQRVHTGEKPYK CEECGKGFICSSNLYIHQRVHTGEKPYKCEECGKGFSRPSSLQAHQGVHTGEKSYICTVCGKGFTLSSNL QAHQRVHTGEKPYKCNECGKSFRRNSHYQVHLVVHTGEKPYKCEICGKGFSQSSYLQIHQKAHSIEKPFK CEECGQGFNQSSRLQIHQLIHTGEKPYKCEECGKGFSRRADLKIHCRIHTGEKPYNCEECGKVFRQASNL LAHQRVHSGEKPFKCEECGKSFGRSAHLQAHQKVHTGDKPYKCDECGKGFKWSLNLDMHQRVHTGEKPYK CGECGKYFSQASSLQLHQSVHTGEKPYKCDVCGKVFSRSSQLQSHQRVHTGEKPYKCEICGKSFSWRSNL TVHHRIHVGDKSYKSNRGGKNIRESTQEKKSIK. AAAAAAAAAAAAAACDECGKEFSQ GAHLQTHQKVHZZZZZZZZZZZ
      #!/usr/bin/perl -CSDA use utf8; use Modern::Perl; no warnings qw{uninitialized}; use Data::Dumper; use Path::Tiny; my $data = path('file,fasta')->slurp_utf8() =~ s/\s//mgr; warn $data; print $data =~ /$_/ ? "The protein contains the domain -- $_\n" : "The protein doesn't contain the domain -- $_\n" for 'KCKQCGKGFSRRSALNV', 'CGK', 'XXXX'; for my $lookfor (qw{CGK SQRLNR SQR PYKC PYKCK}) { pos $data = 0; while ($data =~ /$lookfor/gc) { print "there is $lookfor at ", (pos $data), "\n"; } } result: AAF88103.1zincfingerprotein226[Homosapiens]MNMFKEAVTFKDVAVAFTEEELGLLGP +AXRKLYRDVMVENFRNLLSVGHPPFKQDVSPIERNEQLWIMTTATRRQGNLGEKNQSKLITVQDRESEE +ELSCWQIWQQIANDLTRCQDSMINNSQCHKQGDFPYQVGTELSIQISEDENYIVNKADGPNNTGNPEFP +ILRTQDSWRKTFLTESQRLNRDQQISIKNKLCQCKKGVDPIGWISHHDGHRVHKSEKSYRPNDYEKDNM +KILTFDHNSMIHTGQKSYQCNECKKPFSDLSSFDLHQQLQSGEKSLTCVERGKGFCYSPVLPVHQKVHV +GEKLKCDECGKEFSQGAHLQTHQKVHVIEKPYKCKQCGKGFSRRSALNVHCKVHTAEKPYNCEECGRAF +SQASHLQDHQRLHTGEKPFKCDACGKSFSRNSHLQSHQRVHTGEKPYKCEECGKGFICSSNLYIHQRVH +TGEKPYKCEECGKGFSRPSSLQAHQGVHTGEKSYICTVCGKGFTLSSNLQAHQRVHTGEKPYKCNECGK +SFRRNSHYQVHLVVHTGEKPYKCEICGKGFSQSSYLQIHQKAHSIEKPFKCEECGQGFNQSSRLQIHQL +IHTGEKPYKCEECGKGFSRRADLKIHCRIHTGEKPYNCEECGKVFRQASNLLAHQRVHSGEKPFKCEEC +GKSFGRSAHLQAHQKVHTGDKPYKCDECGKGFKWSLNLDMHQRVHTGEKPYKCGECGKYFSQASSLQLH +QSVHTGEKPYKCDVCGKVFSRSSQLQSHQRVHTGEKPYKCEICGKSFSWRSNLTVHHRIHVGDKSYKSN +RGGKNIRESTQEKKSIK at ./a.pl line 10. The protein contains the domain -- KCKQCGKGFSRRSALNV The protein contains the domain -- CGK The protein doesn't contain the domain -- XXXX there is CGK at 357 there is CGK at 385 there is CGK at 441 there is CGK at 469 there is CGK at 497 there is CGK at 525 there is CGK at 553 there is CGK at 581 there is CGK at 637 there is CGK at 665 there is CGK at 693 there is CGK at 721 there is CGK at 749 there is CGK at 777 there is CGK at 805 there is SQRLNR at 229 there is SQR at 226 there is PYKC at 380 there is PYKC at 464 there is PYKC at 492 there is PYKC at 548 there is PYKC at 576 there is PYKC at 632 there is PYKC at 716 there is PYKC at 744 there is PYKC at 772 there is PYKC at 800 there is PYKCK at 381

        I think this is better than the previous suggestion. As for $lookfor though, this can be implemented more efficiently with the solution shown in Building Regex Alternations Dynamically.

        By the way, I understand that perl -CSDA and use Modern::Perl; no warnings qw{uninitialized}; are likely your standard boilerplate, but typically code examples should be as self-contained as possible, i.e. not depend on modules that aren't necessary, and also no warnings in general isn't really a best practice. Although I personally sometimes get annoyed by uninitialized warnings myself, it's still not something I would recommend to a newcomer as it can hide problems.

Re: Finding pattern in a file
by Anonymous Monk on Apr 14, 2020 at 14:15 UTC