Re: Finding pattern in a file

Re^2: Finding pattern in a file by AnomalousMonk (Archbishop) on Apr 14, 2020 at 20:50 UTC
But IIUC, the pattern to be searched for may be broken across multiple lines. How would your code handle, e.g., the record

    >AAF88103.1 zinc finger protein 226 [Homo sapiens]
    AAAAAAAAAAAAAACDECGKEFSQ
    GAHLQTHQKVHZZZZZZZZZZZ

(assuming we're now dealing with kosher FASTA files)?

Give a man a fish: `<%-{-{-{-<`
Re^3: Finding pattern in a file by tybalt89 (Monsignor) on Apr 14, 2020 at 23:08 UTC
Here's one way (snicker). It shows the location of each occurrence, even if they are overlapping.

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11115501
    use warnings;

    $_ = do { local $/; <DATA> }; # or however you want to read the file
    my $pattern = 'CDECGKEFSQGAHLQTHQKVH' =~ s/\B/\n?/gr;
    print lc($`) . $1 . lc(substr $', length $1) . "\n" while /(?=($pattern))/g;

    __DATA__
    >AAF88103.1 zinc finger protein 226 [Homo sapiens]
    MNMFKEAVTFKDVAVAFTEEELGLLGPAXRKLYRDVMVENFRNLLSVGHPPFKQDVSPIERNEQLWIMTT
    ATRRQGNLGEKNQSKLITVQDRESEEELSCWQIWQQIANDLTRCQDSMINNSQCHKQGDFPYQVGTELSI
    QISEDENYIVNKADGPNNTGNPEFPILRTQDSWRKTFLTESQRLNRDQQISIKNKLCQCKKGVDPIGWIS
    HHDGHRVHKSEKSYRPNDYEKDNMKILTFDHNSMIHTGQKSYQCNECKKPFSDLSSFDLHQQLQSGEKSL
    TCVERGKGFCYSPVLPVHQKVHVGEKLKCDECGKEFSQGAHLQTHQKVHVIEKPYKCKQCGKGFSRRSAL
    NVHCKVHTAEKPYNCEECGRAFSQASHLQDHQRLHTGEKPFKCDACGKSFSRNSHLQSHQRVHTGEKPYK
    CEECGKGFICSSNLYIHQRVHTGEKPYKCEECGKGFSRPSSLQAHQGVHTGEKSYICTVCGKGFTLSSNL
    QAHQRVHTGEKPYKCNECGKSFRRNSHYQVHLVVHTGEKPYKCEICGKGFSQSSYLQIHQKAHSIEKPFK
    CEECGQGFNQSSRLQIHQLIHTGEKPYKCEECGKGFSRRADLKIHCRIHTGEKPYNCEECGKVFRQASNL
    LAHQRVHSGEKPFKCEECGKSFGRSAHLQAHQKVHTGDKPYKCDECGKGFKWSLNLDMHQRVHTGEKPYK
    CGECGKYFSQASSLQLHQSVHTGEKPYKCDVCGKVFSRSSQLQSHQRVHTGEKPYKCEICGKSFSWRSNL
    TVHHRIHVGDKSYKSNRGGKNIRESTQEKKSIK.
    AAAAAAAAAAAAAACDECGKEFSQ
    GAHLQTHQKVHZZZZZZZZZZZ
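To make the two tricks in that one-liner easier to see, here is a minimal sketch on a toy string. The helper name `overlapping_offsets` and the sample text are illustrative assumptions, not part of the post. It assumes the search string consists only of word characters (as amino-acid letters do), so `\B` matches between every pair of adjacent characters and `s/\B/\n?/gr` inserts an optional newline there. Wrapping the pattern in a lookahead `(?=(...))` lets `/g` report matches that overlap.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: returns the start offset of every
# (possibly overlapping) occurrence of $literal in $text, allowing a single
# newline between any two characters of $literal.
sub overlapping_offsets {
    my ($text, $literal) = @_;

    # 'ABA' becomes "A\n?B\n?A": an optional newline between each letter.
    # Assumes $literal contains only word characters (\w).
    my $pattern = $literal =~ s/\B/\n?/gr;

    my @offsets;
    # The lookahead consumes nothing, so the next search starts one
    # character further on and overlapping occurrences are still found.
    # $-[1] is the start offset of capture group 1.
    push @offsets, $-[1] while $text =~ /(?=($pattern))/g;
    return @offsets;
}

# "ABA" occurs at offset 0, and again at offset 2 split across the newline.
print join(',', overlapping_offsets("ABAB\nA", 'ABA')), "\n";   # prints 0,2
```

The `/r` modifier needs Perl 5.14 or later. A search string containing regex metacharacters would need escaping first, which this sketch does not attempt.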
Re^2: Finding pattern in a file by leszekdubiel (Scribe) on Apr 15, 2020 at 22:55 UTC
    #!/usr/bin/perl -CSDA

    use utf8;
    use Modern::Perl;
    no warnings qw{uninitialized};
    use Data::Dumper;
    use Path::Tiny;

    my $data = path('file,fasta')->slurp_utf8() =~ s/\s//mgr;
    warn $data;

    print $data =~ /$_/ ? "The protein contains the domain -- $_\n" : "The protein doesn't contain the domain -- $_\n"
        for 'KCKQCGKGFSRRSALNV', 'CGK', 'XXXX';

    for my $lookfor (qw{CGK SQRLNR SQR PYKC PYKCK}) {
        pos $data = 0;
        while ($data =~ /$lookfor/gc) {
            print "there is $lookfor at ", (pos $data), "\n";
        }
    }

result:

    AAF88103.1zincfingerprotein226[Homosapiens]MNMFKEAVTFKDVAVAFTEEELGLLGPAXRKLYRDVMVENFRNLLSVGHPPFKQDVSPIERNEQLWIMTTATRRQGNLGEKNQSKLITVQDRESEEELSCWQIWQQIANDLTRCQDSMINNSQCHKQGDFPYQVGTELSIQISEDENYIVNKADGPNNTGNPEFPILRTQDSWRKTFLTESQRLNRDQQISIKNKLCQCKKGVDPIGWISHHDGHRVHKSEKSYRPNDYEKDNMKILTFDHNSMIHTGQKSYQCNECKKPFSDLSSFDLHQQLQSGEKSLTCVERGKGFCYSPVLPVHQKVHVGEKLKCDECGKEFSQGAHLQTHQKVHVIEKPYKCKQCGKGFSRRSALNVHCKVHTAEKPYNCEECGRAFSQASHLQDHQRLHTGEKPFKCDACGKSFSRNSHLQSHQRVHTGEKPYKCEECGKGFICSSNLYIHQRVHTGEKPYKCEECGKGFSRPSSLQAHQGVHTGEKSYICTVCGKGFTLSSNLQAHQRVHTGEKPYKCNECGKSFRRNSHYQVHLVVHTGEKPYKCEICGKGFSQSSYLQIHQKAHSIEKPFKCEECGQGFNQSSRLQIHQLIHTGEKPYKCEECGKGFSRRADLKIHCRIHTGEKPYNCEECGKVFRQASNLLAHQRVHSGEKPFKCEECGKSFGRSAHLQAHQKVHTGDKPYKCDECGKGFKWSLNLDMHQRVHTGEKPYKCGECGKYFSQASSLQLHQSVHTGEKPYKCDVCGKVFSRSSQLQSHQRVHTGEKPYKCEICGKSFSWRSNLTVHHRIHVGDKSYKSNRGGKNIRESTQEKKSIK at ./a.pl line 10.
    The protein contains the domain -- KCKQCGKGFSRRSALNV
    The protein contains the domain -- CGK
    The protein doesn't contain the domain -- XXXX
    there is CGK at 357
    there is CGK at 385
    there is CGK at 441
    there is CGK at 469
    there is CGK at 497
    there is CGK at 525
    there is CGK at 553
    there is CGK at 581
    there is CGK at 637
    there is CGK at 665
    there is CGK at 693
    there is CGK at 721
    there is CGK at 749
    there is CGK at 777
    there is CGK at 805
    there is SQRLNR at 229
    there is SQR at 226
    there is PYKC at 380
    there is PYKC at 464
    there is PYKC at 492
    there is PYKC at 548
    there is PYKC at 576
    there is PYKC at 632
    there is PYKC at 716
    there is PYKC at 744
    there is PYKC at 772
    there is PYKC at 800
    there is PYKCK at 381
Re^3: Finding pattern in a file by haukex (Archbishop) on Apr 16, 2020 at 20:28 UTC
I think this is better than the previous suggestion. As for `$lookfor` though, this can be implemented more efficiently with the solution shown in Building Regex Alternations Dynamically. By the way, I understand that `perl -CSDA` and `use Modern::Perl; no warnings qw{uninitialized};` are likely your standard boilerplate, but typically code examples should be as self-contained as possible, i.e. not depend on modules that aren't necessary, and also `no warnings` in general isn't really a best practice. Although I personally sometimes get annoyed by `uninitialized` warnings myself, it's still not something I would recommend to a newcomer as it can hide problems.
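As a rough illustration of the alternation idea mentioned above, the search terms can be joined into a single regex so that the data is scanned once instead of once per term. This is a sketch of the general technique under stated assumptions, not the code from the linked node; the helper name `find_terms` and the sample string are made up for the example. It uses only core Perl, in keeping with the advice about self-contained examples.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: returns [term, end position]
# pairs for each non-overlapping match of any of @terms in $data.
sub find_terms {
    my ($data, @terms) = @_;

    # Longest terms first, so PYKCK is tried before its prefix PYKC.
    # quotemeta() guards against regex metacharacters in the terms.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length $b <=> length $a or $a cmp $b } @terms;
    my $regex = qr/($alternation)/;

    my @hits;
    while ($data =~ /$regex/g) {
        # pos() is the offset just past the match, as in the loop above.
        push @hits, [ $1, pos($data) ];
    }
    return @hits;
}

my @hits = find_terms('AAPYKCKQCGKGFSQRLNRZZ', qw{CGK SQRLNR SQR PYKC PYKCK});
print "there is $_->[0] at $_->[1]\n" for @hits;
```

Note the behavioral difference from the per-term loop: a single alternation reports one match per position and then moves past it, so here `PYKC` and `SQR` are not listed separately when the longer `PYKCK` and `SQRLNR` match at the same spot. If every term must be reported independently, the per-term loop (or a lookahead-based pattern) is still needed.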