shabird has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks! I hope you are all well. I am using the following regex in my code to extract only the IDs, i.e. the first part of each header line (NM_030643.4, NR_029834.1, NM_001198855.1, AC067940.1) of the file, but it returns "APOL4) CYP2C8)" instead.

The file:

>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4)
GAGGTGCTGGGGAGCAGCGTGTTTGCTGTGCTTGATTGTGAGCTGCTGGGAAGTTGTGACTTTCATTTTA
CCTTTCGAATTCCTGGGTATATCTTGGGGGCTGGAGGACGTGTCTGGTTATTATATAGGTGCACAGCTGG
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
ACATGTCAAAGAGACACACAC
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCCTGTGAGCATC
>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
AAATACAACTTTAAATCAAAACGGTAAAAATTCCACTCTTTCATACTAACTTCAAAAGTATTTGCTTTAA
AAAAAAAGNNNNNNNNN

The code:

open(GENBANK, "/Users/Desktop/Genes.fasta") or die;
my $content = join("", <GENBANK>);
close(GENBANK);

sub mysub {
    return shift =~ /(\w+\W+)\n/g;
}

my @matches = mysub($content);
print "@matches\n";

Replies are listed 'Best First'.
Re: Extracting string and numbers from a file
by choroba (Cardinal) on Apr 29, 2020 at 13:50 UTC
    The regex matches before a newline:
    /(\w+\W+)\n/
    #        ~~
    #        ^
    #        |
    #        newline

    How could it return anything else?

    You probably need something like

    />(\S+)/g;

    I.e. match > followed by non-whitespace that gets captured.
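A minimal sketch of that suggestion, applied to slurped text the way the original code does. The sequences are shortened here and the variable names are only illustrative; the one thing taken from the reply is the regex itself.

```perl
use strict;
use warnings;

# Shortened stand-in for the slurped file (sequence lines truncated).
my $content = <<'FASTA';
>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4)
GAGGTGCTGGGGAGCAGCGTG
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
ACATGTCAAAGAGACACACAC
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCCTGTGAGCATC
>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
AAATACAACTTTAAATC
FASTA

# In list context, a /g match returns every captured group,
# so @ids holds whatever followed each '>' up to the first whitespace.
my @ids = $content =~ />(\S+)/g;

print "@ids\n";    # prints: NM_030643.4 NM_001198855.1 NR_029834.1 AC067940.1
```

This relies on '>' never appearing inside a description or sequence line, which holds for the sample shown but is an assumption about the real file.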

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      That helped, thank you so much!

        "APOL4) CYP2C8)"

        Interesting result nonetheless.

Re: Extracting string and numbers from a file
by Athanasius (Archbishop) on Apr 29, 2020 at 13:48 UTC

    Hello shabird,

    Your regex says: match one or more word characters, followed by one or more non-word characters, followed immediately by a newline; and return the characters matched minus the newline. This won’t work.
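To see why the output was "APOL4) CYP2C8)", the original regex can be run against a shortened copy of the data. This is only an illustrative sketch: the sequences are truncated and the text is held in a string rather than read from the file.

```perl
use strict;
use warnings;

# Shortened stand-in for the file in the question (sequence lines truncated).
my $content = <<'FASTA';
>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4)
GAGGTGCTGGGGAGCAGCGTG
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
ACATGTCAAAGAGACACACAC
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCCTGTGAGCATC
>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
AAATACAACTTTAAATC
FASTA

# The regex from the question, applied to the whole text at once.
my @matches = $content =~ /(\w+\W+)\n/g;

# Only two lines end in "word characters, then a non-word character,
# then newline": the headers that finish with a closing parenthesis.
# Every other line ends in a word character, so \W+ has nothing to match.
print "@matches\n";    # prints: APOL4) CYP2C8)
```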

    What you need is a way to uniquely identify the IDs. From the file contents shown, it looks as though each ID is immediately preceded by a > character. If so, you could use something like this:

    #! perl
    use strict;
    use warnings;
    use Data::Dump;

    my @matches;
    push @matches, mysub($_) for <DATA>;
    dd \@matches;

    sub mysub
    {
        return shift =~ / > (\S+) \s /gx;
    }

    __DATA__
    >NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4)
    GAGGTGCTGGGGAGCAGCGTGTTTGCTGTGCTTGATTGTGAGCTGCTGGGAAGTTGTGACTTTCATTTTA
    CCTTTCGAATTCCTGGGTATATCTTGGGGGCTGGAGGACGTGTCTGGTTATTATATAGGTGCACAGCTGG
    >NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
    ACATGTCAAAGAGACACACAC
    >NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
    CCGGGCCCCTGTGAGCATC
    >AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
    AAATACAACTTTAAATCAAAACGGTAAAAATTCCACTCTTTCATACTAACTTCAAAAGTATTTGCTTTAA
    AAAAAAAGNNNNNNNNN

    Output:

    23:46 >perl 2038_SoPW.pl
    ["NM_030643.4", "NM_001198855.1", "NR_029834.1", "AC067940.1"]

    23:46 >
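For a real file rather than a DATA section, the same idea might be sketched with a lexical filehandle read line by line. The helper name extract_ids is hypothetical, and an in-memory string stands in for the file from the question so the example runs on its own; anchoring the pattern with ^ is an extra assumption that IDs only ever appear at the start of a header line.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the first token of every FASTA header line.
sub extract_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        # Header lines start with '>'; the ID runs up to the first whitespace.
        push @ids, $1 if $line =~ /^>(\S+)/;
    }
    return @ids;
}

# In-memory stand-in for the real file (an assumption for this sketch).
my $fasta = <<'FASTA';
>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4)
GAGGTGCTGGGGAGCAGCGTG
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
ACATGTCAAAGAGACACACAC
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCCTGTGAGCATC
>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
AAATACAACTTTAAATC
FASTA

# Three-argument open; a reference to a scalar opens it as an in-memory file.
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my @ids = extract_ids($fh);
close $fh;

print "$_\n" for @ids;
```

To use it on disk, the scalar reference would be replaced by the path to the FASTA file; reading line by line also avoids holding a large file in memory.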

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Works just fine. Thank you!