Extracting string and numbers from a file

shabird has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks! I hope you are all well. I am using the following regex in my code to extract only the IDs, i.e. the first part of each header line (NM_030643.4, NR_029834.1, NM_001198855.1, AC067940.1) of the file, but it returns "APOL4) CYP2C8)" instead.

The file:

 
>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4) 
GAGGTGCTGGGGAGCAGCGTGTTTGCTGTGCTTGATTGTGAGCTGCTGGGAAGTTGTGACTTTCATTTTA
CCTTTCGAATTCCTGGGTATATCTTGGGGGCTGGAGGACGTGTCTGGTTATTATATAGGTGCACAGCTGG
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
ACATGTCAAAGAGACACACAC
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCCTGTGAGCATC
>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
AAATACAACTTTAAATCAAAACGGTAAAAATTCCACTCTTTCATACTAACTTCAAAAGTATTTGCTTTAA
AAAAAAAGNNNNNNNNN

open(GENBANK, "/Users/Desktop/Genes.fasta") or die; 
my $content = join("", <GENBANK>); 
close(GENBANK);

sub mysub{
return shift =~ /(\w+\W+)\n/g;
 
}
my @matches = mysub($content);
print "@matches\n";


Re: Extracting string and numbers from a file
by choroba (Cardinal) on Apr 29, 2020 at 13:50 UTC

  /(\w+\W+)\n/
#          ~~
#          ^
#          |
#        newline

How could it return anything else?

You probably need something like

/>(\S+)/g;

I.e. match > followed by non-whitespace that gets captured.
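A minimal sketch of that suggestion applied to a whole file's contents, in case it helps. The helper name `extract_ids` and the trimmed sample text are illustrative assumptions, and the pattern adds a `^` anchor with `/m` (not part of the pattern above) so that a `>` appearing mid-line would not be picked up:

```perl
use strict;
use warnings;

# Hypothetical helper: return the ID from every header line in $text.
sub extract_ids {
    my ($text) = @_;
    # ^> anchors at the start of each line (/m);
    # (\S+) captures everything up to the first whitespace.
    # In list context, /g returns all captures.
    return $text =~ /^>(\S+)/mg;
}

# Trimmed sample in the same layout as the posted file.
my $sample = <<'END';
>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4)
GAGGTGCTGG
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCC
END

my @ids = extract_ids($sample);
print "@ids\n";    # prints: NM_030643.4 NR_029834.1
```

With the real file, the same sub could be fed the slurped `$content` from the original post.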

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]


Re^2: Extracting string and numbers from a file

by shabird (Sexton) on Apr 29, 2020 at 17:26 UTC

That helped, thank you so much.


Re^3: Extracting string and numbers from a file

by perlfan (Parson) on May 12, 2020 at 03:23 UTC

"APOL4) CYP2C8)"

Interesting result nonetheless.


Re: Extracting string and numbers from a file
by Athanasius (Archbishop) on Apr 29, 2020 at 13:48 UTC

Hello shabird,

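Your regex says: match one or more word characters, followed by one or more non-word characters, followed immediately by a newline; and return the characters matched minus the newline. This won’t work, because the IDs sit at the start of each header line and are followed by a space and a description, not by a newline.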
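As a rough illustration of that reading, the sketch below runs the original pattern over a trimmed copy of the posted headers (the descriptions are shortened and the sub name `original_match` is made up for the example). Only text that ends in a non-word character right before the newline can match, which is why the two gene symbols with their closing parentheses come back:

```perl
use strict;
use warnings;

# The pattern from the question: word characters, then non-word
# characters, then a newline (the newline is outside the capture).
sub original_match {
    my ($text) = @_;
    return $text =~ /(\w+\W+)\n/g;
}

# Trimmed sample; note the trailing space after "(APOL4)" as in the post.
my $text = ">NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4) \n"
         . "GAGGTGCTGG\n"
         . ">NM_001198855.1 Homo sapiens cytochrome P450 (CYP2C8)\n"
         . "ACATGTC\n"
         . ">NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA\n"
         . "CCGGGCC\n";

# Brackets make the trailing space in the first capture visible.
print "[$_]\n" for original_match($text);
# prints:
# [APOL4) ]
# [CYP2C8)]
```

Lines that end in a word character (the sequence lines and the "microRNA" header) never match, since `\W+` would have to consume the only newline and leave none for the final `\n`.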

What you need is a way to uniquely identify the IDs. From the file contents shown, it looks as though each ID is immediately preceded by a > character. If so, you could use something like this:

#! perl

use strict;
use warnings;
use Data::Dump;

my   @matches;
push @matches, mysub($_) for <DATA>;
dd  \@matches;

sub mysub
{
    return shift =~ / > (\S+) \s /gx;
}

__DATA__
 
>NM_030643.4 Homo sapiens apolipoprotein L4 (APOL4) 
GAGGTGCTGGGGAGCAGCGTGTTTGCTGTGCTTGATTGTGAGCTGCTGGGAAGTTGTGACTTTCATTTTA
CCTTTCGAATTCCTGGGTATATCTTGGGGGCTGGAGGACGTGTCTGGTTATTATATAGGTGCACAGCTGG
>NM_001198855.1 Homo sapiens cytochrome P450 family 2 subfamily C member 8 (CYP2C8)
ACATGTCAAAGAGACACACAC
>NR_029834.1 Homo sapiens microRNA 200a (MIR200A), microRNA
CCGGGCCCCTGTGAGCATC
>AC067940.1 Homo sapiens clone RP11-818E9, LOW-PASS SEQUENCE SAMPLING
AAATACAACTTTAAATCAAAACGGTAAAAATTCCACTCTTTCATACTAACTTCAAAAGTATTTGCTTTAA
AAAAAAAGNNNNNNNNN

Output:

23:46 >perl 2038_SoPW.pl
["NM_030643.4", "NM_001198855.1", "NR_029834.1", "AC067940.1"]

23:46 >

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,