Regex / "Sophisticated" End of Line

nofutur45 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I would like to ask for your help with the following regex. I tried a few matching expressions, but they did not produce the desired output.

I have a text file full of lines as follows:

&#160; ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 s
+traight chain     NP_001120800.1425 aa 
&#160; ACADM, Pan troglodytesacyl-Coenzyme A dehydrogenase, C-4 to C-1
+2 straight chain     XP_524741.2453 aa 
&#160; ACADM, Canis lupus familiarisacyl-Coenzyme A dehydrogenase, C-4
+ to C-12 straight chain     XP_547328.2444 aa 
&#160; ACADM, Bos taurusacyl-coenzyme A dehydrogenase, C-4 to C-12 str
+aight chain     NP_001068703.1421 aa 
&#160; Acadm, Mus musculusacyl-Coenzyme A dehydrogenase, medium chain 
+    NP_031408.1421 aa 
&#160; Acadm, Rattus norvegicusacyl-Coenzyme A dehydrogenase, C-4 to C
+-12 straight chain     NP_058682.14
&#160; ACADVL, Homo sapiensacyl-Coenzyme A dehydrogenase, very long ch
+ain     NP_000009.1655 aa 
&#160; ACADVL, Canis lupus familiarisacyl-Coenzyme A dehydrogenase, ve
+ry long chain     XP_546581.2650 aa 
&#160; ACADVL, Bos taurusacyl-coenzyme A dehydrogenase, very long chai
+n     NP_776919.1655 aa 
&#160; Acadvl, Mus musculusacyl-Coenzyme A dehydrogenase, very long ch
+ain     NP_059062.1656 aa 
&#160; Acadvl, Rattus norvegicusacyl-Coenzyme A dehydrogenase, very lo
+ng chain     NP_037023.1655 aa 
&#160; acadvl, Danio rerioacyl-Coenzyme A dehydrogenase, very long cha
+in     NP_997776.1659 aa 
&#160; CG7461, Drosophila melanogasterCG7461     NP_611409.1655 aa 
&#160; AgaP_AGAP008769, Anopheles gambiaeAGAP008769-PA     XP_314893.4
+624 aa 
&#160; acdh-12, Caenorhabditis elegansAcyl CoA DeHydrogenase     NP_00
+1022062.1613 aa
&#160; ALAS2, Homo sapiensaminolevulinate, delta-, synthase 2     NP_0
+00023.2587 aa 
&#160; ALAS2, Pan troglodytesaminolevulinate, delta-, synthase 2     X
+P_001147099.1583 aa 
&#160; ALAS2, Canis lupus familiarisaminolevulinate, delta-, synthase 
+2     XP_548619.2587 aa 
&#160; ALAS2, Bos taurusaminolevulinate, delta-, synthase 2     NP_001
+030275.1587 aa 
&#160; Alas2, Mus musculusaminolevulinic acid synthase 2, erythroid   
+  NP_033783.1587 aa 
&#160; Alas2, Rattus norvegicusaminolevulinate, delta-, synthase 2    
+ XP_001062452.1267 aa 
&#160; alas2, Danio rerioaminolevulinate, delta-, synthetase 2     NP_
+571757.1583 aa 
&#160; hem1, Schizosaccharomyces pombe5-aminolevulinate synthase     N
+P_594388.1558 aa 
&#160; HEM1, Saccharomyces cerevisiae5-aminolevulinate synthase, catal
+yzes the first step in the heme biosynthetic pathway; an N-terminal s
+ignal sequence is required for localization to the mitochondrial matr
+ix; expression is regulated by Hap2p-Hap3p     NP_010518.1548 aa 
&#160; HEM1_KLULA, Kluyveromyces lactishypothetical protein     XP_452
+875.1570 aa 
&#160; AGOS_ABL104C, Eremothecium gossypiiABL104Cp     NP_982843.1556 
+aa 
&#160; MGG_06446, Magnaporthe grisea5-aminolevulinate synthase, mitoch
+ondrial precursor     XP_369931.2615 aa 
&#160; NCU06189.1, Neurospora crassahypothetical protein ( (AB071862) 
+5-aminolevulinate synthase [Gibberella fujikuroi] )     XP_326044.154
+4 aa

I need to extract the NP_ / XP_ combination for Homo sapiens only. I reasoned that I should first make sure that the selected line includes Homo sapiens, and then pick up the combination by matching backwards from the end of the line. So I came up with the following expression:

elsif($line =~ /^\S+\s+\S+\s+Homo sapiens(\S+)\s\S+$/)

The text between Homo sapiens and the combination varies in length, particularly in the number of words. Therefore, no single fixed pattern would fit all possible in-between text.

I thought this expression would work because I guessed it would anchor the match from the beginning of the line up to the end of "sapiens", and from the end of the line back to the text that follows "sapiens". However, it does not work.

I would be very happy if you could help me use $ to capture not the last string but a particular string counted backwards from the end of the line.

Thank you

Replies are listed 'Best First'.
Re: Regex / "Sophisticated" End of Line by kcott (Archbishop) on Oct 27, 2010 at 03:44 UTC
You were almost there. This works: `$ perl -wE 'while (<>) {if (m{ \A \S+ \s+ \S+ \s+ Homo \s+ sapiens .*? ([NX] \S+) \s+ \S+ \s+ \z }msx) {say $1}}' < 867580.dat`
It prints `NP_001120800.1425`, `NP_000009.1655` and `NP_000023.2587`.
There's some whitespace at the end of the lines, so your final `\S+$` wasn't matching.
-- Ken
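To illustrate the point about trailing whitespace, here is a minimal sketch. It assumes the first sample line from the question, unwrapped, with the leading `&#160;` kept as literal text and one trailing space. The helper name `accession_field` is made up for this example and is not part of anyone's posted solution.

```perl
use strict;
use warnings;

# Sample line from the question, unwrapped, trailing space preserved.
my $line = '&#160; ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain     NP_001120800.1425 aa ';

# Hypothetical helper: return the second-to-last run of non-space
# characters, allowing optional whitespace before the end of the line.
sub accession_field {
    my ($text) = @_;
    return $text =~ /(\S+)\s+\S+\s*$/ ? $1 : undef;
}

# Without \s* the final \S+ must end exactly at the end of the line,
# so the trailing space makes this stricter pattern fail.
my $strict = $line =~ /(\S+)\s\S+$/ ? $1 : undef;

print defined $strict ? "strict: $strict\n" : "strict: no match\n";
print "tolerant: ", accession_field($line), "\n";
```

With this sample line the strict pattern reports no match, while the tolerant one prints `NP_001120800.1425`.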
Re: Regex / "Sophisticated" End of Line by zwon (Abbot) on Oct 27, 2010 at 01:07 UTC
I don't exactly understand what you are trying to do. Could you tell us what you want to extract from the line `ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_001120800.1425 aa`? Is it `acyl-Coenzyme` or `NP_001120800.1425`?
Re^2: Regex / "Sophisticated" End of Line by nofutur45 (Initiate) on Oct 27, 2010 at 01:17 UTC
Hi zwon,
Thank you for your reply. In your example, it would be `NP_001120800.1425`. The following script by james2vegas, with a little modification, solved my problem: `elsif (m/, Homo sapiens/) { my ($human) = m/((?:XP_|NP_)[\d. ]+)\s+/; $human = $1; print OUTFILE $1 . "\t"; }`
So it works in two parts: first get the line that includes ", Homo sapiens", and then look for the NP_ or XP_ combination.
I was curious whether there is a way to pick up, say, the second "compact word" (I do not know how to say this properly, but I mean a series of non-space characters) from the end. Another kind user provided an answer to this question: his solution splits the line on the space character, puts the elements into an array, and picks element -2, which is the second from the end.
Thank you for your help, guys.
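The split-and-index idea described above might look roughly like the sketch below. It is not the other user's actual code, which does not appear in this thread. The helper name `second_from_end` is hypothetical, and the three sample lines are copied from the question and unwrapped.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on whitespace and return the second
# field from the end, or undef when there are fewer than two fields.
sub second_from_end {
    my ($text) = @_;
    my @fields = split ' ', $text;   # ' ' ignores leading and trailing whitespace
    return @fields >= 2 ? $fields[-2] : undef;
}

# Three sample lines from the question (two human, one chimpanzee).
my @lines = (
    '&#160; ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain     NP_001120800.1425 aa ',
    '&#160; ACADM, Pan troglodytesacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain     XP_524741.2453 aa ',
    '&#160; ACADVL, Homo sapiensacyl-Coenzyme A dehydrogenase, very long chain     NP_000009.1655 aa ',
);

# Keep only the Homo sapiens lines, then take the second field from the end.
my @found = map { second_from_end($_) } grep { /, Homo sapiens/ } @lines;
print "$_\n" for @found;
```

For these samples it prints `NP_001120800.1425` and `NP_000009.1655`. The approach relies on every matching line ending with the accession followed by a final `aa` field. A line that lacks the trailing `aa` would return a different word.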
Re: Regex / "Sophisticated" End of Line by aquarium (Curate) on Oct 27, 2010 at 02:46 UTC
It's easier to read, modify and extend if you separate your matching requirements rather than using a convoluted look-behind or look-ahead expression. Untested: `if ($line =~ /Homo sapiens/) { if ($line =~ /(NP_|XP_)([^ ]+)/) { $matched = $1 . $2; ...do other stuff... } }` or similar.
the hardest line to type correctly is: stty erase ^H
Re: Regex / "Sophisticated" End of Line by morgon (Priest) on Oct 27, 2010 at 01:22 UTC
Not sure if I understand correctly what you are trying to do, but I assume that you want to extract the words that start with either "NP_" or "XP_" from all lines that contain "Homo sapiens". If that is what you want, it is easy (assuming a loop through your file): `while (my $line = <$fh>) { my ($nm_xp) = $line =~ /Homo sapiens.*\s(NP_\w*|XP_\w*)/; next unless $nm_xp; print "$nm_xp for a great ape found at line $.\n"; }`
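One detail worth checking with a single-regex, list-context capture like this is which character class follows the prefix. `\w` does not match a dot, so `\w*` stops at the first dot in the token, while `\S*` runs to the next whitespace. The sketch below assumes one unwrapped sample line from the question. The helper name `human_token` is hypothetical.

```perl
use strict;
use warnings;

# One Homo sapiens line from the question, unwrapped.
my $line = '&#160; ACADVL, Homo sapiensacyl-Coenzyme A dehydrogenase, very long chain     NP_000009.1655 aa ';

# Hypothetical helper: capture the NP_/XP_ token on a Homo sapiens line.
# $tail_re is the pattern used for the characters after the prefix.
sub human_token {
    my ($text, $tail_re) = @_;
    my ($token) = $text =~ /Homo sapiens.*\s((?:NP|XP)_$tail_re)/;
    return $token;   # undef when the line does not match
}

print human_token($line, qr/\w*/), "\n";   # stops before the dot
print human_token($line, qr/\S*/), "\n";   # keeps everything up to the next space
```

For this line the first call prints `NP_000009` and the second prints `NP_000009.1655`, so the choice depends on whether the part after the dot is wanted.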
Re: Regex / "Sophisticated" End of Line by umasuresh (Hermit) on Oct 27, 2010 at 01:51 UTC
Try something like this on the Linux or Cygwin command line: `egrep "Homo sapiens" file_name | egrep "NP|XP"`
Note: not tested!