nofutur45 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I would like to ask for your help with the following regex. I tried a few matching expressions, but they did not produce the desired output.

I have a text file full of lines as follows:

  ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_001120800.1425 aa
  ACADM, Pan troglodytesacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain XP_524741.2453 aa
  ACADM, Canis lupus familiarisacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain XP_547328.2444 aa
  ACADM, Bos taurusacyl-coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_001068703.1421 aa
  Acadm, Mus musculusacyl-Coenzyme A dehydrogenase, medium chain NP_031408.1421 aa
  Acadm, Rattus norvegicusacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_058682.14
  ACADVL, Homo sapiensacyl-Coenzyme A dehydrogenase, very long chain NP_000009.1655 aa
  ACADVL, Canis lupus familiarisacyl-Coenzyme A dehydrogenase, very long chain XP_546581.2650 aa
  ACADVL, Bos taurusacyl-coenzyme A dehydrogenase, very long chain NP_776919.1655 aa
  Acadvl, Mus musculusacyl-Coenzyme A dehydrogenase, very long chain NP_059062.1656 aa
  Acadvl, Rattus norvegicusacyl-Coenzyme A dehydrogenase, very long chain NP_037023.1655 aa
  acadvl, Danio rerioacyl-Coenzyme A dehydrogenase, very long chain NP_997776.1659 aa
  CG7461, Drosophila melanogasterCG7461 NP_611409.1655 aa
  AgaP_AGAP008769, Anopheles gambiaeAGAP008769-PA XP_314893.4624 aa
  acdh-12, Caenorhabditis elegansAcyl CoA DeHydrogenase NP_001022062.1613 aa
  ALAS2, Homo sapiensaminolevulinate, delta-, synthase 2 NP_000023.2587 aa
  ALAS2, Pan troglodytesaminolevulinate, delta-, synthase 2 XP_001147099.1583 aa
  ALAS2, Canis lupus familiarisaminolevulinate, delta-, synthase 2 XP_548619.2587 aa
  ALAS2, Bos taurusaminolevulinate, delta-, synthase 2 NP_001030275.1587 aa
  Alas2, Mus musculusaminolevulinic acid synthase 2, erythroid NP_033783.1587 aa
  Alas2, Rattus norvegicusaminolevulinate, delta-, synthase 2 XP_001062452.1267 aa
  alas2, Danio rerioaminolevulinate, delta-, synthetase 2 NP_571757.1583 aa
  hem1, Schizosaccharomyces pombe5-aminolevulinate synthase NP_594388.1558 aa
  HEM1, Saccharomyces cerevisiae5-aminolevulinate synthase, catalyzes the first step in the heme biosynthetic pathway; an N-terminal signal sequence is required for localization to the mitochondrial matrix; expression is regulated by Hap2p-Hap3p NP_010518.1548 aa
  HEM1_KLULA, Kluyveromyces lactishypothetical protein XP_452875.1570 aa
  AGOS_ABL104C, Eremothecium gossypiiABL104Cp NP_982843.1556 aa
  MGG_06446, Magnaporthe grisea5-aminolevulinate synthase, mitochondrial precursor XP_369931.2615 aa
  NCU06189.1, Neurospora crassahypothetical protein ( (AB071862) 5-aminolevulinate synthase [Gibberella fujikuroi] ) XP_326044.1544 aa

I need to extract the NP_ / XP_ combination for Homo sapiens only. I reasoned that I should first make sure that the selected line includes Homo sapiens, and then pick up the combination by matching backwards from the end of the line. So I came up with the following expression:

elsif($line =~ /^\S+\s+\S+\s+Homo sapiens(\S+)\s\S+$/)

The text between Homo sapiens and the combination varies in length, particularly in the number of words. Therefore, no single fixed pattern fits all the possible intervening text.

I thought this expression would work because I guessed it would match forwards from the beginning of the line up to "sapiens" and backwards from the end of the line to the text right after "sapiens". However, it does not work.

I would be very happy if you could help me use $ to capture not the last string on the line but a particular string counted backwards from the end.

Thank you

Replies are listed 'Best First'.
Re: Regex / "Sophisticated" End of Line
by kcott (Archbishop) on Oct 27, 2010 at 03:44 UTC

    You were almost there. This works:

    $ perl -wE 'while (<>) {if (m{ \A \S+ \s+ \S+ \s+ Homo \s+ sapiens .*? ([NX] \S+) \s+ \S+ \s+ \z }msx) {say $1}}' < 867580.dat
    NP_001120800.1425
    NP_000009.1655
    NP_000023.2587

    There's some whitespace at the end of the lines so your final \S+$ wasn't matching.

    -- Ken
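
The point about trailing whitespace can be illustrated with a minimal sketch. The helper name `second_from_end` is hypothetical (it does not appear in the thread), and the sample line is one of the posted records with two trailing spaces added as an assumption about the real file:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the second-to-last
# whitespace-separated token of a line, tolerating trailing whitespace.
sub second_from_end {
    my ($line) = @_;
    # (\S+)   the token we want
    # \s+\S+  the final token (e.g. "aa")
    # \s*\z   optional trailing spaces/newline before the true end of string
    return $line =~ /(\S+)\s+\S+\s*\z/ ? $1 : undef;
}

# Sample record; the trailing spaces are an assumption about the input file.
my $line = "  ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, "
         . "C-4 to C-12 straight chain NP_001120800.1425 aa  \n";

# Without \s*, the trailing spaces keep \S+$ from reaching the end of the line.
print "strict anchor:   ", ($line =~ /(\S+)\s+\S+$/ ? $1 : "no match"), "\n";
print "tolerant anchor: ", second_from_end($line), "\n";
```

Under those assumptions the strict pattern reports no match, while the tolerant one returns NP_001120800.1425. Note that `$` only forgives a single trailing newline, not trailing spaces.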

Re: Regex / "Sophisticated" End of Line
by zwon (Abbot) on Oct 27, 2010 at 01:07 UTC

    I don't quite understand what you are trying to do. Could you tell us what you want to extract from this line?

      ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_001120800.1425 aa

    Is it acyl-Coenzyme or NP_001120800.1425?

      Hi zwon,

      Thank you for your reply. In your example, it would be NP_001120800.1425.

      The following script by james2vegas - with a little modification - solved my problem:

      elsif (m/, Homo sapiens/) {
          my ($human) = m/((?:XP_|NP_)[\d. ]+)\s+/;
          $human = $1;
          print OUTFILE $1 . "\t";

      So it works in two parts: First get the line which includes ", Homo sapiens", and then look for the NP_ or XP_ combination
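
As a self-contained illustration of the two-part idea, here is a minimal sketch. The helper name `human_accession` and the pattern `(?:NP|XP)_\S+` are assumptions made for this example, not the exact code from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: step 1 filters on the species,
# step 2 extracts the first NP_/XP_ token from the same line.
sub human_accession {
    my ($line) = @_;
    return unless $line =~ /, Homo sapiens/;       # step 1: human records only
    my ($acc) = $line =~ /\b((?:NP|XP)_\S+)/;       # step 2: grab the accession
    return $acc;
}

# Three records taken from the sample data in the question.
my @lines = (
    "  ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_001120800.1425 aa  ",
    "  ACADM, Pan troglodytesacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain XP_524741.2453 aa  ",
    "  ACADVL, Homo sapiensacyl-Coenzyme A dehydrogenase, very long chain NP_000009.1655 aa  ",
);

for my $line (@lines) {
    my $acc = human_accession($line);
    print "$acc\n" if defined $acc;
}
```

With these three records the loop prints NP_001120800.1425 and NP_000009.1655 and skips the Pan troglodytes line. Keeping the species test separate from the extraction means neither pattern has to describe the variable text in between.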

      I was also curious whether there is a way to pick up, say, the second "compact word" from the end (I do not know the proper term, but I mean a run of non-space characters). Another kind user answered this: his solution splits the line on the space character, puts the elements into an array, and picks the element at index -2, which is the second from the end.
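
The split-based answer is not quoted in this thread, so the following is only a sketch of the idea. The helper name `second_last_field` is hypothetical, and it uses `split ' '` rather than a literal single-space split:

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace and take the second field from the end.
sub second_last_field {
    my ($line) = @_;
    # split ' ' is the special awk-style form: it ignores leading whitespace,
    # treats any run of whitespace as one separator, and drops trailing
    # empty fields.
    my @fields = split ' ', $line;
    return @fields >= 2 ? $fields[-2] : undef;
}

my $line = "  ACADVL, Homo sapiensacyl-Coenzyme A dehydrogenase, "
         . "very long chain NP_000009.1655 aa  \n";
print second_last_field($line), "\n";
```

The choice of `split ' '` matters here: with `split / /`, two trailing spaces followed by a newline would leave an empty string at index -2 instead of the accession.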

      Thank you for your help, guys.

Re: Regex / "Sophisticated" End of Line
by aquarium (Curate) on Oct 27, 2010 at 02:46 UTC
    It's easier to read, modify, and extend if you separate your matching requirements rather than writing one convoluted lookbehind or lookahead expression. Untested:
    if ($line =~ /Homo sapiens/) {
        if ($line =~ /(NP_|XP_)([^ ]+)/) {
            # $1 is the prefix, $2 is the rest of the identifier
            $matched = $1 . $2;
            # ...do other stuff...
        }
    }
    or similar
    the hardest line to type correctly is: stty erase ^H
Re: Regex / "Sophisticated" End of Line
by morgon (Priest) on Oct 27, 2010 at 01:22 UTC
    Not sure if I understand correctly what you are trying to do, but I assume that you want to extract the words that start with either "NP_" or "XP_" from all lines that contain "Homo sapiens".

    If that is what you want, it is easy:

    while (my $line = <$fh>) {    # assuming a loop through your file here
        # \S* rather than \w* so the capture continues past the dot
        my ($np_xp) = $line =~ /Homo sapiens.*\s((?:NP|XP)_\S*)/;
        next unless $np_xp;
        print "$np_xp for a great ape found at line $.\n";
    }
Re: Regex / "Sophisticated" End of Line
by umasuresh (Hermit) on Oct 27, 2010 at 01:51 UTC
    Try something like this on the Linux or Cygwin command line:
    egrep "Homo sapiens" file_name | egrep "NP|XP"
    Note: Not tested!