Hi,
I would like to ask for your help with the following regex. I tried a few matching expressions; however, none of them produced the desired output.
I have a text file full of lines as follows:
  ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_001120800.1425 aa
  ACADM, Pan troglodytesacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain XP_524741.2453 aa
  ACADM, Canis lupus familiarisacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain XP_547328.2444 aa
  ACADM, Bos taurusacyl-coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_001068703.1421 aa
  Acadm, Mus musculusacyl-Coenzyme A dehydrogenase, medium chain NP_031408.1421 aa
  Acadm, Rattus norvegicusacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_058682.14
  ACADVL, Homo sapiensacyl-Coenzyme A dehydrogenase, very long chain NP_000009.1655 aa
  ACADVL, Canis lupus familiarisacyl-Coenzyme A dehydrogenase, very long chain XP_546581.2650 aa
  ACADVL, Bos taurusacyl-coenzyme A dehydrogenase, very long chain NP_776919.1655 aa
  Acadvl, Mus musculusacyl-Coenzyme A dehydrogenase, very long chain NP_059062.1656 aa
  Acadvl, Rattus norvegicusacyl-Coenzyme A dehydrogenase, very long chain NP_037023.1655 aa
  acadvl, Danio rerioacyl-Coenzyme A dehydrogenase, very long chain NP_997776.1659 aa
  CG7461, Drosophila melanogasterCG7461 NP_611409.1655 aa
  AgaP_AGAP008769, Anopheles gambiaeAGAP008769-PA XP_314893.4624 aa
  acdh-12, Caenorhabditis elegansAcyl CoA DeHydrogenase NP_001022062.1613 aa
  ALAS2, Homo sapiensaminolevulinate, delta-, synthase 2 NP_000023.2587 aa
  ALAS2, Pan troglodytesaminolevulinate, delta-, synthase 2 XP_001147099.1583 aa
  ALAS2, Canis lupus familiarisaminolevulinate, delta-, synthase 2 XP_548619.2587 aa
  ALAS2, Bos taurusaminolevulinate, delta-, synthase 2 NP_001030275.1587 aa
  Alas2, Mus musculusaminolevulinic acid synthase 2, erythroid NP_033783.1587 aa
  Alas2, Rattus norvegicusaminolevulinate, delta-, synthase 2 XP_001062452.1267 aa
  alas2, Danio rerioaminolevulinate, delta-, synthetase 2 NP_571757.1583 aa
  hem1, Schizosaccharomyces pombe5-aminolevulinate synthase NP_594388.1558 aa
  HEM1, Saccharomyces cerevisiae5-aminolevulinate synthase, catalyzes the first step in the heme biosynthetic pathway; an N-terminal signal sequence is required for localization to the mitochondrial matrix; expression is regulated by Hap2p-Hap3p NP_010518.1548 aa
  HEM1_KLULA, Kluyveromyces lactishypothetical protein XP_452875.1570 aa
  AGOS_ABL104C, Eremothecium gossypiiABL104Cp NP_982843.1556 aa
  MGG_06446, Magnaporthe grisea5-aminolevulinate synthase, mitochondrial precursor XP_369931.2615 aa
  NCU06189.1, Neurospora crassahypothetical protein ( (AB071862) 5-aminolevulinate synthase [Gibberella fujikuroi] ) XP_326044.1544 aa
I need to extract the NP_ / XP_ combination for Homo sapiens only. I reasoned that I should first make sure that the selected line includes Homo sapiens, and then pick up the combination by matching backwards from the end of the line. So I came up with the following expression:
  elsif($line =~ /^\S+\s+\S+\s+Homo sapiens(\S+)\s\S+$/)
The part between Homo sapiens and the combination varies in length, particularly in the number of words. Therefore, no single fixed pattern would fit all the possible in-between text.
I thought this expression would work because I guessed it would match from the beginning of the line up to the end of "sapiens", and from the end of the line back to the characters right after "sapiens". However, it does not work.
I would be very happy if you could help me use $ to capture not the last string on the line but a particular string, matching backwards from the end.
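For illustration, here is a minimal sketch of one way the "match backwards from the end" idea might be written. It rests on several assumptions that are not confirmed by the post: each record sits on its own line, the wanted token starts with NP_ or XP_, and the line ends in "aa". The helper name human_accession is hypothetical. In the sample, the accession version appears fused to the length digits (for example NP_000009.1655), so the sketch captures the token exactly as it appears and does not try to split it. It also drops the ^ anchor, because the sample lines begin with whitespace and have only one field before the species name, and it lets a greedy .* absorb the variable-length description.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the NP_/XP_
# token of a Homo sapiens record as it appears in the line, or undef.
sub human_accession {
    my ($line) = @_;
    # "Homo sapiens" must be present; the greedy .* swallows the description
    # of any length, and \s+aa\s*$ pins the captured token to the line's end.
    if ($line =~ /Homo sapiens.*\s([NX]P_\d+\.\d+)\s+aa\s*$/) {
        return $1;
    }
    return;    # no match: undef in scalar context
}

# Sample records taken from the data shown in the post.
my @sample = (
    '  ACADM, Homo sapiensacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain NP_001120800.1425 aa',
    '  ACADM, Pan troglodytesacyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain XP_524741.2453 aa',
    '  ALAS2, Homo sapiensaminolevulinate, delta-, synthase 2 NP_000023.2587 aa',
);

for my $line (@sample) {
    my $acc = human_accession($line);
    print "$acc\n" if defined $acc;
}
```

With the three sample lines, only the two Homo sapiens records yield a token; the Pan troglodytes line is skipped because the species check is part of the same pattern.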
Thank you
In reply to Regex / "Sophisticated" End of Line by nofutur45