in reply to Re: Regular Expression Hiccup
in thread Regular Expression Hiccup

Sorry for the late reply. Here's the block of text I am searching.
--------- EFETCH RESULT(1..3): [
1. Methods Mol Biol. 2014;1140:315-23. doi: 10.1007/978-1-4939-0354-2_23.
Screening Ligands by X-ray crystallography.
Davies DR(1).
Author information:
(1)Emerald Bio, 7869 NE Day Road W, Bainbridge Island, WA, 98110, USA, ddavies@embios.com.
X-ray crystallography is an invaluable technique in structure-based drug discovery, including fragment-based drug discovery, because it is the only technique that can provide a complete three dimensional readout of the interaction between the small molecule and its macromolecular target. X-ray diffraction (XRD) techniques can be employed as the sole method for conducting a screen of a fragment library, or it can be employed as the final technique in a screening campaign to confirm putative "hit" compounds identified by a variety of biochemical and/or biophysical screening techniques. Both approaches require an efficient technique to prepare dozens to hundreds of crystals for data collection, and a reproducible way to deliver ligands to the crystal. Here, a general method for screening cocktails of fragments is described. In cases where X-ray crystallography is employed as a method to verify putative hits, the cocktails of fragments described below would simply be replaced with single fragment solutions.
PMID: 24590727

I have a list of 79 blocks of text that are written as if for the reference section of a paper, in APA style, I think. I want to extract the PubMed IDs from the files. I found a way to get the abstract from PubMed, which contains the ID. The problem is that each result comes with similar hits, so I need to make certain I have the correct ID; hence the title search. I have the titles in a hash keyed by the number they had in the file. The original plan was to cycle through the hash, search the abstracts for each title to find the correct abstract, and then extract the ID from it.

The search for "Screening Ligands by X-ray crystallography" doesn't work, though: no match. "Screening Ligands by" does match. I thought the issue might be the "-". Anything before it works fine, and anything after it works too, but "X-ray" simply fails.

Re^3: Regular Expression Hiccup
by AnomalousMonk (Archbishop) on May 04, 2015 at 21:46 UTC
    ... "X-ray" simply fails.

    What others have written remains true: a '-' (dash; hyphen) character is not the same as (and will not match) a ' ' (space) character. (Also pay attention to hdb's reply below (update: among others) concerning case-insensitive matching.) Consider these matches to get a feel for regex matching:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my @X = ('aX-raya', 'aX raya', 'aXraya', 'axraya', 'X Ray', 'aXYraya');
     ;;
     print 'case-sensitive matching:';
     for my $s (@X) {
       printf qq{'$s' -> };
       if ($s =~ m{ X [- ]? ray }xms) { print 'match'; } else { print 'NO match'; }
       }
     ;;
     print 'case-INsensitive matching:';
     for my $s (@X) {
       printf qq{'$s' -> };
       if ($s =~ m{ (?i) X [- ]? ray }xms) { print 'match'; } else { print 'NO match'; }
       }
    "
    case-sensitive matching:
    'aX-raya' -> match
    'aX raya' -> match
    'aXraya' -> match
    'axraya' -> NO match
    'X Ray' -> NO match
    'aXYraya' -> NO match
    case-INsensitive matching:
    'aX-raya' -> match
    'aX raya' -> match
    'aXraya' -> match
    'axraya' -> match
    'X Ray' -> match
    'aXYraya' -> NO match

    Contrast the use of  [- ]? with  . (dot) as a placeholder to match anything at all:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my @X = ('aX-raya', 'aX raya', 'aXraya', 'axraya', 'X Ray', 'aXYraya');
     ;;
     print 'placeholder matching:';
     for my $s (@X) {
       printf qq{'$s' -> };
       if ($s =~ m{ X . ray }xms) { print 'match'; } else { print 'NO match'; }
       }
    "
    placeholder matching:
    'aX-raya' -> match
    'aX raya' -> match
    'aXraya' -> NO match
    'axraya' -> NO match
    'X Ray' -> NO match
    'aXYraya' -> match

    In general, it's better to match exactly what you want.
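
For a literal title, one way to "match exactly what you want" is sketched below. It assumes the title should be treated as plain text and that case differences (such as "Crystallography" vs. "crystallography") should be ignored. The sub name title_matches is made up for illustration; \Q...\E and the /i modifier are standard Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $title occurs literally in $text, ignoring case.
sub title_matches {
    my ($text, $title) = @_;
    # \Q...\E escapes regex metacharacters in the title, such as . ( ) and ?
    return $text =~ /\Q$title\E/i ? 1 : 0;
}

my $text = 'Screening Ligands by X-ray crystallography. Davies DR(1).';

for my $title ('Screening Ligands by X-ray Crystallography',
               'Screening Ligands by X ray crystallography') {
    printf "'%s' -> %s\n", $title,
        title_matches($text, $title) ? 'match' : 'NO match';
}
```

The first title matches despite the capital "C"; the second does not, because a space still does not match a hyphen.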

    Please see the Perl regex documentation perlre, perlretut, perlrequick, and various regex tutorials on Perlmonks.


    Give a man a fish:  <%-(-(-(-<

Re^3: Regular Expression Hiccup
by jeffa (Bishop) on May 04, 2015 at 21:17 UTC

    The following works for me:

    use strict;
    use warnings;
    use feature 'say';

    my $text = do{ local $/; <DATA> };

    say $text =~ /Screening Ligands by X-ray crystallography/ ? 'MATCH!' : 'NOPE';

    __DATA__
    --------- EFETCH RESULT(1..3): [
    1. Methods Mol Biol. 2014;1140:315-23. doi: 10.1007/978-1-4939-0354-2_23.
    Screening Ligands by X-ray crystallography.
    Davies DR(1).
    Author information:
    <edited for brevity>

     

    Did you notice the lowercase "c" in "crystallography"? Regex matching is case-sensitive by default. However, you say that you "want to extract the PubMed IDs from the files." If that is the case, then maybe you should just extract the number next to each occurrence of "PMID: ", like so:

    use strict;
    use warnings;
    use feature 'say';

    my $text = do{ local $/; <DATA> };

    my @pmids = $text =~ /PMID:\s+(\d+)/gm;
    say for @pmids;

    __DATA__
    --------- EFETCH RESULT(1..3): [
    Some title
    some author
    Author information:
    blah blah blah
    PMID: 24590727
    --------- EFETCH RESULT(1..3): [
    Some title
    some author
    Author information:
    blah blah blah
    PMID: 45867737
    --------- EFETCH RESULT(1..3): [
    Some title
    some author
    Author information:
    blah blah blah
    PMID: 52497072
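
If the ID has to belong to one particular title, the two ideas can be combined. The following is only a sketch under assumptions: every record starts with an "EFETCH RESULT" separator line as in the sample data, and the title may be wrapped across lines in the fetched text. The sub name pmid_for_title and the sample records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the PMID of the first record that contains
# $title (literal, case-insensitive), or undef if no record matches.
sub pmid_for_title {
    my ($text, $title) = @_;
    # Assumption: each record begins with an "EFETCH RESULT" separator line.
    for my $record (split /^-+ EFETCH RESULT.*$/m, $text) {
        # Collapse newlines and runs of spaces so a wrapped title still matches.
        (my $flat = $record) =~ s/\s+/ /g;
        next unless $flat =~ /\Q$title\E/i;
        return $1 if $flat =~ /PMID:\s+(\d+)/;
    }
    return;
}

# Made-up sample records in the same shape as the data above.
my $text = <<'END';
--------- EFETCH RESULT(1..3): [
1. Some journal. Some title. Some author.
PMID: 45867737
--------- EFETCH RESULT(1..3): [
2. Methods Mol Biol. Screening Ligands by X-ray
crystallography. Davies DR(1).
PMID: 24590727
END

my $pmid = pmid_for_title($text, 'Screening Ligands by X-ray Crystallography');
print defined $pmid ? "PMID: $pmid\n" : "no match\n";
```

With the sample records this prints "PMID: 24590727". Real data may need a different separator pattern.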

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)