in reply to Regular Expression Hiccup

Literal strings in regular expressions must match character for character. The "-" is just as significant as any other character; it isn't discarded or ignored "just because". A "-" is different from " ", so there's no match.

Perhaps more importantly, what are you trying to accomplish? If you already know what you are looking for, verbatim, then there is no reason to search via regex, or even a string comparison. Just place your titles in a hash, a database, or whatever. Indices and searches are for looking up records by subfield or substring, i.e. smaller parts of the full record.
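To illustrate the hash suggestion, here is a minimal sketch. The hash name %position_of, the helper lookup_title, and the sample titles are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical data: each title mapped to its position in the reference list.
my %position_of = (
    'Screening Ligands by X-ray crystallography' => 1,
    'Some other title'                           => 2,
);

# Exact lookup by full title: no regex and no comparison loop needed.
# Returns the stored position, or undef when the title is not a key.
sub lookup_title {
    my ($title) = @_;
    return exists $position_of{$title} ? $position_of{$title} : undef;
}

for my $title ('Screening Ligands by X-ray crystallography',
               'Screening Ligands by X ray crystallography') {
    my $pos = lookup_title($title);
    print "'$title' -> ", (defined $pos ? "position $pos" : 'not found'), "\n";
}
```

The second title differs only by a space where the hyphen should be, and it is not found. A hash key has to match exactly, just like a literal string in a regex.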

Re^2: Regular Expression Hiccup
by Mindsword (Initiate) on May 04, 2015 at 20:21 UTC
    Sorry for the late reply. Here's the block of text I am searching.
    --------- EFETCH RESULT(1..3): [ 1. Methods Mol Biol. 2014;1140:315-23. doi: 10.1007/978-1-4939-0354-2_23. Screening Ligands by X-ray crystallography. Davies DR(1). Author information: (1)Emerald Bio, 7869 NE Day Road W, Bainbridge Island, WA, 98110, USA, ddavies@embios.com. X-ray crystallography is an invaluable technique in structure-based drug discovery, including fragment-based drug discovery, because it is the only technique that can provide a complete three dimensional readout of the interaction between the small molecule and its macromolecular target. X-ray diffraction (XRD) techniques can be employed as the sole method for conducting a screen of a fragment library, or it can be employed as the final technique in a screening campaign to confirm putative "hit" compounds identified by a variety of biochemical and/or biophysical screening techniques. Both approaches require an efficient technique to prepare dozens to hundreds of crystals for data collection, and a reproducible way to deliver ligands to the crystal. Here, a general method for screening cocktails of fragments is described. In cases where X-ray crystallography is employed as a method to verify putative hits, the cocktails of fragments described below would simply be replaced with single fragment solutions. PMID: 24590727
    I have a list of 79 blocks of text that are written as if for the reference section of a paper, in APA style I think. I want to extract the PubMed IDs from the files. I found a way to get the abstract from PubMed, which contains the ID. The problem is that it comes with similar hits, so I need to make certain I have the correct ID. Thus, a title search. I have the titles in a hash that links each one to its number in the file. The original plan was to cycle through the hash, searching these abstracts to find the correct abstract, then extract the ID from it.

    The search "Screening Ligands by X-ray crystallography" doesn't work though. No match. "Screening Ligands by" does. I thought the issue might be the "-": anything before it works fine, and anything after it works too, but "X-ray" simply fails.

      ... "X-ray" simply fails.

      What others have written remains true: a '-' (dash; hyphen) character is not the same as (and will not match) a ' ' (space) character. (Also pay attention to hdb's reply below (update: among others) concerning case-insensitive matching.) Consider these matches to get a feel for regex matching:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my @X = ('aX-raya', 'aX raya', 'aXraya', 'axraya', 'X Ray', 'aXYraya');
       ;;
       print 'case-sensitive matching:';
       for my $s (@X) {
         printf qq{'$s' -> };
         if ($s =~ m{ X [- ]? ray }xms) { print 'match'; } else { print 'NO match'; }
         }
       ;;
       print 'case-INsensitive matching:';
       for my $s (@X) {
         printf qq{'$s' -> };
         if ($s =~ m{ (?i) X [- ]? ray }xms) { print 'match'; } else { print 'NO match'; }
         }
      "
      case-sensitive matching:
      'aX-raya' -> match
      'aX raya' -> match
      'aXraya' -> match
      'axraya' -> NO match
      'X Ray' -> NO match
      'aXYraya' -> NO match
      case-INsensitive matching:
      'aX-raya' -> match
      'aX raya' -> match
      'aXraya' -> match
      'axraya' -> match
      'X Ray' -> match
      'aXYraya' -> NO match

      Contrast the use of  [- ]? with  . (dot) as a placeholder to match anything at all:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my @X = ('aX-raya', 'aX raya', 'aXraya', 'axraya', 'X Ray', 'aXYraya');
       ;;
       print 'placeholder matching:';
       for my $s (@X) {
         printf qq{'$s' -> };
         if ($s =~ m{ X . ray }xms) { print 'match'; } else { print 'NO match'; }
         }
      "
      placeholder matching:
      'aX-raya' -> match
      'aX raya' -> match
      'aXraya' -> NO match
      'axraya' -> NO match
      'X Ray' -> NO match
      'aXYraya' -> match

      In general, it's better to match exactly what you want.
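One way to "match exactly what you want" when the search string comes from data is to quote it with \Q...\E so every character is taken literally, and to add /i if case may differ. The sketch below assumes that approach; the helper name title_regex and the sample strings are only illustrative.

```perl
use strict;
use warnings;

# Build a pattern that matches a title literally, ignoring case.
# \Q...\E applies quotemeta, so characters such as '-', '.' or '('
# in the title are matched as themselves and not as regex syntax.
sub title_regex {
    my ($title) = @_;
    return qr/\Q$title\E/i;
}

my $re = title_regex('Screening Ligands by X-ray Crystallography');

for my $text ('2_23. Screening Ligands by X-ray crystallography. Davies DR(1).',
              '2_23. Screening Ligands by X ray crystallography. Davies DR(1).') {
    print "'$text' -> ", ($text =~ $re ? 'match' : 'NO match'), "\n";
}
```

The first string matches despite the different case of the "c". The second does not, because a space is still not a hyphen.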

      Please see the Perl regex documentation perlre, perlretut, perlrequick, and various regex tutorials on Perlmonks.


      Give a man a fish:  <%-(-(-(-<

      The following works for me:

      use strict;
      use warnings;
      use feature 'say';

      my $text = do{ local $/; <DATA> };
      say $text =~ /Screening Ligands by X-ray crystallography/ ? 'MATCH!': 'NOPE';

      __DATA__
      --------- EFETCH RESULT(1..3): [
      1. Methods Mol Biol. 2014;1140:315-23. doi: 10.1007/978-1-4939-0354-2_23.
      Screening Ligands by X-ray crystallography.
      Davies DR(1).
      Author information: <edited for brevity>

       

      Did you notice the lower-case c? However, you say that you "want to extract the PubMed IDs from the files." If that is the case, then maybe you should just extract the number next to each occurrence of "PMID: ", like so:

      use strict;
      use warnings;
      use feature 'say';

      my $text = do{ local $/; <DATA> };
      my @pmids = $text =~ /PMID:\s+(\d+)/gm;
      say for @pmids;

      __DATA__
      --------- EFETCH RESULT(1..3): [
      Some title
      some author
      Author information: blah blah blah
      PMID: 24590727
      --------- EFETCH RESULT(1..3): [
      Some title
      some author
      Author information: blah blah blah
      PMID: 45867737
      --------- EFETCH RESULT(1..3): [
      Some title
      some author
      Author information: blah blah blah
      PMID: 52497072
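If the fetch returns several similar hits and only one is wanted, the two ideas can be combined: split the text into records, pick the record that contains the title, and take its PMID. This is only a sketch. It assumes every record starts with an "EFETCH RESULT" separator on its own line, and the function name pmid_for_title and the sample data are invented for illustration.

```perl
use strict;
use warnings;

# Return the PMID of the first record whose text contains $title
# (literal, case-insensitive match), or undef if no record matches.
sub pmid_for_title {
    my ($text, $title) = @_;
    # Assumed format: a line of dashes followed by "EFETCH RESULT..."
    # begins each record.
    for my $record (split /^-+ EFETCH RESULT[^\n]*\n/m, $text) {
        next unless $record =~ /\Q$title\E/i;
        return $1 if $record =~ /PMID:\s+(\d+)/;
    }
    return;
}

my $text = <<'END';
--------- EFETCH RESULT(1..3): [
Some title
PMID: 45867737
--------- EFETCH RESULT(1..3): [
Screening Ligands by X-ray crystallography.
PMID: 24590727
END

my $pmid = pmid_for_title($text, 'Screening Ligands by X-ray Crystallography');
print defined $pmid ? "PMID: $pmid\n" : "no matching record\n";
```

If a title wraps across lines in the real data, the literal match will miss it, so whitespace may need normalizing first.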

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)