Mindsword has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have been working on some code for work and found an odd glitch in my regular expression. I am looking through several blocks of text for the titles of some documents. The first title is "Screening Ligands by X-ray Crystallography." The text clearly has Screening Ligands by X-ray Crystallography in it. However, my regular expression (below) does not pick it up. I have removed the 40 lines of code before this for brevity's sake.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw(say);
    ...
    ...
    if ($text =~ /Screening Ligands by X ray Crystallography/) {
        say "MATCH!";
    }
    else {
        say "NOPE!";
    }
    }
However, if I do the code below, I do get a match.
    if ($text =~ /Screening Ligands by/) {
        say "MATCH!";
    }
    else {
        say "NOPE!";
    }
    }
I have narrowed down the issue to the "-" I think. However, I can't search for the titles without the "-" as maybe 2/3rds of them have it. Any thoughts on how to get around this?

EDIT: Based on what you all wrote, I began looking at a few of the others that were failing and have proven it's not the "-". It's still not working correctly, but I at least know it's not what I thought. I may even know what the issue is, but I'll need to run some tests. I think it may be that the original titles don't match what's actually in the document.

Replies are listed 'Best First'.
Re: Regular Expression Hiccup
by hdb (Monsignor) on May 04, 2015 at 20:48 UTC

    There are also different capitalizations of "crystallography" in this thread. Have you tried matching case-insensitive?

        if ($text =~ /Screening Ligands by X-ray Crystallography/i) {
            say "MATCH!";
        }
        else {
            say "NOPE!";
        }
Re: Regular Expression Hiccup
by Corion (Patriarch) on May 04, 2015 at 19:40 UTC
    /X ray/

    will never match

    "X-ray"

    If you want a placeholder, the dot (".") is the placeholder character in regular expressions (perlre).

Re: Regular Expression Hiccup
by GotToBTru (Prior) on May 04, 2015 at 20:22 UTC

    Some titles have "X-ray" and some have "X ray"?

        if ($text =~ /Screening Ligands by X[ -]ray Crystallography/) {
            say "MATCH!";
        }
        else {
            say "NOPE!";
        }

    The square brackets contain a list of characters; if any one of them matches $text at that spot, the match succeeds.

    Dum Spiro Spero
Re: Regular Expression Hiccup
by graff (Chancellor) on May 05, 2015 at 04:53 UTC
    Depending on where the data is coming from (and if you happen to be dealing with utf-8 encoded text), it's possible that you are dealing with a "hyphen-like" character that is not the traditional ASCII hyphen (\x2d) - it could be any of [\x{2010}\x{2011}\x{2013}\x{2014}] ("hyphen", "non-breaking hyphen", "en dash", "em dash").

    In that case, provided that you are correctly ingesting the text data as utf8, using the regex "dot" wildcard (and ignoring case distinctions) is The Right Thing:  /x.ray/i
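    If the dot feels too permissive, a character class limited to the dash characters listed above is one alternative. The following is only a sketch: the helper name has_xray and the sample strings are made up for illustration, and it assumes the text has already been decoded to Perl characters (for example by opening the file with an ':encoding(UTF-8)' layer).

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';   # the samples below contain non-ASCII characters

# ASCII hyphen plus the four Unicode dash characters named above.
my $DASH = qr/[\-\x{2010}\x{2011}\x{2013}\x{2014}]/;

# Hypothetical helper: true if $text contains "x", one dash character, then "ray",
# ignoring case. Assumes $text holds decoded characters, not raw UTF-8 bytes.
sub has_xray {
    my ($text) = @_;
    return $text =~ /x${DASH}ray/i ? 1 : 0;
}

for my $sample ("X-ray", "X\x{2011}ray", "X\x{2013}Ray", "X ray", "XYray") {
    printf "%-8s %s\n", $sample, has_xray($sample) ? 'match' : 'NO match';
}
```

    Unlike the dot, this rejects "X ray" and "XYray"; add a space to the class if the space-separated spelling should also count.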

Re: Regular Expression Hiccup
by Laurent_R (Canon) on May 04, 2015 at 20:18 UTC
    The dash (or hyphen) has no special meaning in general regular expressions (except in character classes). So there is absolutely no problem with:
        if ($text =~ /Screening Ligands by X-ray Crystallography/) {
            say "MATCH!";
        }
        else {
            say "NOPE!";
        }
    Please note that the code you showed has an extra closing curly brace.

    I also agree with Anonymous Monk that a regex might not be the ideal solution for your problem. The index built-in is likely to work faster, and the use of a hash might also be considered, but we don't know enough about what you are trying to do to be able to help you more precisely.
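    As a rough sketch of the index approach: the helper name contains_title and the sample strings are assumptions made for illustration, and both sides are lower-cased so that "Crystallography" and "crystallography" compare equal.

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive literal substring test using index().
# index() returns -1 when the substring is not found.
sub contains_title {
    my ($text, $title) = @_;
    return index(lc $text, lc $title) >= 0 ? 1 : 0;
}

my $text  = 'Screening Ligands by X-ray crystallography. Davies DR(1).';
my $title = 'Screening Ligands by X-ray Crystallography';

print contains_title($text, $title) ? "MATCH!\n" : "NOPE!\n";
```

    Because index compares literal characters, nothing in the title needs escaping, but a hyphen-like character other than the ASCII hyphen in the text would still cause a miss.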

    Je suis Charlie.
Re: Regular Expression Hiccup
by Anonymous Monk on May 04, 2015 at 19:50 UTC

    Literal strings in regular expressions must match character-for-character. The "-" is just as significant as any other character, it isn't discarded or ignored "just because". A "-" is different from " ", so there's no match.

    Perhaps more importantly, what is it you are trying to accomplish? If you already know what you are looking for, verbatim, then there is no reason to search via regex, or even string cmp. Just place your titles in some hash structure, or a database, or whatever. Indices and searches are for looking up records by subfield or substring, i.e. smaller parts of the full record.

      Sorry for the late reply. Here's the block of text I am searching.
          --------- EFETCH RESULT(1..3):
          [
          1. Methods Mol Biol. 2014;1140:315-23. doi: 10.1007/978-1-4939-0354-2_23.
          Screening Ligands by X-ray crystallography.
          Davies DR(1).
          Author information:
          (1)Emerald Bio, 7869 NE Day Road W, Bainbridge Island, WA, 98110, USA,
          ddavies@embios.com.
          X-ray crystallography is an invaluable technique in structure-based drug
          discovery, including fragment-based drug discovery, because it is the only
          technique that can provide a complete three dimensional readout of the
          interaction between the small molecule and its macromolecular target. X-ray
          diffraction (XRD) techniques can be employed as the sole method for conducting
          a screen of a fragment library, or it can be employed as the final technique
          in a screening campaign to confirm putative "hit" compounds identified by a
          variety of biochemical and/or biophysical screening techniques. Both approaches
          require an efficient technique to prepare dozens to hundreds of crystals for
          data collection, and a reproducible way to deliver ligands to the crystal.
          Here, a general method for screening cocktails of fragments is described. In
          cases where X-ray crystallography is employed as a method to verify putative
          hits, the cocktails of fragments described below would simply be replaced with
          single fragment solutions.
          PMID: 24590727
      I have a list of 79 blocks of text that are written as if for the reference section of a paper, in APA style I think. I want to extract the PubMed IDs for them. I found a way to get the abstract from PubMed, which contains the ID. The problem is that it comes with similar hits, so I need to make certain I have the correct ID. Thus, a title search. I have the titles in a hash keyed to the number they had in the file. The original plan was to cycle through the hash, searching these abstracts to find the correct one, then extract the ID from it.

      The search "Screening Ligands by X-ray crystallography" doesn't work though. No match. "Screening Ligands by" does. I thought the issue might be the "-": anything before it works fine, and anything after it works too, but "X-ray" simply fails.

        ... "X-ray" simply fails.

        What others have written remains true: a '-' (dash; hyphen) character is not the same as (and will not match) a ' ' (space) character. (Also pay attention to hdb's reply below (update: among others) concerning case-insensitive matching.) Consider these matches to get a feel for regex matching:

            c:\@Work\Perl\monks>perl -wMstrict -le
            "my @X = ('aX-raya', 'aX raya', 'aXraya', 'axraya', 'X Ray', 'aXYraya');
             ;;
             print 'case-sensitive matching:';
             for my $s (@X) {
               printf qq{'$s' -> };
               if ($s =~ m{ X [- ]? ray }xms) {
                 print 'match';
                 }
               else {
                 print 'NO match';
                 }
               }
             ;;
             print 'case-INsensitive matching:';
             for my $s (@X) {
               printf qq{'$s' -> };
               if ($s =~ m{ (?i) X [- ]? ray }xms) {
                 print 'match';
                 }
               else {
                 print 'NO match';
                 }
               }
            "
            case-sensitive matching:
            'aX-raya' -> match
            'aX raya' -> match
            'aXraya' -> match
            'axraya' -> NO match
            'X Ray' -> NO match
            'aXYraya' -> NO match
            case-INsensitive matching:
            'aX-raya' -> match
            'aX raya' -> match
            'aXraya' -> match
            'axraya' -> match
            'X Ray' -> match
            'aXYraya' -> NO match

        Contrast the use of  [- ]? with  . (dot) as a placeholder to match anything at all:

            c:\@Work\Perl\monks>perl -wMstrict -le
            "my @X = ('aX-raya', 'aX raya', 'aXraya', 'axraya', 'X Ray', 'aXYraya');
             ;;
             print 'placeholder matching:';
             for my $s (@X) {
               printf qq{'$s' -> };
               if ($s =~ m{ X . ray }xms) {
                 print 'match';
                 }
               else {
                 print 'NO match';
                 }
               }
            "
            placeholder matching:
            'aX-raya' -> match
            'aX raya' -> match
            'aXraya' -> NO match
            'axraya' -> NO match
            'X Ray' -> NO match
            'aXYraya' -> match

        In general, it's better to match exactly what you want.

        Please see the Perl regex documentation perlre, perlretut, perlrequick, and various regex tutorials on Perlmonks.


        Give a man a fish:  <%-(-(-(-<

        The following works for me:

            use strict;
            use warnings;
            use feature 'say';

            my $text = do{ local $/; <DATA> };
            say $text =~ /Screening Ligands by X-ray crystallography/ ? 'MATCH!': 'NOPE';

            __DATA__
            --------- EFETCH RESULT(1..3):
            [
            1. Methods Mol Biol. 2014;1140:315-23. doi: 10.1007/978-1-4939-0354-2_23.
            Screening Ligands by X-ray crystallography.
            Davies DR(1).
            Author information:
            <edited for brevity>

         

        Did you notice the lower case c? However, you say that you "want to extract the PubMed IDs from the files." If that is the case, then maybe you should just extract the number next to each occurrence of "PMID: " like so:

            use strict;
            use warnings;
            use feature 'say';

            my $text = do{ local $/; <DATA> };
            my @pmids = $text =~ /PMID:\s+(\d+)/gm;
            say for @pmids;

            __DATA__
            --------- EFETCH RESULT(1..3):
            [
            Some title
            some author
            Author information:
            blah blah blah
            PMID: 24590727
            --------- EFETCH RESULT(1..3):
            [
            Some title
            some author
            Author information:
            blah blah blah
            PMID: 45867737
            --------- EFETCH RESULT(1..3):
            [
            Some title
            some author
            Author information:
            blah blah blah
            PMID: 52497072
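        If the title check is still needed because each fetch returns several similar hits, the two ideas can be combined: split the result into records, keep the record whose text contains the title, and take the PMID from that record only. This is just a sketch under assumptions. The helper name pmid_for_title is made up, the sample record with PMID 11111111 is invented filler, and it assumes every record ends with a "PMID: <digits>" line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the PMID of the first record containing $title
# (literal, case-insensitive match), or undef if no record contains it.
sub pmid_for_title {
    my ($text, $title) = @_;

    # Assumption: each record runs up to and including its "PMID: <digits>" line.
    for my $record ($text =~ /(.*?PMID:\s+\d+)/gs) {
        # Collapse line breaks so a title wrapped across lines still matches.
        (my $flat = $record) =~ s/\s+/ /g;

        # \Q...\E makes every character of the title literal.
        next unless $flat =~ /\Q$title\E/i;
        return $1 if $flat =~ /PMID:\s+(\d+)/;
    }
    return;
}

# Made-up sample data; only the second record comes from the thread.
my $text = <<'END';
1. Some other title. Author A.
PMID: 11111111
2. Screening Ligands by X-ray
crystallography. Davies DR(1).
PMID: 24590727
END

my $pmid = pmid_for_title($text, 'Screening Ligands by X-ray Crystallography');
print defined $pmid ? "PMID: $pmid\n" : "no match\n";
```

        The /i flag covers the "Crystallography" versus "crystallography" difference; it does nothing for a hyphen-like character that is not the ASCII hyphen.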

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)