Regular Expression Hiccup

Mindsword has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regular Expression Hiccup by hdb (Monsignor) on May 04, 2015 at 20:48 UTC
There are also different capitalizations of "crystallography" in this thread. Have you tried matching case-insensitively? `if ($text =~ /Screening Ligands by X-ray Crystallography/i) { say "MATCH!"; } else { say "NOPE!"; }`
Re: Regular Expression Hiccup by Corion (Patriarch) on May 04, 2015 at 19:40 UTC
`/X ray/` will never match `"X-ray"`. If you want a placeholder, the dot (`.`) is the placeholder character in regular expressions (see perlre).
Re: Regular Expression Hiccup by GotToBTru (Prior) on May 04, 2015 at 20:22 UTC
Some titles have "X-ray" and some have "X ray"? `if ($text =~ /Screening Ligands by X[ -]ray Crystallography/) { say "MATCH!"; } else { say "NOPE!"; }` The square brackets contain a list of characters; if any one of them matches $text at that spot, the match succeeds. Dum Spiro Spero
Re: Regular Expression Hiccup by graff (Chancellor) on May 05, 2015 at 04:53 UTC
Depending on where the data is coming from (and if you happen to be dealing with utf-8 encoded text), it's possible that you are dealing with a "hyphen-like" character that is not the traditional ASCII hyphen (`\x2d`). It could be any of `[\x{2010}\x{2011}\x{2013}\x{2014}]` ("hyphen", "non-breaking hyphen", "en dash", "em dash"). In that case, provided that you are correctly ingesting the text data as utf8, using the regex "dot" wildcard (and ignoring case distinctions) is The Right Thing: `/x.ray/i`
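If the dot feels too permissive, one alternative is to name the dash-like characters explicitly. The following is a minimal sketch, not part of the original reply: it assumes the text has already been decoded into characters (for a file, something like `open my $fh, '<:encoding(UTF-8)', $file`), and it uses the Unicode property `\p{Pd}` (Dash_Punctuation), which covers the ASCII hyphen as well as the four code points listed above. The sub name `has_xray` and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Returns 1 if $text contains "x", then one dash-like character or
# whitespace character, then "ray" (case-insensitively); otherwise 0.
# \p{Pd} is the Unicode Dash_Punctuation class; \s allows "X ray".
sub has_xray {
    my ($text) = @_;
    return $text =~ /x[\p{Pd}\s]ray/i ? 1 : 0;
}

# Hypothetical sample strings; \x{...} builds already-decoded characters.
my @samples = (
    "X-ray",          # ASCII hyphen-minus
    "X\x{2010}ray",   # HYPHEN
    "X\x{2011}ray",   # NON-BREAKING HYPHEN
    "X\x{2013}ray",   # EN DASH
    "X\x{2014}ray",   # EM DASH
    "X ray",          # plain space
    "XYray",          # a letter in between: should not match
);

binmode STDOUT, ':encoding(UTF-8)';
for my $s (@samples) {
    printf "%-8s %s\n", $s, has_xray($s) ? 'match' : 'NO match';
}
```

Unlike `/x.ray/i`, this rejects strings such as "XYray" while still accepting the typographic dashes.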
Re: Regular Expression Hiccup by Laurent_R (Canon) on May 04, 2015 at 20:18 UTC
The dash (or hyphen) has no special meaning in general regular expressions (except in character classes). So there is absolutely no problem with: `if ($text =~ /Screening Ligands by X-ray Crystallography/) { say "MATCH!"; } else { say "NOPE!"; }` Please note that the code you showed has an extra closing curly brace. I also agree with Anonymous Monk that a regex might not be the ideal solution for your problem. The index built-in is likely to work faster, and the use of a hash might also be considered, but we don't know enough about what you are trying to do to be able to help you more precisely. Je suis Charlie.
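For reference, a substring test with the `index` built-in might look like the sketch below. The sub name `contains_title` and the sample strings are assumptions for illustration; lower-casing both sides is one simple way to sidestep the capitalization difference, and no speed comparison against the regex has been measured here.

```perl
use strict;
use warnings;

# Returns 1 if $needle occurs anywhere in $haystack, ignoring case.
# index() does a plain substring search, so no character is special
# (the "-" and "." are matched literally).
sub contains_title {
    my ($haystack, $needle) = @_;
    return index(lc($haystack), lc($needle)) >= 0 ? 1 : 0;
}

# Sample text shortened from the record quoted in this thread.
my $text  = "Methods Mol Biol. 2014;1140:315-23. "
          . "Screening Ligands by X-ray crystallography. Davies DR(1).";
my $title = "Screening Ligands by X-ray Crystallography";

print contains_title($text, $title) ? "MATCH!\n" : "NOPE!\n";
```

`index` returns -1 when the substring is absent, which is why the test is `>= 0` rather than a simple truth check (a match at position 0 would otherwise look false).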
Re: Regular Expression Hiccup by Anonymous Monk on May 04, 2015 at 19:50 UTC
Literal strings in regular expressions must match character-for-character. The "-" is just as significant as any other character; it isn't discarded or ignored "just because". A "-" is different from " ", so there's no match. Perhaps more importantly, what is it you are trying to accomplish? If you already know what you are looking for, verbatim, then there is no reason to search via regex, or even string cmp. Just place your titles in some hash structure, or a database, or whatever. Indices and searches are for looking up records by subfield or substring, i.e., smaller parts of the full record.
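As a rough illustration of the hash suggestion, here is a sketch of an exact-title lookup. It only applies once a complete title string is in hand; the `normalize` helper, the `%number_for` table, and its single entry are hypothetical, and the normalization rules (lower-casing, collapsing whitespace, dropping one trailing period) are just one possible choice.

```perl
use strict;
use warnings;

# Reduce a title to a canonical form so that capitalization, stray
# whitespace, and a trailing period do not defeat the exact lookup.
sub normalize {
    my ($title) = @_;
    $title = lc $title;
    $title =~ s/\s+/ /g;     # collapse runs of whitespace
    $title =~ s/^ | $//g;    # trim one leading/trailing space
    $title =~ s/\.$//;       # drop one trailing period
    return $title;
}

# Hypothetical table: normalized title => reference number in the file.
my %number_for = (
    normalize("Screening Ligands by X-ray Crystallography") => 1,
);

my $seen = "Screening Ligands by X-ray crystallography.";
my $n    = $number_for{ normalize($seen) };
print defined $n ? "reference number $n\n" : "unknown title\n";
```

A hash lookup compares whole keys, so it cannot find a title buried inside a longer abstract; for that, a substring or regex search is still needed.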
Re^2: Regular Expression Hiccup by Mindsword (Initiate) on May 04, 2015 at 20:21 UTC
Sorry for the late reply. Here's the block of text I am searching.
`--------- EFETCH RESULT(1..3): [ 1. Methods Mol Biol. 2014;1140:315-23. doi: 10.1007/978-1-4939-0354-2_23. Screening Ligands by X-ray crystallography. Davies DR(1). Author information: (1)Emerald Bio, 7869 NE Day Road W, Bainbridge Island, WA, 98110, USA, ddavies@embios.com. X-ray crystallography is an invaluable technique in structure-based drug discovery, including fragment-based drug discovery, because it is the only technique that can provide a complete three dimensional readout of the interaction between the small molecule and its macromolecular target. X-ray diffraction (XRD) techniques can be employed as the sole method for conducting a screen of a fragment library, or it can be employed as the final technique in a screening campaign to confirm putative "hit" compounds identified by a variety of biochemical and/or biophysical screening techniques. Both approaches require an efficient technique to prepare dozens to hundreds of crystals for data collection, and a reproducible way to deliver ligands to the crystal. Here, a general method for screening cocktails of fragments is described. In cases where X-ray crystallography is employed as a method to verify putative hits, the cocktails of fragments described below would simply be replaced with single fragment solutions. PMID: 24590727`
I have a list of 79 blocks of text that are written as if for the reference section of a paper, in what I think is APA style. I want to extract the PubMed IDs from the files. I found a way to get the abstract from PubMed, which contains the IDs. The problem is that it comes with similar hits, so I need to make certain I have the correct ID. Thus, a title search.
I have the titles in a hash that is linked to the number they were in the file. The original plan was to cycle through the hash, searching these abstracts to get the correct abstract, then extract the ID from it. The search "Screening Ligands by X-ray crystallography" doesn't work, though. No match. "Screening Ligands by" does. I thought the issue might be the "-": anything before it works fine, and anything after it works too, but "X-ray" simply fails.
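The plan described above (find the hit whose text contains the title, then take its ID) could be sketched roughly as follows. Everything here is an assumption layered on the post: the sub name `pmid_for_title`, the idea that each hit ends with a `PMID: <digits>` line, and the second record in the sample data, which is invented purely to show that the right hit is picked. The title is matched case-insensitively, with regex metacharacters escaped and any whitespace (including line breaks) allowed between words.

```perl
use strict;
use warnings;

# Returns the PMID of the first record containing $title, or undef.
sub pmid_for_title {
    my ($efetch_text, $title) = @_;

    # Escape each word and let any run of whitespace separate the words,
    # so a title wrapped across lines still matches.
    my $pattern = join '\s+', map { quotemeta($_) } split ' ', $title;

    # Treat everything up to and including each "PMID: <digits>" as one record.
    while ($efetch_text =~ /(.*?)PMID:\s*(\d+)/gs) {
        my ($record, $pmid) = ($1, $2);
        return $pmid if $record =~ /$pattern/i;
    }
    return;
}

# Sample input: the first record is abridged from the post; the second
# record is hypothetical.
my $efetch_text = <<'END';
1. Methods Mol Biol. 2014;1140:315-23.

Screening Ligands by X-ray crystallography.

Davies DR(1).

PMID: 24590727

2. Hypothetical Journal.

A hypothetical second hit.

PMID: 00000000
END

my $pmid = pmid_for_title($efetch_text, "Screening Ligands by X-ray Crystallography");
print defined $pmid ? "PMID: $pmid\n" : "no match\n";
```

Looping over the titles hash and calling this once per title would give the title-to-ID mapping; titles that return undef are the ones worth inspecting by hand for spelling or dash differences.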
Re^3: Regular Expression Hiccup by AnomalousMonk (Archbishop) on May 04, 2015 at 21:46 UTC
... "X-ray" simply fails.
What others have written remains true: a `'-'` (dash; hyphen) character is not the same as (and will not match) a `' '` (space) character. (Also pay attention to hdb's reply (update: among others) concerning case-insensitive matching.) Consider these matches to get a feel for regex matching:
`c:\@Work\Perl\monks>perl -wMstrict -le "my @X = ('aX-raya', 'aX raya', 'aXraya', 'axraya', 'X Ray', 'aXYraya'); ;; print 'case-sensitive matching:'; for my $s (@X) { printf qq{'$s' -> }; if ($s =~ m{ X [- ]? ray }xms) { print 'match'; } else { print 'NO match'; } } ;; print 'case-INsensitive matching:'; for my $s (@X) { printf qq{'$s' -> }; if ($s =~ m{ (?i) X [- ]? ray }xms) { print 'match'; } else { print 'NO match'; } } " case-sensitive matching: 'aX-raya' -> match 'aX raya' -> match 'aXraya' -> match 'axraya' -> NO match 'X Ray' -> NO match 'aXYraya' -> NO match case-INsensitive matching: 'aX-raya' -> match 'aX raya' -> match 'aXraya' -> match 'axraya' -> match 'X Ray' -> match 'aXYraya' -> NO match`
Contrast the use of `[- ]?` with `.` (dot) as a placeholder to match anything at all:
`c:\@Work\Perl\monks>perl -wMstrict -le "my @X = ('aX-raya', 'aX raya', 'aXraya', 'axraya', 'X Ray', 'aXYraya'); ;; print 'placeholder matching:'; for my $s (@X) { printf qq{'$s' -> }; if ($s =~ m{ X . ray }xms) { print 'match'; } else { print 'NO match'; } } " placeholder matching: 'aX-raya' -> match 'aX raya' -> match 'aXraya' -> NO match 'axraya' -> NO match 'X Ray' -> NO match 'aXYraya' -> match`
In general, it's better to match exactly what you want. Please see the Perl regex documentation perlre, perlretut, perlrequick, and various regex tutorials on Perlmonks. Give a man a fish: `<%-(-(-(-<`
Re^3: Regular Expression Hiccup by jeffa (Bishop) on May 04, 2015 at 21:17 UTC
The following works for me: `use strict; use warnings; use feature 'say'; my $text = do{ local $/; <DATA> }; say $text =~ /Screening Ligands by X-ray crystallography/ ? 'MATCH!' : 'NOPE'; __DATA__ --------- EFETCH RESULT(1..3): [ 1. Methods Mol Biol. 2014;1140:315-23. doi: 10.1007/978-1-4939-0354-2_23. Screening Ligands by X-ray crystallography. Davies DR(1). Author information: <edited for brevity>`
Did you notice the lower-case c? However, you say that you "want to extract the PubMed IDs from the files." If that is the case, then maybe you should just extract the number next to each occurrence of "`PMID:` ", like so: `use strict; use warnings; use feature 'say'; my $text = do{ local $/; <DATA> }; my @pmids = $text =~ /PMID:\s+(\d+)/gm; say for @pmids; __DATA__ --------- EFETCH RESULT(1..3): [ Some title some author Author information: blah blah blah PMID: 24590727 --------- EFETCH RESULT(1..3): [ Some title some author Author information: blah blah blah PMID: 45867737 --------- EFETCH RESULT(1..3): [ Some title some author Author information: blah blah blah PMID: 52497072`
jeffa L-LL-L--L-LL-L--L-LL-L-- -R--R-RR-R--R-RR-R--R-RR B--B--B--B--B--B--B--B-- H---H---H---H---H---H--- (the triplet paradiddle with high-hat)