Fine tuning a reg exp

markjrouse has asked for the wisdom of the Perl Monks concerning the following question:

Re: Fine tuning a reg exp (\w) by tye (Sage) on Feb 23, 2012 at 15:27 UTC
\w includes `[a-zA-Z]` (and others) so `[A-Z \w]` is exactly the same as `[ \w]` and certainly matches a-z. - tye
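A quick way to check this claim is to list what each class matches. This is only a sketch: the helper name `matched_chars` and the choice of printable ASCII as the test range are assumptions made for illustration.

```perl
use strict;
use warnings;

# Return the printable ASCII characters matched by a compiled character class.
sub matched_chars {
    my ($re) = @_;
    return join '', grep { /$re/ } map { chr } 32 .. 126;
}

my $with_range    = matched_chars(qr/[A-Z \w]/);
my $without_range = matched_chars(qr/[ \w]/);

# Both classes accept the same characters, lowercase letters included.
print "identical\n"         if $with_range eq $without_range;
print "matches lowercase\n" if $with_range =~ /[a-z]/;
```

Both messages should print, since `\w` already covers `A-Z` as well as `a-z`, digits and the underscore.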
Re^2: Fine tuning a reg exp (\w) by markjrouse (Initiate) on Feb 23, 2012 at 15:34 UTC
Thanks, still learning reg exps. Even if I do this: `^([A-Z]+\s[A-Z]+,)` I still get lowercase characters: `ABU BAKR Sabaqah School`
Re^3: Fine tuning a reg exp (\w) by toolic (Bishop) on Feb 23, 2012 at 15:40 UTC
I don't get that output with your regex:

```perl
use warnings;
use strict;

while (<DATA>) {
    print if /^([A-Z]+\s[A-Z]+,)/;
}

__DATA__
ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin)
(individual) [SDGT]
AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS; a.k.a. JAMIAT
AYAT-UR-RHAS AL ISLAMIA; a.k.a. JAMIAT IHYA UL TURATH AL ISLAMIA;
a.k.a. LAJNAT UL MASA EIDATUL AFGHANIA) Grand Trunk Road, near
Pushtoon Garhi Pabbi, Peshawar, Pakistan; Cheprahar Hadda, Mia Omar
Sabaqah School, Jalalabad, Afghanistan [SDGT]
```

prints:

```
ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin)
```

See http://sscce.org
Re^3: Fine tuning a reg exp (\w) by tye (Sage) on Feb 23, 2012 at 15:40 UTC
Since the regex you show will not match any lowercase letters and requires a comma and the match you show contains lowercase letters and no comma, I'm pretty sure you are not running the code you think you are (or similar mistake). - tye
Re^3: Fine tuning a reg exp (\w) by bitingduck (Deacon) on Feb 23, 2012 at 16:23 UTC
Try posting your complete regex from "s" through ";" (e.g. `s/([ A-Z]+)/myreplacement/gixm;`); it will make it clear what you're actually applying. Even better, post a small chunk of runnable code. It sounds from previous posts that you're just reading in lines and processing them in a pretty straightforward way, so the code should be quite short. A couple of hours with a regex tutorial would also help you get going faster: the questions you're posting are chapter 1 or 2 kind of things, because they're almost the first thing everyone wants to do.
Re^3: Fine tuning a reg exp (\w) by choroba (Cardinal) on Feb 23, 2012 at 15:44 UTC
Negative. `perl -e 'print "Sabaqah School," =~ /^([A-Z]+\s[A-Z]+),/ ? "Yes" : "No"'` prints `No`.
Re^3: Fine tuning a reg exp (\w) by brx (Pilgrim) on Feb 23, 2012 at 16:20 UTC
You should post your program. Watch out for modifiers (http://perldoc.perl.org/perlre.html#Modifiers):

```perl
while (<DATA>) {
    /([A-Z]+)/i;    # upper AND lower!
    print $1, "\n";
}
```

output: ABU individual AFGHAN AYAT a Pushtoon Sabaqah
Re: Fine tuning a reg exp by runrig (Abbot) on Feb 23, 2012 at 15:52 UTC
See perlre. All uppercase can be matched with: `/^[[:upper:]]+$/`

Update: This is unnecessary and doesn't help much. `[A-Z]` does just as well unless you consider Unicode...
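To make the Unicode caveat concrete, here is a small sketch. The sample word is an assumption chosen for illustration: `"\x{160}KODA"` is "ŠKODA", whose first letter (U+0160) is an uppercase letter outside the ASCII range. `\p{Upper}` is the Unicode property spelling of the same idea.

```perl
use strict;
use warnings;

# Assumed sample: U+0160 is LATIN CAPITAL LETTER S WITH CARON.
my $word = "\x{160}KODA";

my $ascii_only = $word =~ /^[A-Z]+$/       ? 1 : 0;
my $posix      = $word =~ /^[[:upper:]]+$/ ? 1 : 0;
my $property   = $word =~ /^\p{Upper}+$/   ? 1 : 0;

printf "%-12s %s\n", '[A-Z]',       $ascii_only ? 'match' : 'no match';
printf "%-12s %s\n", '[[:upper:]]', $posix      ? 'match' : 'no match';
printf "%-12s %s\n", '\p{Upper}',   $property   ? 'match' : 'no match';
```

For plain ASCII input such as the entries in this thread, all three forms behave the same, which is the point of the update above.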
Re^2: Fine tuning a reg exp by markjrouse (Initiate) on Feb 23, 2012 at 17:01 UTC
Thanks all, this is all really useful. I think I've got this bit now, but using the same text as an example:

```
ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin)
(individual) [SDGT]
AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS; a.k.a. JAMIAT
AYAT-UR-RHAS AL ISLAMIA; a.k.a. JAMIAT IHYA UL TURATH AL ISLAMIA;
a.k.a. LAJNAT UL MASA EIDATUL AFGHANIA) Grand Trunk Road, near
Pushtoon Garhi Pabbi, Peshawar, Pakistan; Cheprahar Hadda, Mia Omar
Sabaqah School, Jalalabad, Afghanistan [SDGT]
```

I'm now trying to match first names. What reg exp is needed to match `Ibrahim Ali Muhammad`? The reason is that I'm trying to add tags to a text document so that I can manipulate it, like this: `$line =~ s/regexp\<name\>$1\<\/name\>/;`

I want to achieve this:

```
ABU BAKR, <name>Ibrahim Ali Muhammad</name> (a.k.a. AL-LIBI, Abd al-Muhsin)
(individual) [SDGT]
AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS; a.k.a. JAMIAT
AYAT-UR-RHAS AL ISLAMIA; a.k.a. JAMIAT IHYA UL TURATH AL ISLAMIA;
a.k.a. LAJNAT UL MASA EIDATUL AFGHANIA) Grand Trunk Road, near
Pushtoon Garhi Pabbi, Peshawar, Pakistan; Cheprahar Hadda, Mia Omar
Sabaqah School, Jalalabad, Afghanistan [SDGT]
```
Re^3: Fine tuning a reg exp by LonelyPilgrim (Beadle) on Feb 23, 2012 at 17:59 UTC
Is there anything (punctuation, perhaps? placement with other words and terms?) that will consistently distinguish a name from any other proper noun in your text? For example, how can your script consistently distinguish between "Ibrahim Ali Muhammad" and "Grand Trunk Road" and "Pushtoon Garhi Pabbi", since all use the same capitalization scheme? You might have to define some more complicated criteria for recognizing names. Or will names only be in the headings of each entry, i.e. toward the beginning?

In general, you would want: `$line =~ s{($regexp)}{<name>$1</name>}g;`

The 'g' flag may or may not be needed, depending on what you're doing. If there's more than one name in a line, that would catch it. If there's only one name, you don't need it. The parentheses () match the name in your line and place it in $1, so you can put the tags around it in your replacement expression.

Using curly brackets {} instead of / to mark your regexp avoids having to escape your slashes ("leaning toothpick syndrome," I think someone called it -- it can get confusing!). Any other characters could be used to delimit your regexp if you'd prefer. What I have above is equivalent to this: `$line =~ s/($regexp)/<name>$1<\/name>/g;`
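As one possible answer to the "what distinguishes a name" question, here is a sketch that assumes given names only need tagging in an entry heading, directly after an all-caps surname and a comma. The helper `tag_given_names` and its pattern are illustrative, not something posted in the thread, and real entries may well need looser rules (hyphens, apostrophes, lowercase particles such as "al-").

```perl
use strict;
use warnings;

# Hypothetical helper: wrap the given names that follow an all-caps surname
# at the start of a line in <name> tags. Lines without such a heading are
# returned unchanged.
sub tag_given_names {
    my ($line) = @_;
    $line =~ s{
        ^( [A-Z]+ (?: \s [A-Z]+ )* , \s )      # group 1: all-caps surname(s), comma, space
        ( [A-Z][a-z]+ (?: \s [A-Z][a-z]+ )* )  # group 2: capitalized given names
    }{$1<name>$2</name>}x;
    return $line;
}

print tag_given_names("ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin)"), "\n";
print tag_given_names("Grand Trunk Road, near Pushtoon Garhi Pabbi"), "\n";
```

Anchoring at the start of the line and requiring the all-caps surname is what keeps "Grand Trunk Road" from being tagged; aliases after "a.k.a." are deliberately left alone in this sketch.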
Re^4: Fine tuning a reg exp by markjrouse (Initiate) on Feb 23, 2012 at 20:05 UTC
Re^5: Fine tuning a reg exp by LonelyPilgrim (Beadle) on Feb 23, 2012 at 20:25 UTC
Re^2: Fine tuning a reg exp by tchrist (Pilgrim) on Feb 23, 2012 at 23:17 UTC
`/^[[:upper:]]+$/`

I've never understood why people use that instead of the much easier to type, read, and use `\p{upper}`. Can you tell me why?
Re^3: Fine tuning a reg exp by choroba (Cardinal) on Feb 24, 2012 at 00:11 UTC
Because the former works in a shell and sed, too?
Re^4: Fine tuning a reg exp by runrig (Abbot) on Feb 24, 2012 at 01:12 UTC
Re: Fine tuning a reg exp by Marshall (Canon) on Feb 23, 2012 at 21:08 UTC
Before proceeding further with this, I think that it should be noted that "SDGT" is a buzzword for "Specially Designated Global Terrorists". Al-Libi, Abd al-Muhsin or Ibrahim Ali Muhammad is on the "Most Wanted Terrorist" list. Normally I would help anybody with anything related to Perl. However in this case, I would like to hear more about who you are and why you are doing this? And why you do not have access to the more easily parseable databases? I hope that you realize that parsing a terrorist list is a very touchy subject.

Update: If you are getting this info from a public URL, then post that URL. Posting anything like this from a US government internal database, even just a short excerpt, is not appropriate here.
Re^2: Fine tuning a reg exp by choroba (Cardinal) on Feb 23, 2012 at 22:05 UTC
Google shows this: http://www.treasury.gov/ofac/downloads/sdnlist.txt.
Re^3: Fine tuning a reg exp by Marshall (Canon) on Feb 23, 2012 at 22:12 UTC
I think the OP should post the URL that he is working from. Working from the whole list will make it easier to understand and parse out what he needs. If the info is public, I have no problem with making it "easier to understand" via re-formatting. And I would help with that. I personally feel "very uncomfortable" if the full info is not available to the general public and the OP's info looks more specific than what I could find. Something like this has not come up before in my time on Monks.

Your URL comes up with: `ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin; a.k.a. SABRI, Abdel Ilah; a.k.a. TANTOUCHE, Ibrahim Abubaker; a.k.a. TANTOUSH, Ibrahim Abubaker; a.k.a. TANTOUSH, Ibrahim Ali Abu Bakr; a.k.a. "'ABD AL-MUHSI"; a.k.a. "'ABD AL-RAHMAN"; a.k.a. "ABU ANAS"); DOB 1966; alt. DOB 27 Oct 1969; nationality Libya; Passport 203037 (Libya) (individual) [SDGT]`

Fine. Yes. I know this guy is on the terrorist list. But it doesn't show all of the info that the OP had, although it shows additional information. I think my pointing out that this is a terrorist list was appropriate. Let's see what the OP has to say and we go from there.
Re^4: Fine tuning a reg exp by markjrouse (Initiate) on Feb 23, 2012 at 23:30 UTC
Re^5: Fine tuning a reg exp by Marshall (Canon) on Feb 24, 2012 at 03:49 UTC
Re^4: Fine tuning a reg exp by choroba (Cardinal) on Feb 24, 2012 at 00:08 UTC
Re^5: Fine tuning a reg exp by Marshall (Canon) on Feb 24, 2012 at 04:13 UTC
Re^4: Fine tuning a reg exp by markjrouse (Initiate) on Feb 26, 2012 at 13:13 UTC