general regex question

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: general regex question by TedPride (Priest) on May 14, 2005 at 15:33 UTC
`open ($handle, 'surnames.html'); while (<$handle>) { $hash{$1} = () if m/^<\w+?><B>(\w+?)<\/B>/; } close ($handle); print '"' . join ("\",\n\"", sort keys %hash) . '"';`
Re^2: general regex question by crashtest (Curate) on May 14, 2005 at 16:58 UTC
To expand a little on TedPride's terse code segment: he is suggesting you open the page in your browser, save it as "surnames.html" in the same directory as a script containing his code, then run the script ~~and feed "surnames.html" to it through STDIN (like `perl script.pl < surnames.html`)~~.

Also note that the bold tags in the page are lower-case (`<b>`) while the script matches upper-case `<B>`, so you probably want to make your match case-insensitive by adding the `/i` modifier to your regex. Finally, if all you're interested in is the "Mac"s (as your original code fragment suggests), you might want to update the regular expression a little: `open ($handle, 'surnames.html'); while (<$handle>) { $hash{$1} = () if m/^<\w+?><B>(Mac\w+?)<\/B>/i; } close ($handle); print '"' . join ("\",\n\"", sort keys %hash) . '"';`

TedPride's code takes advantage of the fact that all the surnames are between `<b>` tags. The regular expression matches any line that starts with a general tag (`<\w+?>`, like "<br>"), followed by a bold tag (`<B>`), a name consisting of "Mac" plus one or more word characters (`(Mac\w+?)`), and the closing bold tag. The parentheses "capture" the matched name into the special variable `$1`.

Updated after Anonymous Monk's post below.
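As a quick check of how that pattern behaves, here is a minimal sketch run against a few made-up lines. The sample HTML and the `mac_names` helper are assumptions for illustration only; the real page's markup may well differ.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the real surnames.html may be laid out differently.
my @lines = (
    '<br><b>MacAdam</b> some text',
    '<br><b>McAdam</b> some text',
    '<p><B>MacAdie</B>',
    'no tags here',
);

# Hypothetical helper: collect captured names as hash keys so duplicates collapse.
sub mac_names {
    my %seen;
    for my $line (@_) {
        # A leading tag, then a bold tag, then a name starting with "Mac".
        # /i makes both the tags and the "Mac" prefix case-insensitive.
        $seen{$1} = 1 if $line =~ m/^<\w+?><b>(Mac\w+?)<\/b>/i;
    }
    return sort keys %seen;
}

print "$_\n" for mac_names(@lines);
```

With these sample lines only the two "Mac" names are kept: the "McAdam" line is skipped because the pattern insists on the literal prefix "Mac", and the untagged line never matches at all.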
Re^3: general regex question by Anonymous Monk on May 15, 2005 at 04:25 UTC
"then run the script and feed "surnames.html" to it through STDIN (like `perl script.pl < surnames.html`)"
That's not what he's suggesting at all. He's using `open` to open surnames.html.
Re: general regex question by holli (Abbot) on May 14, 2005 at 14:56 UTC
`my @a = ( "MacAchallies", "McAchallies", "MacAchounich", "McAchounich", "MacAdam", "McAdam", "MacAdie", "McAdie", "MacAindra", ); for ( @a ) { print "$1\n" if /^Ma?c(.+)/ }` holli, /regexed monk/
Re: general regex question by polettix (Vicar) on May 14, 2005 at 15:03 UTC
Uhmm, it's difficult to answer if you don't explain what's inside `$mystring`: the regex could be correct or not depending on this. Could you post all the relevant parts of the code you've written? Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))') Don't fool yourself.
Re: general regex question by fauria (Deacon) on May 14, 2005 at 15:08 UTC
`my @names = qw "MacAchallies McAchallies MacAchounich McAchounich MacAdam McAdam MacAdie McAdie MacAindra"; for(@names){ $_ =~ /M.?c*/ and print $'."\n"; }`
Re^2: general regex question by ww (Archbishop) on May 14, 2005 at 16:21 UTC
fauria: Your regex, `/M.?c*/`, will do fine at finding "Mac" or "Mc" and printing the REST of the name (rather than what the OP sought). Recall, though, that `c*` matches ZERO OR MORE "c"s, meaning non-names like Mccccc (ill-formatted roman numerals??) will match. Also, as written (without capturing parens), it prints: Achallies Achallies Achounich Achounich Adam Adam Adie Adie Aindra

whereas (with slightly different use of quantifiers and a couple of additional names as test cases) this appears to work as requested: `#!C:perl/bin my @names = qw "M Mac McA Mcccccc MacAchallies McAchallies MacAchounich McAchounich MacAdam McAdam MacAdie McAdie MacAindra"; for(@names){ $_ =~ /(M.?c+.*)/ and print "$1\n"; } =HEAD output is: Mac McA Mcccccc MacAchallies McAchallies MacAchounich McAchounich MacAdam McAdam MacAdie McAdie MacAindra =cut`

Note also that using `$'` (and friends) incurs a lot of overhead. And, just because they're not here: `use strict; use warnings;`.
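To make the quantifier point concrete, here is a small sketch comparing `c*` with `c+`. The word list is made up for illustration, and the variable names `@loose` and `@tight` are not from any post in the thread.

```perl
use strict;
use warnings;

# Made-up test words: some Mac/Mc forms plus a few that should not qualify.
my @words = qw(M Mike Mac Mc MacAdam McAdam);

# /M.?c*/ needs only an "M": both ".?" and "c*" are allowed to match nothing.
my @loose = grep { /M.?c*/ } @words;

# /M.?c+/ insists on at least one "c" right after the "M" or one character later.
my @tight = grep { /M.?c+/ } @words;

print "c* matches: @loose\n";
print "c+ matches: @tight\n";
```

Here `c*` lets every word through, including "M" and "Mike", while `c+` keeps only the four words that have a "c" in the expected position. Neither pattern is anchored, so adding `^` would still be worthwhile if the names are expected at the start of the string.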
Re: general regex question by sh1tn (Priest) on May 14, 2005 at 15:25 UTC
`my @a = ( "MacAchallies", "McAchallies", "MacAchounich", "McAchounich", "MacAdam", "McAdam", "MacAdie", "McAdie", "MacAindra", ); print grep { s/^ma?c(.+)/$1\n/i } @a;`