in reply to general regex question

open(my $handle, '<', 'surnames.html') or die "Cannot open surnames.html: $!";
my %hash;
while (<$handle>) {
    # Keep each bolded name once, as a hash key.
    $hash{$1} = () if m/^<\w+?><B>(\w+?)<\/B>/;
}
close($handle);
print '"' . join("\",\n\"", sort keys %hash) . '"';

Re^2: general regex question
by crashtest (Curate) on May 14, 2005 at 16:58 UTC
    To expand a little on TedPride's terse code segment, he is suggesting you open the page in your browser, save it as "surnames.html" in the same directory as a script with his code, then run the script and feed "surnames.html" to it through STDIN (like perl script.pl < surnames.html).

    Also note that the bold tags in the page are lower-case, while the script matches upper-case <B>, so you probably want to make your match case-insensitive by adding the /i modifier to your regex. Finally, if all you're interested in is the "Mac"s (as your original code fragment suggests), you might want to tighten the regular expression a little:
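    open(my $handle, '<', 'surnames.html') or die "Cannot open surnames.html: $!";
    my %hash;
    while (<$handle>) {
        # Keep each bolded "Mac..." name once, as a hash key.
        $hash{$1} = () if m/^<\w+?><B>(Mac\w+?)<\/B>/i;
    }
    close($handle);
    print '"' . join("\",\n\"", sort keys %hash) . '"';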

    TedPride's code takes advantage of the fact that all the surnames are between <b> tags. The regular expression matches any line that starts with a general tag (<\w+?>, like "<br>"), followed by a bold tag (<B>) and a name consisting of "Mac" plus one or more word characters ((Mac\w+?)), followed by the closing bold tag. The parentheses "capture" the matched name into the special variable $1.
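
To see how the pieces of that pattern behave, here is a minimal sketch that runs it over a few lines. The sample lines and the helper name extract_mac_names are made up for illustration; the real page's markup may differ.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for illustration only.
my @lines = (
    '<br><b>MacDonald</b> some text',
    '<br><b>Smith</b> some text',
    '<BR><B>MacGregor</B>',
    '<p>MacBeth without bold tags</p>',
    '<br><b>MacDonald</b> duplicate entry',
);

# Return the sorted, de-duplicated names captured by the pattern.
sub extract_mac_names {
    my @input = @_;
    my %seen;
    for my $line (@input) {
        # ^<\w+?>    a leading tag such as <br>
        # <B>        the bold tag (also matches <b> because of /i)
        # (Mac\w+?)  the name, captured into $1
        # <\/B>      the closing bold tag
        $seen{$1} = 1 if $line =~ m/^<\w+?><B>(Mac\w+?)<\/B>/i;
    }
    my @names = sort keys %seen;
    return @names;
}

print join(", ", extract_mac_names(@lines)), "\n";
```

The hash keys collapse the duplicate "MacDonald", and the line without bold tags is skipped. One caveat: /i applies to the whole pattern, so a bolded name starting with lower-case "mac" would be captured as well.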

    Updated after Anonymous Monk's post below.
      then run the script and feed "surnames.html" to it through STDIN (like perl script.pl < surnames.html)
      That's not what he's suggesting at all. He's using open to open surnames.html.
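
The difference can be sketched roughly as follows. The temporary file stands in for surnames.html, and its contents and the helper name names_from are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small stand-in for surnames.html (contents are made up).
my ($tmp_fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$tmp_fh} "<br><b>MacDonald</b>\n<br><b>Smith</b>\n";
close $tmp_fh or die "Cannot close $path: $!";

# Collect every bolded name from a filehandle, whatever its source.
sub names_from {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        $seen{$1} = 1 if $line =~ m/^<\w+?><B>(\w+?)<\/B>/i;
    }
    my @names = sort keys %seen;
    return @names;
}

# What the posted code does: open the file by name inside the script.
open my $fh, '<', $path or die "Cannot open $path: $!";
my @names = names_from($fh);
close $fh;
print "@names\n";

# The STDIN variant would instead call names_from(\*STDIN)
# and be run as: perl script.pl < surnames.html
```

With open, the script is run as plain perl script.pl and finds the file itself; no shell redirection is involved.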