Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I'm very new to Perl and I'm hoping you can help me with the following query. I wish to extract the surnames from the webpage below and reprint them in the format shown beneath it:
http://www.daire.org/names/scotsurs2.html
"MacAchallies", "McAchallies", "MacAchounich", "McAchounich", "MacAdam", "McAdam", "MacAdie", "McAdie", "MacAindra", "McAindra" ...
I've looked at regex tutorials but I don't know enough to do this. Any suggestions? Thanks.
if ( $mystring=~/^Mac\w/ ) { print "\"",$mystring ,"\",\n"; };

Replies are listed 'Best First'.
Re: general regex question
by TedPride (Priest) on May 14, 2005 at 15:33 UTC
    open ($handle, 'surnames.html');
    while (<$handle>) {
        $hash{$1} = () if m/^<\w+?><B>(\w+?)<\/B>/;
    }
    close ($handle);
    print '"' . join ("\",\n\"", sort keys %hash) . '"';
      To expand a little on TedPride's terse code segment, he is suggesting you open the page in your browser, save it as "surnames.html" in the same directory as a script with his code, then run the script and feed "surnames.html" to it through STDIN (like perl script.pl < surnames.html).

      Also note that the bold tags in the page are lower-case, while the script matches an upper-case <B>, so you probably want to make your match case-insensitive by adding the /i modifier to your regex. Finally, if all you're interested in is the "Mac"s (as your original code fragment suggests), you might want to update the regular expression a little:
      open ($handle, 'surnames.html');
      while (<$handle>) {
          $hash{$1} = () if m/^<\w+?><B>(Mac\w+?)<\/B>/i;
      }
      close ($handle);
      print '"' . join ("\",\n\"", sort keys %hash) . '"';

      TedPride's code takes advantage of the fact that all the surnames are between <b> tags. The regular expression matches any line that starts with a general tag (<\w+?>, like "<br>"), followed by a bold tag (<B>) and then a name made of "Mac" plus one or more word characters ((Mac\w+?)). The parentheses "capture" the matched name into the special variable $1.
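      For anyone who wants to watch the capture happen, here is a minimal sketch. The sample lines are made up (I haven't checked them against the real page's markup) and capture_mac_name is just a name I picked for illustration:

```perl
use strict;
use warnings;

# Return the captured surname if $line matches the pattern, else undef.
# The pattern is the one discussed above; /i lets <B> match <b> as well.
sub capture_mac_name {
    my ($line) = @_;
    return $line =~ m/^<\w+?><B>(Mac\w+?)<\/B>/i ? $1 : undef;
}

# Hypothetical sample lines; the real page's markup may differ.
my @lines = (
    '<br><b>MacAdam</b>',    # matches: leading tag, bold tag, "Mac" name
    '<br><b>McAdam</b>',     # no match: the name does not start with "Mac"
    '<b>MacAdie</b>',        # no match: no leading tag before the bold tag
);

for my $line (@lines) {
    my $name = capture_mac_name($line);
    print defined $name ? "captured: $name\n" : "no match: $line\n";
}
```

      If the page turns out to use different tags around the names, only the part of the pattern before the parentheses should need adjusting.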

      Updated after Anonymous Monk's post below.
        then run the script and feed "surnames.html" to it through STDIN (like perl script.pl < surnames.html)
        That's not what he's suggesting at all. His script calls open on surnames.html itself, so nothing has to be redirected to STDIN.
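        To make the difference concrete, here is a rough sketch in which the same loop reads from whatever filehandle it is given. It is not TedPride's code: names_from_handle is a hypothetical helper, and the in-memory string only stands in for the saved page so the example runs on its own.

```perl
use strict;
use warnings;

# Collect the distinct names found in bold tags on lines read from $fh.
sub names_from_handle {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        $seen{$1} = 1 if $line =~ m/^<\w+?><b>(\w+?)<\/b>/i;
    }
    return sort keys %seen;
}

# Stand-in for surnames.html so the sketch runs without the saved page.
my $html = "<br><b>McAdam</b>\n<br><b>MacAdam</b>\n<br><b>MacAdam</b>\n";
open( my $fh, '<', \$html ) or die "Cannot open in-memory file: $!";
my @names = names_from_handle($fh);
close($fh);

print '"' . join( "\",\n\"", @names ) . '"', "\n";

# With the real file:      open( my $fh, '<', 'surnames.html' ) or die ...;
# With shell redirection:  names_from_handle( \*STDIN );
```

        Opening the file inside the script means it can be run as plain "perl script.pl"; passing \*STDIN instead is what the "perl script.pl < surnames.html" form would need.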
Re: general regex question
by holli (Abbot) on May 14, 2005 at 14:56 UTC
    my @a = (
        "MacAchallies", "McAchallies", "MacAchounich", "McAchounich",
        "MacAdam", "McAdam", "MacAdie", "McAdie", "MacAindra",
    );
    for ( @a ) {
        print "$1\n" if /^Ma?c(.+)/
    }


    holli, /regexed monk/
Re: general regex question
by polettix (Vicar) on May 14, 2005 at 15:03 UTC
    Uhmm, it's difficult to answer if you don't explain what's inside $mystring: the regex could be correct or not depending on this. Could you post all the relevant parts of the code you've written?

    Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')

    Don't fool yourself.
Re: general regex question
by fauria (Deacon) on May 14, 2005 at 15:08 UTC
    my @names = qw "MacAchallies McAchallies MacAchounich McAchounich MacAdam
                    McAdam MacAdie McAdie MacAindra";
    for (@names) {
        $_ =~ /M.?c*/ and print $'."\n";
    }
      fauria:

      Your regex, /M.?c*/, will do fine at finding "Mac" or "Mc", but it prints the REST of the name (rather than what the OP sought). Recall also that c* matches ZERO OR MORE "c"s, so non-names like Mccccc (ill-formatted Roman numerals??) will match too.

      Also, as written (without capturing parens), it prints:

      
      Achallies
      Achallies
      Achounich
      Achounich
      Adam
      Adam
      Adie
      Adie
      Aindra
      

      whereas this (with slightly different use of quantifiers and a couple of additional names as test cases) appears to work as requested:

      #!C:perl/bin
      my @names = qw "M Mac McA Mcccccc MacAchallies McAchallies MacAchounich
                      McAchounich MacAdam McAdam MacAdie McAdie MacAindra";
      for (@names) {
          $_ =~ /(M.?c+.*)/ and print "$1\n";
      }

      =HEAD output is:
      Mac
      McA
      Mcccccc
      MacAchallies
      McAchallies
      MacAchounich
      McAchounich
      MacAdam
      McAdam
      MacAdie
      McAdie
      MacAindra
      =cut

      Note also that using $' (and friends) incurs a lot of overhead.
      And, just because they're not here: use strict; use warnings.
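      Pulling those points together, here is one possible sketch that runs under strict and warnings, anchors the pattern, avoids $', and prints the whole name in the quoted, comma-separated form from the original question. mac_names is a hypothetical helper, and requiring at least one word character after the prefix is my assumption about what counts as a surname:

```perl
use strict;
use warnings;

# Keep only names that start with "Mac" or "Mc" followed by more word characters.
sub mac_names {
    return grep { /^Ma?c\w+$/ } @_;
}

my @names = qw(M Mac Mcccccc Monk MacAdam McAdam MacAdie McAdie);
my @kept  = mac_names(@names);

# Quote each name and join with ", " as in the original question.
print join( ', ', map { qq("$_") } @kept ), "\n";
```

      A bare "M", "Mac" and "Monk" are rejected here, but something like "Mcccccc" still gets through; a regex alone can't tell that apart from a real surname.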
Re: general regex question
by sh1tn (Priest) on May 14, 2005 at 15:25 UTC
    my @a = (
        "MacAchallies", "McAchallies", "MacAchounich", "McAchounich",
        "MacAdam", "McAdam", "MacAdie", "McAdie", "MacAindra",
    );
    print grep { s/^ma?c(.+)/$1\n/i } @a;