Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I'm trying to teach myself some text mining and am parsing some letters that were originally handwritten. I'm trying to write a sub that finds a pattern that resembles a possible name based on the text being two capitalised words (I'm looking to refine this later but I'm trying to get the basics working first). My sub finds the bigram which I was trying to split and uses a slice to identify both relevant parts. I'm trying to return the possible words if the first word does not equal [Sidenote: but so far it is only returning the [Sidenote: and any following text ([Sidenote is on the top of each letter as a delimiter to split them). I was wondering about a possible hash on $word and then sorting by keys afterward but there is no guarantee that a name is mentioned in the letter.
use strict; use warnings; my $text = "letter.txt"; #open the file open my $fh, '<', $text or die "Can't read $text, $!"; my $letter = do { local $/; <$fh> }; close $fh; $letter =~ s/\s+\*//g; my @sidenotes = split /(?=\[Sidenote:)/, $letter; foreach my $text (@sidenotes) { my $name = find_name($text); print " $name\n"; } #sub to find possible names in the text sub find_name { my $name; my $n_text = shift or die "no text passed"; my @word = split ' ', $n_text; @word =~ m/\w{2}/i; foreach my $word (@word) { if ($word =~ m/^[A-Z]/) { if ($word[0] ne "[Sidenote:") { $name = $word[0]." ".$word[1]; } } } return $name; }
I'd be grateful for any pointers into improving this sub. Thanks.

Replies are listed 'Best First'.
Re: Finding a capitalised pair of words in a text
by jwkrahn (Abbot) on Jan 02, 2010 at 02:37 UTC
    @word =~ m/\w{2}/i;

    You have a pattern match in void context so whether or not it matched you are not doing anything with the result.    The binding operator (=~) uses two scalar operands so the array is used in scalar context which means that if @word contains 7 elements then your pattern match is "7" =~ m/\w{2}/i;.    Your pattern uses the \w character class which includes all upper and lower case letters so the use of the /i option is superfluous.

    foreach my $word (@word) { if ($word =~ m/^[A-Z]/) { if ($word[0] ne "[Sidenote:") { $name = $word[0]." ".$word[1];

    You are iterating through every element in @word and storing each in turn in the $word variable but you are always comparing only the first element of @word with "[Sidenote:" and assigning to $name only the first and second elements of @word every time.

    It sounds like you probably want to use a pattern something like:

    my @names = $letter =~ /\b[A-Z][a-zA-Z]+\s+[A-Z][a-zA-Z]+/g;
Re: Finding a capitalised pair of words in a text
by AnomalousMonk (Archbishop) on Jan 02, 2010 at 14:13 UTC

    I don't fully understand your requirements, but here's another possible approach to some kind of solution.

    Notes:

    • The example contains both false positives and false negatives.
    • The name 'John Jacob Astor' is split across two lines: there is a further processing step to squish some (but not all) whitespace in extracted names into a single blank.
    • An attempt is made to handle hyphenated names.

    >perl -wMstrict -le "my $letter = qq{One Mr. John Doe was a very good friend of John Jacob \n} . qq{Astor. They met often with William and Henry James and \n} . qq{Capt. Jean-Luc Picard at the quaint Tavern On The Green.}; print qq{''$letter''}; my $Capped = qr{ [[:upper:]] [[:lower:]]+ }xms; my $name = qr{ $Capped (?: - $Capped)* }xms; my $full_name = qr{ $name (?: \s+ $name)+ }xms; my @names = map { tr{ \t\n}{ }s; $_ } $letter =~ m{ $full_name }xmsg ; print qq{'$_'} for @names; " ''One Mr. John Doe was a very good friend of John Jacob Astor. They met often with William and Henry James and Capt. Jean-Luc Picard at the quaint Tavern On The Green.'' 'One Mr' 'John Doe' 'John Jacob Astor' 'Henry James' 'Jean-Luc Picard' 'Tavern On The Green'