Dear Monks,
I'm trying to teach myself some text mining and am parsing some letters that were originally handwritten. I'm trying to write a sub that finds a possible name, which for now means two consecutive capitalised words (I plan to refine this later, but I want to get the basics working first). My sub splits the text into words and uses a slice to pick out the two parts of the bigram. It should return the candidate words if the first word is not "[Sidenote:", but so far it only returns "[Sidenote:" and any following text ("[Sidenote:" appears at the top of each letter, and I use it as a delimiter to split them). I was wondering about building a hash keyed on $word and sorting by keys afterward, but there is no guarantee that a name is mentioned in a given letter.
use strict;
use warnings;

my $text = "letter.txt";

#open the file
open my $fh, '<', $text or die "Can't read $text, $!";
my $letter = do { local $/; <$fh> };
close $fh;

$letter =~ s/\s+\*//g;
my @sidenotes = split /(?=\[Sidenote:)/, $letter;

foreach my $text (@sidenotes) {
    my $name = find_name($text);
    print " $name\n";
}

#sub to find possible names in the text
sub find_name {
    my $name;
    my $n_text = shift or die "no text passed";
    my @word = split ' ', $n_text;
    @word =~ m/\w{2}/i;
    foreach my $word (@word) {
        if ($word =~ m/^[A-Z]/) {
            if ($word[0] ne "[Sidenote:") {
                $name = $word[0]." ".$word[1];
            }
        }
    }
    return $name;
}
I'd be grateful for any pointers on improving this sub. Thanks.
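
For reference, here is a minimal sketch of one way the sub could be reworked. It is not a definitive fix. It assumes that a candidate name is any two adjacent words that each start with a capital letter followed by lowercase letters, and that the marker is literally "[Sidenote:". The name find_names and the sample text are made up for illustration. Instead of always looking at $word[0] and $word[1], which are the first two words of every chunk, it scans the whole chunk with a global regex match and returns a list, which is empty when no name is found.

```perl
use strict;
use warnings;

# Return every pair of adjacent capitalised words found in a chunk of text.
# Returns an empty list when nothing matches, so the caller can test for that.
sub find_names {
    my ($text) = @_;
    return () unless defined $text && length $text;

    # Drop the delimiter token so it can never be reported as part of a name.
    # Assumption: the marker is literally "[Sidenote:".
    $text =~ s/^\s*\[Sidenote:\s*//;

    my @names;
    # Capital letter plus lowercase letters, whitespace, then the same again.
    while ($text =~ /\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b/g) {
        push @names, "$1 $2";
    }
    return @names;
}

# Hypothetical sample standing in for the contents of letter.txt.
my $letter = "[Sidenote: A short note] Dear sir, I met John Smith "
           . "and later Mary Jones in town.";

for my $chunk (split /(?=\[Sidenote:)/, $letter) {
    my @names  = find_names($chunk);
    my $report = @names ? join(', ', @names) : '(no candidate name)';
    print "$report\n";
}
```

A pattern like this will still produce false positives, for example a capitalised sentence opener followed by a proper noun, so it is only a starting point for the later refinement.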

In reply to Finding a capitalised pair of words in a text by Anonymous Monk
