Given that you are parsing HTML, I'd strongly recommend using a module such as HTML::TreeBuilder to do the heavy lifting for you. Consider the following:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my @goodWordsList = (
        "mhm", "right", "well", "yeah", "sure", "good", "ah", "okay", "yep",
        "hm", "definitely", "alright", "'m'm", "oh", "my", "god", "wow", "uhuh",
        "exactly", "yup", "mkay", "i see", "ooh", "cool", "uh", "fine", "true",
        "hm'm", "hmm", "yes", "absolutely", "great", "um", "so", "mm", "weird",
        "ye-", "i mean", "i know", "i think so", "huh", "yay", "maybe", "eh",
        "obviously", "correct", "awesome", "really", "interesting",
    );

    my %goodwords;
    @goodwords{@goodWordsList} = (1) x @goodWordsList;

    my $root = HTML::TreeBuilder->new ();
    $root->parse_file (*DATA);

    my %speakers;

    # Parse out speaker attributes
    for ($root->look_down ('_tag', 'strong')) {
        my $info = $_->right ();
        my $name = $_->as_text ();

        $speakers{$name}{info} = $info;
        for my $param (split /\s*(?:;\s*|$)/, $info) {
            my ($key, $value) = $param =~ /^:?\s*([^:]*):\s*(.*)/;
            next unless defined $key;    # skip fragments that are not "key: value"
            $speakers{$name}{$key} = $value;
        }
    }

    my %stats;

    # Do the analysis
    for ($root->look_down ('_tag', 'p')) {
        my $line = $_->as_text ();
        my ($name) = $line =~ /(\w+):/;

        # Perform analysis on paragraph here
    }

    __DATA__
    <strong>S1</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Senior Undergraduate; Gender: Male; Age: 17-23; Restriction: None<br>
    <strong>S2</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Researcher; Gender: Male; Age: 31-50; Restriction: Cite<br>
    <strong>S3</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Junior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
    <strong>S4</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Senior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
    <strong>S5</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Junior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
    <strong>SS</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Unknown; Gender: Male; Age: Unknown; Restriction: None<br>
    <p><b>S1: </b> it was presented to them by Chuck D and Public Enemy. <font color="#ff6600"><b> [S2: </b> mhm <b> ] </b></font> and the rest of th- Public Enemy and you know and and Chuck D's f- publicly gets up and says you know they were with us from the beginning and, <font color="#ff6600"><b> [S2: </b> <font color="#3333ff"> mhm </font> <b> ] </b></font> <font color="#3333ff"> all that </font> now wheth- whether or not you know that he was reading a TelePrompTer, <font color="#ff6600"><b> [S2: </b> mhm <b> ] </b></font> or or not i i think is uh </p>
    <p><b>S2: </b> or if he was trying to make nice because of the fact that Public Enemy hasn't sold records lately, <font color="#ff6600"><b> [S1: </b> right <b> ] </b></font> and he doesn't wanna look like some kinda old sourpuss </p>

which parses out all of the speaker attributes into %speakers, then iterates over the paragraphs, pulling out speaker names and doing whatever arcane thing it is you need to do for each paragraph. Note that a lot of error checking is not done: if the structure of the text differs from the sample, you will most likely get runtime errors and warnings. On the other hand, your current parsing is much more fragile (actually, broken even).
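
For what it's worth, here is a rough sketch of one thing the empty analysis loop might do. I'm only guessing at what you are after: it assumes you want to count, per interjecting speaker, the bracketed interruptions (such as "[S2: mhm ]") that consist solely of a word from the good words list. The sub name tally_minimal_responses and the sample string are mine, not from your code, and the sub works on the plain string that as_text () hands back, so it needs no modules:

```perl
use strict;
use warnings;

# Hypothetical helper (my own name, not from the original code): given the
# flattened text of one paragraph and a hash of minimal responses, count how
# often each interjecting speaker uses one of them.
sub tally_minimal_responses {
    my ($text, $goodwords) = @_;
    my %count;

    # Interjections show up in the flattened text as "[S2: mhm ]"
    while ($text =~ /\[\s*(\w+):\s*([^\]]*?)\s*\]/g) {
        my ($speaker, $utterance) = ($1, lc $2);

        $utterance =~ s/\s+/ /g;    # collapse runs of whitespace
        $count{$speaker}++ if $goodwords->{$utterance};
    }

    return \%count;
}

# Made-up sample in the shape as_text () produces for the data above
my %goodwords = map {$_ => 1} ("mhm", "right", "i see");
my $sample = 'S1:  it was presented  [S2:  mhm  ]  and the rest '
    . ' [S2:  mhm  ]  all that  [S3:  no way  ]  or not';
my $tally = tally_minimal_responses ($sample, \%goodwords);

print "$_: $tally->{$_}\n" for sort keys %$tally;    # prints "S2: 2"
```

In the real loop you would call it as tally_minimal_responses ($line, \%goodwords) and fold the result into %stats under $name. If an interruption has to match only approximately (say, "mhm yeah"), the lookup would need to change; this version insists on an exact match after lowercasing.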


DWIM is Perl's answer to Gödel
