Katy has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm very new to programming and I am having problems writing a program. I realize that you do not like to answer questions regarding homework and I want to use this program in a project for a course in computational linguistics, but I have never programmed before the past month, I have attempted to write a program to solve my problem, and I have looked through numerous online tutorials and sources. I am not sure where else to turn. My code looks like this:

    #!/usr/local/bin/perl -w
    # Program to count minimal pairs by gender
    # Author: Katy
    # Last modified: December 2006

    %Goodwords = ("mhm" => 1, "right" => 1, "well" => 1, "yeah" => 1,
        "sure" => 1, "good" => 1, "ah" => 1, "okay" => 1, "yep" => 1,
        "hm" => 1, "definitely" => 1, "alright" => 1, "'m'm" => 1,
        "oh" => 1, "my" => 1, "god" => 1, "wow" => 1, "uhuh" => 1,
        "exactly" => 1, "yup" => 1, "mkay" => 1, "i see" => 1, "ooh" => 1,
        "cool" => 1, "uh" => 1, "fine" => 1, "true" => 1, "hm'm" => 1,
        "hmm" => 1, "yes" => 1, "absolutely" => 1, "great" => 1, "um" => 1,
        "so" => 1, "mm" => 1, "weird" => 1, "ye-" => 1, "i mean" => 1,
        "i know" => 1, "i think so" => 1, "huh" => 1, "yay" => 1,
        "maybe" => 1, "eh" => 1, "obviously" => 1, "correct" => 1,
        "awesome" => 1, "really" => 1, "interesting" => 1,);

    while(<>){
        if(/<strong>(S[\w\-]+)<strong>:.*Gender:\s+(Male|Female)/i){
            $speakerID = $1;
            $speaker{$1}=$2;}
        if($good){
            $good = 1;
            $minresgen{$speaker{$speakerID}}++;}
        $_=~ /\[(S[\w\-]+):(.*?)\]/i;
        use diagnostics;
        $minres = $2;
        $speakerID = $1;
        $minres =~ s/<.*?>//g;
        $minres = lc($minres);
        $good = 1;
        while($minres =~ /(\w+)/g) {
            unless($Goodwords{$1}) {
                $good = 0;
                last;
            }
        }
        print "Male:";
        print $minresgen {"Male"};
        print "Female:";
        print $minresgen {"Female"};
    }

My data are transcriptions of conversations and a sample looks like this:

    <strong>S1</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Senior Undergraduate; Gender: Male; Age: 17-23; Restriction: None<br>
    <strong>S2</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Researcher; Gender: Male; Age: 31-50; Restriction: Cite<br>
    <strong>S3</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Junior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
    <strong>S4</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Senior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
    <strong>S5</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Junior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
    <strong>SS</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Unknown; Gender: Male; Age: Unknown; Restriction: None<br>
    <p><b>S1: </b> it was presented to them by Chuck D and Public Enemy. <font color="#ff6600"><b> [S2: </b> mhm <b> ] </b></font> and the rest of th- Public Enemy and you know and and Chuck D's f- publicly gets up and says you know they were with us from the beginning and, <font color="#ff6600"><b> [S2: </b> <font color="#3333ff"> mhm </font> <b> ] </b></font> <font color="#3333ff"> all that </font> now wheth- whether or not you know that he was reading a TelePrompTer, <font color="#ff6600"><b> [S2: </b> mhm <b> ] </b></font> or or not i i think is uh </p>
    <p><b>S2: </b> or if he was trying to make nice because of the fact that Public Enemy hasn't sold records lately, <font color="#ff6600"><b> [S1: </b> right <b> ] </b></font> and he doesn't wanna look like some kinda old sourpuss </p>

I want to get output that looks like: "Male:" followed by the number of times that male speakers use any of the words in the hash, singly or in combination with each other, when they are within brackets; then "Female:" followed by the same count for female speakers. When I run the program, the only output is four warning messages that repeat for each line of text that it processes, for example, line 557:

    Use of uninitialized value in hash element at C:\Documents and Settings\Owner\Desktop\minres10.pl.txt line 62, <> line 557 (#1)
    Use of uninitialized value in substitution (s///) at C:\Documents and Settings\Owner\Desktop\minres10.pl.txt line 67, <> line 557 (#1)
    Use of uninitialized value in print at C:\Documents and Settings\Owner\Desktop\minres10.pl.txt line 78, <> line 557 (#1)
    Use of uninitialized value in print at C:\Documents and Settings\Owner\Desktop\minres10.pl.txt line 80, <> line 557 (#1)

and then there is a block of "Male:Female:Male:Female:" at the bottom of the output. When I used "use strict;" the program did not run and it sent back messages:

    Variable "$speakerID" is not imported at C:\Documents and Settings\Owner\Desktop\minres10.pl.txt line 67 (#1)
        (F) While "use strict" in effect, you referred to a global variable that you apparently thought was imported from another module, because something else of the same name (usually a subroutine) is exported by that module. It usually means you put the wrong funny character on the front of your variable.
    Variable "$good" is not imported at ... line 70 (#1)
    Variable "%Goodwords" is not imported at ... line 72 (#1)
    Variable "$good" is not imported at ... line 73 (#1)
    Variable "%minresgen" is not imported at ... line 79 (#1)
    Variable "%minresgen" is not imported at ... line 81 (#1)
    Global symbol "$minres" requires explicit package name at ...line 66
    ..."$speakerID" ...line 67
    ..."$minres" ... line 68
    ... "$minres" ... line 69
    ... "$minres" ... line 69
    ... "$good" ... line 70
    ... "$minres" ... line 71
    ..."%Goodwords" ... line 72
    ..."$good" ... line 73
    ..."%minresgen" ... line 79
    ..."%minresgen" ... line 81
        (F) You've said "use strict vars," which indicates that all variables must either be lexically scoped (using "my"), declared beforehand using "our," or explicitly qualified to say which package the global variable is in (using "::").

I am using ActivePerl 5.8.8.819 Windows (x86) AS package. Any advice on how I should change my script to make the program give me my desired output would be greatly appreciated. Thanks, Katy

Re: minimal response program code problem
by jbert (Priest) on Dec 05, 2006 at 13:57 UTC
    Firstly, in order to run under use strict and use warnings (which is a good idea, it will show up various other problems), you need to declare the variables you are using.

    Ordinarily, you will want 'lexical' variables. Lexical is a bit of jargon which means that the variable is in scope (i.e. is valid) over a single chunk of your program. You can declare your variables with: my $speakerID; or several at a time with my ($speakerID, $minres); etc. Your initialisation of %Goodwords can also be turned into a declaration, by writing my %Goodwords = ( ... );

    The purpose of doing this is primarily organisational. If you accidentally misspell a variable name, strict will refuse to compile the program because you're attempting to use an undeclared variable. It also has more significant benefits when you're writing larger, more structured programs.
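
    As a minimal sketch of that point (the three-word hash and the misspelled $cuont are made up for illustration), declaring variables with my lets strict reject a misspelled name before the program runs. The string eval is only there so that the compile error can be printed instead of stopping the script:

```perl
use strict;
use warnings;

# Declare each variable with "my" before its first use.
my %Goodwords = ( mhm => 1, right => 1, yeah => 1 );
my ( $speakerID, $minres ) = ( 'S2', 'mhm' );

# Under "use strict", a misspelled name is a compile-time error.
# A string eval compiles its code at run time, so the error can be shown here.
my $ok = eval q{ use strict; my $count = 0; $cuont++; 1 };
print "strict caught: $@" unless $ok;

print "$speakerID said a good word\n" if $Goodwords{$minres};
```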

    Next - your regular expression is failing to match your initial lines. This is because you aren't matching the closing tag correctly; it should be </strong>. However, as you've probably noticed, you can't put a / directly in your regexp as is, because you're using that character to delimit the beginning and end. The easiest solution is to use another character to delimit the match (in which case you also need a leading 'm'), like this: m!<strong>(S[\w\-]+)</strong>:.*Gender:\s+(Male|Female)!
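
    A short sketch of that match in use, assuming a speaker line shortened from the sample data (the %speaker hash is the one from the original script):

```perl
use strict;
use warnings;

# One speaker-definition line, shortened from the sample data.
my $line = '<strong>S3</strong>: Academic Role: Junior Undergraduate; Gender: Female; Age: 17-23<br>';

my %speaker;

# m!...! uses "!" as the delimiter, so the "/" in </strong> needs no escaping.
if ( $line =~ m!<strong>(S[\w\-]+)</strong>:.*Gender:\s+(Male|Female)!i ) {
    $speaker{$1} = $2;    # speaker ID => gender
}

print "S3 is $speaker{S3}\n";
```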

    Also, if you've matched a line as defining a user, you don't want to carry on and try to process it as a spoken sentence. So you should either use a big if/else in the loop or, better, simply go around for the next iteration of the loop, i.e. put a next in the if: if (m!<strong>..!) { ...; next; }

    If your sentence-recognising match fails (for example on a blank line), you might want to skip the rest of the loop. You could put an if(!/.../) { next; } in there to do this. But perhaps preferably (since we all make mistakes writing regexps from time to time), it is better to explicitly ignore blank lines and then warn if your regular expression doesn't match.
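
    Putting the last two suggestions together, the loop might be organised roughly as follows. This is only a sketch: the @lines array stands in for reading the real file with while(<>), its contents are shortened or invented, and only the first bracketed response on each line is captured, as in the original script.

```perl
use strict;
use warnings;

# Stand-in input: a speaker line, a blank line, an utterance, and a stray line.
my @lines = (
    '<strong>S1</strong>: Academic Role: Senior Undergraduate; Gender: Male<br>',
    '',
    '<p><b>S2: </b> or if he was trying <b> [S1: </b> right <b> ] </b> and he</p>',
    'some line in an unexpected format',
);

my ( %speaker, @responses );
for my $line (@lines) {

    # Speaker definition: record it and go straight to the next line.
    if ( $line =~ m!<strong>(S[\w\-]+)</strong>:.*Gender:\s+(Male|Female)!i ) {
        $speaker{$1} = $2;
        next;
    }

    # Blank lines are expected, so skip them quietly.
    next if $line =~ /^\s*$/;

    # Anything else should contain a bracketed response; warn if it does not.
    if ( $line =~ /\[(S[\w\-]+):(.*?)\]/i ) {
        push @responses, [ $1, $2 ];    # [ speaker ID, raw response text ]
    }
    else {
        warn "Unrecognised line: $line\n";
    }
}

printf "%d speaker(s), %d response(s)\n", scalar( keys %speaker ), scalar @responses;
```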

    You can print multiple things in a print statement like this: print "Male:", $minresgen {"Male"}, "\n"; (I've also put a newline on the end there).

    I don't know if you're meaning to print out the Male/Female info for every line. I suspect not, so you'll probably want to move those print statements out of the main loop and put them afterwards. If you do, you might want to initialise those items to zero so that you don't get warnings if a particular line doesn't contain both male and female.
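
    For instance, with the counters initialised up front and the report moved after the loop, the tail of the program could look like this sketch. The list of four "Male" entries is a stand-in for the real per-line processing:

```perl
use strict;
use warnings;

# Start both counters at zero so printing is safe even if one gender never scores.
my %minresgen = ( Male => 0, Female => 0 );

# Stand-in for the main loop: the gender of each speaker who gave a minimal response.
for my $gender (qw(Male Male Male Male)) {
    $minresgen{$gender}++;
}

# Report once, after the loop has finished.
print "Male: ",   $minresgen{Male},   "\n";
print "Female: ", $minresgen{Female}, "\n";
```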

    Sorry for a long rambling post, but it looks like you've come a long way yourself - good luck on getting the rest of the way.

Re: minimal response program code problem
by johngg (Canon) on Dec 05, 2006 at 14:38 UTC
    jbert and liverpole have given you some pointers to get you further along. This post is to show you how to save a lot of typing when setting up your %Goodwords hash. Since all you are storing for values in the hash is 1, you can set all of the keys up first in an array and then use a map or a foreach loop to set up the hash. Also, you can use the quote words operator qw{ ... } to set up those keys that don't contain spaces, which saves typing quote marks. First let's set up the list of keys:

        use strict;
        use warnings;

        # Set up the keys that contain no spaces.
        #
        my @listOfGoodwords = qw{
            mhm right well yeah sure good ah okay yep hm definitely
            alright 'm'm oh my god wow uhuh exactly yup mkay ooh cool
            uh fine true hm'm hmm yes absolutely great um so mm weird
            huh yay maybe eh obviously correct awesome really
            interesting ye-};

        # Now add those keys that do contain spaces.
        #
        push @listOfGoodwords,
            q{i see}, q{i mean}, q{i know}, q{i think so};

    Note that the q{ ... } is the same as using single-quotes whereas qq{ ... } is the same as double-quotes. Variables and character escapes like \n for newline interpolate in double-quotes but not in single-quotes. It is good practice only to use double-quotes when you require interpolation and to use single-quotes at all other times.
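
    A tiny illustration of the difference, using a made-up $word variable:

```perl
use strict;
use warnings;

my $word = 'mhm';

my $single = q{$word\n};     # same as '$word\n': nothing is interpolated
my $double = qq{$word\n};    # same as "$word\n": the variable and \n are expanded

print length($single), "\n";    # counts the characters $ w o r d \ n
print length($double), "\n";    # counts m h m plus one newline
```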

    Now that we have the list we construct the hash, either with a foreach

        my %Goodwords;
        foreach my $key (@listOfGoodwords)
        {
            $Goodwords{$key} = 1;
        }

    or with a map

    my %Goodwords = map {$_ => 1} @listOfGoodwords;

    The map might look a little strange but all it is doing is passing each key in @listOfGoodwords into the map from the right (items passed into a map are held in the $_ variable) and passing two things, the key and value 1, out of the map to the left, thus populating the hash.
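
    To make the two steps visible, this sketch (with a shortened three-item list) first captures what the map returns and then assigns it to the hash:

```perl
use strict;
use warnings;

my @listOfGoodwords = ( 'mhm', 'right', 'i see' );

# map returns the flat list ('mhm', 1, 'right', 1, 'i see', 1) ...
my @pairs = map { $_ => 1 } @listOfGoodwords;

# ... and assigning such a list to a hash pairs it up as key, value, key, value.
my %Goodwords = @pairs;

print scalar(@pairs), " list items, ", scalar( keys %Goodwords ), " keys\n";
print "found 'i see'\n" if $Goodwords{'i see'};
```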

    I hope this is of use to you.

    Cheers

    JohnGG

    Update: Corrected typo in explanation of the map

      I quite often use a hash slice in that context:

          my %Goodwords;
          @Goodwords{@listOfGoodwords} = (1) x @listOfGoodwords;

      TIMTOWTDI and YMMV. ;-)


      DWIM is Perl's answer to Gödel
        Nice. GrandFather ++

        I've seen something like this in books a couple of times recently but it hasn't registered yet in my grab-bag of handy tools. Mainly because I still think of x as just a string multiplier and not as a list multiplier as well. I need to use it a couple of times so that it springs to mind on cue.
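
        A small sketch of both uses of x side by side, with a shortened word list:

```perl
use strict;
use warnings;

my @listOfGoodwords = ( 'mhm', 'right', 'yeah' );

# On a string, x repeats the string.
my $dashes = '-' x 3;

# On a parenthesised list, x repeats the list. The array on the right is
# evaluated in scalar context, giving its element count (3 here).
my @ones = (1) x @listOfGoodwords;

# A hash slice assigns to several keys at once.
my %Goodwords;
@Goodwords{@listOfGoodwords} = @ones;

print "$dashes @ones\n";
print join( ',', sort keys %Goodwords ), "\n";
```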

        Cheers,

        JohnGG

Re: minimal response program code problem
by liverpole (Monsignor) on Dec 05, 2006 at 13:59 UTC
    Hi Katy,

    Welcome to Perlmonks!  I hope your experience here, and with Perl in general, will be a rewarding one.

    It's a very good thing that you are using strict, and warnings (as provided by the -w).  They will help you to catch many potential problems early on.

    The easiest way to get rid of a lot of your warnings/errors is to define the lexical variables you use, by prefixing them with my, or, if they are used in many places, declaring them at the top of your program.

    For example:

        my %Goodwords;
        my $speakerID;
        my %speaker;
        my $good;
        my %minresgen;
        my $minres;

    See if that gets you farther along.

    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re: minimal response program code problem
by zentara (Cardinal) on Dec 05, 2006 at 14:20 UTC
    Update: Set initial hash to zeros

    This may not be exactly what you are looking for (I didn't account for words in brackets...whatever that meant); but it should clarify how things should run. This version will count the number of times a male or female said the word, and count the total times the word was uttered. You could make your %Goodwords a hash of hashes (HoH) and have a separate count for each word for male and female. Like:

    %Goodwords =( 'well' =>{ 'm' =>0, 'f' => 0, }, ..... .... );
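
    As a rough sketch of how such a hash of hashes could be filled (the two-word list and the @seen observations are invented for illustration and are not taken from the corpus):

```perl
use strict;
use warnings;

# Per-word counters, one inner hash per good word (only two words shown).
my %Goodwords = (
    'well' => { 'm' => 0, 'f' => 0 },
    'mhm'  => { 'm' => 0, 'f' => 0 },
);

# Hypothetical observations: [ word, gender code ].
my @seen = ( [ 'mhm', 'm' ], [ 'mhm', 'm' ], [ 'well', 'f' ], [ 'hello', 'f' ] );

for my $obs (@seen) {
    my ( $word, $sex ) = @$obs;
    next unless exists $Goodwords{$word};    # ignore words that are not in the list
    $Goodwords{$word}{$sex}++;
}

for my $word ( sort keys %Goodwords ) {
    printf "%s: male %d, female %d\n",
        $word, $Goodwords{$word}{'m'}, $Goodwords{$word}{'f'};
}
```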

    But look at this to see how to make it run:


    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: minimal response program code problem
by Melly (Chaplain) on Dec 05, 2006 at 14:59 UTC

    Hmm, well, I've had a go at re-writing your code.

    I won't claim that this is the 'best' code - or that I'd have written it this way if I'd started from scratch, but I think it does what you want (it also counts 'bad words', since it occurred to me that there's no point in counting good words unless you've got some kind of ratio to compare them to).

        use strict;

        my %goodwords = ("mhm" => 1, "right" => 1, "well" => 1, "yeah" => 1,
            "sure" => 1, "good" => 1, "ah" => 1, "okay" => 1, "yep" => 1,
            "hm" => 1, "definitely" => 1, "alright" => 1, "'m'm" => 1,
            "oh" => 1, "my" => 1, "god" => 1, "wow" => 1, "uhuh" => 1,
            "exactly" => 1, "yup" => 1, "mkay" => 1, "i see" => 1, "ooh" => 1,
            "cool" => 1, "uh" => 1, "fine" => 1, "true" => 1, "hm'm" => 1,
            "hmm" => 1, "yes" => 1, "absolutely" => 1, "great" => 1, "um" => 1,
            "so" => 1, "mm" => 1, "weird" => 1, "ye-" => 1, "i mean" => 1,
            "i know" => 1, "i think so" => 1, "huh" => 1, "yay" => 1,
            "maybe" => 1, "eh" => 1, "obviously" => 1, "correct" => 1,
            "awesome" => 1, "really" => 1, "interesting" => 1,);

        my(%speaker_record); # store the info in this hash in an array ref
        my $gender = 0; # array number for gender
        my $matched_words = 1; # array number for matched words count
        my $unmatched_words = 2; # array number for unmatched words count

        while(<DATA>){
          if(/<strong>(S[\w\-]+)<\/strong>:.*Gender:\s+(Male|Female)/i){
            $speaker_record{$1}->[$gender]=$2;
            $speaker_record{$1}->[$matched_words]=0;
            $speaker_record{$1}->[$unmatched_words]=0;
          }
          else{
            # hopefully, a chunk contains just the stuff attributed to one speaker (split on <b>)
            my @chunks = split /<b>/, $_;
            foreach my $chunk(@chunks){
              if($chunk =~ /(\w+?):/){ # who is the speaker of this chunk?
                my $speaker = $1;
                # get rid of stuff we don't want to count
                $chunk =~ s/<.*?>//g; # html tags and content
                $chunk =~ s/\[|\]//g; # '[' and ']'
                $chunk =~ s/(\w+?)://g; # the speaker
                my @words = split /\s+/, $chunk; # break the chunk up into words
                foreach my $word(@words){
                  #non-blank 'word' and valid speaker
                  if($word !~ /^\s*$/ and exists $speaker_record{$speaker}){
                    # a matched goodword
                    if(exists $goodwords{$word}){
                      $speaker_record{$speaker}->[$matched_words] ++;
                    }
                    # an unmatched word
                    else{
                      $speaker_record{$speaker}->[$unmatched_words] ++;
                    }
                  }
                }
              }
            }
          }
        }

        foreach(keys %speaker_record){
          print "Speaker: $_, Gender: $speaker_record{$_}->[$gender], ";
          print "Matched words: $speaker_record{$_}->[$matched_words], ";
          print "Unmatched words: $speaker_record{$_}->[$unmatched_words]\n";
        }

        __DATA__
        <strong>S1</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Senior Undergraduate; Gender: Male; Age: 17-23; Restriction: None<br>
        <strong>S2</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Researcher; Gender: Male; Age: 31-50; Restriction: Cite<br>
        <strong>S3</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Junior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
        <strong>S4</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Senior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
        <strong>S5</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Junior Undergraduate; Gender: Female; Age: 17-23; Restriction: None<br>
        <strong>SS</strong>: Native-Speaker Status: Native speaker, American English; Academic Role: Unknown; Gender: Male; Age: Unknown; Restriction: None<br>
        <p><b>S1: </b> it was presented to them by Chuck D and Public Enemy. <font color="#ff6600"><b> [S2: </b> mhm <b> ] </b></font> and the rest of th- Public Enemy and you know and and Chuck D's f- publicly gets up and says you know they were with us from the beginning and, <font color="#ff6600"><b> [S2: </b> <font color="#3333ff"> mhm </font> <b> ] </b></font> <font color="#3333ff"> all that </font> now wheth- whether or not you know that he was reading a TelePrompTer, <font color="#ff6600"><b> [S2: </b> mhm <b> ] </b></font> or or not i i think is uh </p>
        <p><b>S2: </b> or if he was trying to make nice because of the fact that Public Enemy hasn't sold records lately, <font color="#ff6600"><b> [S1: </b> right <b> ] </b></font> and he doesn't wanna look like some kinda old sourpuss </p>

    Tom Melly, pm@tomandlu.co.uk

      There is a slight problem with this approach: it doesn't pick up 'multiword' items from %goodwords ("i see", "i mean", "i know", "i think so").

        Oops - very true... hmm, I guess I should loop through the goodwords keys and test each one against the chunk... although, given that very few of the goodwords in this example are multi-word, it might be quicker to treat those as the exceptions and check for them separately.
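
        One possible way to cover the multi-word entries, offered as a sketch rather than a tested fix for the code above: build a single alternation from all the goodwords, longest first, and ask whether a cleaned-up response consists of nothing but those words or phrases. The shortened word list, the helper name is_minimal_response and the sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Shortened list; the full %goodwords keys would be used in practice.
my @goodwords = ( 'mhm', 'right', 'yeah', 'oh', 'i see', 'i think so', 'so' );

# Longest phrases first, so "i think so" is tried before shorter alternatives.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @goodwords;

# The whole string must be a run of good words/phrases separated by whitespace.
my $all_good = qr/\A(?:(?:$alternation)(?:\s+|\z))+\z/;

# Returns 1 if the text consists only of good words or phrases, else 0.
sub is_minimal_response {
    my ($text) = @_;
    $text =~ s/<.*?>//g;        # drop HTML tags
    $text = lc $text;
    $text =~ s/^\s+|\s+$//g;    # trim both ends
    $text =~ s/\s+/ /g;         # collapse runs of whitespace
    return $text =~ $all_good ? 1 : 0;
}

for my $text ( ' </b> mhm <b> ', 'Oh I see', 'i think not' ) {
    printf "%-18s %s\n", "'$text'", is_minimal_response($text) ? 'yes' : 'no';
}
```

        Because the check is anchored at both ends, a response such as "i think not" is rejected as a whole instead of being credited for the words it happens to share with the list.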

        map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
        Tom Melly, pm@tomandlu.co.uk
Re: minimal response program code problem
by GrandFather (Saint) on Dec 05, 2006 at 22:59 UTC

    Given that you are parsing HTML I'd strongly recommend using a module such as HTML::TreeBuilder to do the heavy lifting for you. Consider the following:

        use strict;
        use warnings;
        use HTML::TreeBuilder;

        my @goodWordsList = (
            "mhm", "right", "well", "yeah", "sure", "good", "ah", "okay", "yep", "hm",
            "definitely", "alright", "'m'm", "oh", "my", "god", "wow", "uhuh", "exactly",
            "yup", "mkay", "i see", "ooh", "cool", "uh", "fine", "true", "hm'm", "hmm",
            "yes", "absolutely", "great", "um", "so", "mm", "weird", "ye-", "i mean",
            "i know", "i think so", "huh", "yay", "maybe", "eh", "obviously", "correct",
            "awesome", "really", "interesting",
            );

        my %goodwords;
        @goodwords{@goodWordsList} = (1) x @goodWordsList;

        my $root = HTML::TreeBuilder->new ();

        $root->parse_file (*DATA);

        my %speakers;

        # Parse out speaker attributes
        for ($root->look_down ('_tag', 'strong')) {
            my $info = $_->right ();
            my $name = $_->as_text ();

            $speakers{$name}{info} = $info;
            for my $param (split /\s*(?:;\s*|$)/, $info) {
                my ($key, $value) = $param =~ /^:?\s*([^:]*):\s*(.*)/;
                $speakers{$name}{$key} = $value;
            }
        }

        my %stats;

        # Do the analysis
        for ($root->look_down ('_tag', 'p')) {
            my $line = $_->as_text ();
            my ($name) = $line =~ /(\w+):/;

            # Perform analysis on paragraph here
        }

    which parses out all of the speaker attributes into %speakers, then iterates over the paragraphs pulling out speaker names and doing whatever arcane thing it is you need to do for each paragraph. Note that there is a lot of error checking not done. If the structure of the text differs from the sample then you will most likely get run time errors and warnings. On the other hand, your current parsing is much more fragile (actually, broken even).


    DWIM is Perl's answer to Gödel