http://qs1969.pair.com?node_id=104604

When I was in grad school, I had to refer to concordances a lot, especially Shakespeare and Chaucer. I was trying to find a particular line in a Shakespeare play (Othello, to be exact) the other day and I thought that it would be an entertaining programming exercise to write a concordance generator. Just pass the code a text file and it will generate a full concordance, listing the number of times each word appears in the text, as well as the line numbers, or you can pass it a specific word and a text file, and it will return the line(s) that contain that word.

Now before everyone starts asking why I didn't use strict, my answer is that I did up until the moment I tried to use Getopt::Std. Obviously I'm missing something, but in order to pass strict, I had to declare my $opt variables. But when I did that, it ignored my command line flags. Any help in that regard would be greatly appreciated.

Update: Modified code. Still tweaking..... (btw, the line in Othello I was looking for was the line about throwing away a pearl worth more than the whole tribe. I don't remember why I was looking it up now, but it seemed important at the time.)
#!/usr/bin/perl #--------------------------------------------------------------------# # Concordance Generator # Date Written: 13-Aug-2001 04:02:11 PM # Last Modified: 14-Aug-2001 04:14:00 PM # Author: Kurt Kincaid # # This is free software and may be distributed under the # same terms as Perl itself. # # A simple concordance generator, particularly useful for linguistic # analysis. #--------------------------------------------------------------------# use strict; use vars qw($opt_h $opt_s); use Getopt::Std; my @theseWords; my @theseLines; my @found; my %Count; my %Line; my ( $line, $word, $count, $LineNum ); my $VERSION = "1.0"; getopts( "hs:" ); if ( $opt_h ) { Usage(); } my $file = shift || Usage(); open ( IN, $file ) || die "$file not found\n"; @theseLines = <IN>; close (IN); chomp @theseLines; if ( $opt_s ) { Word($opt_s); } foreach $line ( @theseLines ) { $count++; $line = lc $line; $line =~ s/[.,:;?!]//g; while ( $line =~ /\b\w+\b/g ) { $word = $&; if ( $word =~ /\s/ || $word eq "" ) { next } $Count{$word}++; if ( defined $Line{$word} ) { $Line{$word} =~ m/(\d*?)$/; if ( $1 == $count ) { next; } else { $Line{$word} .= ", $count"; } } else { $Line{$word} = $count; } # push @{$Line{$word}}, $count unless exists $Line{$word} && $L +ine{$word}[-1] == $count; } } @theseWords = keys %Count; @theseWords = sort @theseWords; foreach $word ( @theseWords ) { # print ( "$word ($Count{$word}): ", join ', ', @{$Line{$word}}, "\ +n\n" ); print ("$word ($Count{$word}): $Line{$word}\n\n"); } sub Word { my $word = shift; foreach $line ( @theseLines ) { $LineNum++; $Line{$line} = $LineNum; } @found = grep { /$word/i } @theseLines; foreach $line ( @found ) { print ("$Line{$line}: $line\n"); } exit; } sub Usage { print <<END; Concordance Generator v$VERSION $0 [-h] [-s word] filename -h Print this screen. -s Perform a search for a specific word with immediate context. END exit; }

Replies are listed 'Best First'.
Re: Concordance Generator
by chipmunk (Parson) on Aug 14, 2001 at 06:14 UTC
    I can guess that you declared yout $opt_... variables with my(). Since my() scopes the variables to the lexical block, the variables you declared are not visible to Getopt::Std.

    Instead, you should declare the variables with use vars: use vars qw($opt_h $opt_s); (In perl5.6, you could declare them with our instead.)

    Alternatively, you could switch to Getopt::Long, which allows you to specify whatever variables you want to get the values of the options. It has other useful features as well.

      Thank you. Yes, that's exactly what I was doing. I've incorporated the change into the posted code.
      ___________________
      Kurt
Re (tilly) 1: Concordance Generator
by tilly (Archbishop) on Aug 14, 2001 at 08:39 UTC
    Furthering what chipmunk said, you can also pass a reference to a hash to Getopt::Std and it will put data in that rather than in global variables.
Re: Concordance Generator
by John M. Dlugosz (Monsignor) on Aug 14, 2001 at 10:49 UTC
    More comments:

    my @theseWords = undef; is nonesense. Arrays are not undef. Only scalars are undef. It's being converted to (), which is what my @theseWords; by itself would do anyway.

    Now for the hash you are using (): my %Count = (); again, that's unnecessary since it's created empty in the first place.

    Now look at my ( $line, $word, $count, $LineNum ) = undef; what do you think that does? Since the undef is converted to an empty list (someone correct me if I'm wrong, please), then you have a list of 4 lvalues being assigned an empty list. That is treated like (undef, undef, undef, undef) which again is what my would have done anyway.

    More subtle, little-known issue: The $VERSION should be a number of a v-string, not a string. See Everything you wanted to know about Module Version Numbers and Checking for the issues and compatability things.

    open ( IN, $file ); ... or die;?

    Re: $word = $opt_s; Word(); That is nasty!. The variable $word is being used for loop control in a couple places, and the same variable is used for passing a parameter to this function. Pass a parameter instead of using a global! The only occurance of $word other than the ones which are loop control (set within the loop) is one use within Word(). Say: Word($opt_s) if $opt_s; and use the parameter in Word with my $word= shift;.

    Meanwhile take $word out of the globals at the top and add it to each foreach loop instead. With strict turned on, you will find these.

    undef @theseWords; @theseWords = split( / /, $line );
    Why zero out the array when you assign to it in the very next line? And again, you're using the same variable for different purposes throughout the code: it's essentially local to this outer foreach loop, and then it's used again for the print loop later. Use my to make it more local to the purpose, whether or not you use the same name.

    s/\s\s*?//g I'll let japhy tell you about that one. I think you want \s+?. But, if you split on whitespace (see docs on split) you don't need to do that at all! Extra blanks are no trouble.

    $Line{$word} =~ m/(\d*?)$/; if ( $1 == $count ) { next; } else { $Line{$word} .= ", $count";
    So you are keeping a list of found lines as a long string, and not adding the same number twice. Why not use a list instead of a long string, so you don't have to parse it out each time? You won't have to duplicate the assignment as a special case of being the first time, either:
    push @{$Line{$word}}, $count unless exists $Line{$word} && $Line{$word}[-1] == $count;
    Re sort { $a cmp $b }, that's what sort does normally. And it will run a lot faster if you don't use a sort block. So, leave off the useless comparison block.

    Keep on coding...
    —John </code>

      I re-posted the code using most of the suggestions. With regard to the version thing, I primarily include a version number for my own use, I wasn't trying to adhere to a standard. I didn't even know that there was a standard, to be honest. The sort block was a remnant from when I was going to try to keep the case of the words, rather than converting everything to lower case. I was going to do sort { lc($a) cmp lc($b) }, but ended up dropping the idea cuz it was getting too messy. (Dang, I had a remnant thing in my Fibonacci generator, too. Sorry 'bout that.)

      I tried to use your code for creating the list of line numbers, but I kept getting a trailing comma with the output. Any suggestions on that?
      ___________________
      Kurt
        I tried to use your code for creating the list of line numbers, but I kept getting a trailing comma with the output. Any suggestions on that?

        Easy: given $x is a list ref (stored as a hash element in the real program), say print join (',',@$x);

        re $VERSION, yea, it's a standard. You found the node on it? Short version for those just tuning in: you can specify a required version number when loading a module and it dies if the module found is too old.

        re preserving case but sorting case insensitive: look up the Swartzian Transform (spelling something like that).

        I'll go over your code again later tonight, perhaps.

        Keep it up!

        —John

Concordance Generator - TMTOWTDI
by John M. Dlugosz (Monsignor) on Aug 14, 2001 at 11:18 UTC
    These simple, common problems are good learning experiences.

    Here's mine, from a discussion on OO in January 1996.

    Notice that I used a totally different mechanism to prevent duplicate entries for one line.

    sub input { while (<>) { # parse the line into words my %temp; while (/\b\w+\b/g) { ++$temp{$&}; } # add the found words to my global results list my $key; while (($key)=each(%temp)) { push (@{$concord{$key}}, $.); } } } sub output { my @words= sort keys %concord; my $key; foreach $key (@words) { # print "$key @{$concord{$key}}\n" print ("$key ", join (', ', @{$concord{$key}}), "\n"); } } input; output;