When I was in grad school, I had to refer to concordances a lot, especially for Shakespeare and Chaucer. The other day I was trying to find a particular line in a Shakespeare play (Othello, to be exact), and I thought it would be an entertaining programming exercise to write a concordance generator. Pass the code a text file and it generates a full concordance, listing the number of times each word appears in the text along with the line numbers where it appears. Alternatively, pass it a specific word (with -s) and a text file, and it returns the line(s) that contain that word.
Now before everyone starts asking why I didn't use strict, my answer is that I did, up until the moment I tried to use Getopt::Std. Obviously I'm missing something: in order to pass strict, I had to declare my $opt variables, but once I did that, the script ignored my command line flags. Any help in that regard would be greatly appreciated.
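For what it's worth, here is a minimal sketch of one strict-safe way to use Getopt::Std. By default, getopts() sets package variables such as $opt_s in the caller's package, so a my declaration creates a separate lexical that getopts() never touches; declaring them with use vars (as the script below does) or our avoids that. The other route, sketched here, is to pass getopts() a hash reference so no $opt_ variables are needed at all. The parse_args helper and the sample arguments are hypothetical, made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper (not part of the original script): parse the flags
# out of a list of arguments and return the options plus what is left.
sub parse_args {
    local @ARGV = @_;    # getopts() always reads from @ARGV
    my %opts;

    # With a hash ref as the second argument, getopts() stores the flags
    # in %opts instead of setting $opt_h / $opt_s package variables.
    getopts( 'hs:', \%opts )
        or die "Usage: $0 [-h] [-s word] filename\n";
    return ( \%opts, @ARGV );
}

# Sample arguments, standing in for a real command line.
my ( $opts, @rest ) = parse_args( '-s', 'pearl', 'othello.txt' );
print "search word: $opts->{s}\n" if defined $opts->{s};
print "file: $rest[0]\n";
```

In a real script you would call getopts( 'hs:', \%opts ) directly on @ARGV and then test $opts{h} and $opts{s}.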
Update: Modified code; it now runs under strict. Still tweaking. (By the way, the line in Othello I was looking for was the one about throwing away a pearl worth more than the whole tribe. I don't remember now why I was looking it up, but it seemed important at the time.)
#!/usr/bin/perl
#--------------------------------------------------------------------#
# Concordance Generator
# Date Written: 13-Aug-2001 04:02:11 PM
# Last Modified: 14-Aug-2001 04:14:00 PM
# Author: Kurt Kincaid
#
# This is free software and may be distributed under the
# same terms as Perl itself.
#
# A simple concordance generator, particularly useful for linguistic
# analysis.
#--------------------------------------------------------------------#
use strict;
use vars qw($opt_h $opt_s);
use Getopt::Std;
my @theseWords;
my @theseLines;
my @found;
my %Count;
my %Line;
my ( $line, $word, $count, $LineNum );
my $VERSION = "1.0";
getopts( "hs:" );
if ( $opt_h ) {
    Usage();
}
my $file = shift || Usage();
# Three-argument open with a lexical filehandle; report the real OS error.
open( my $in, '<', $file ) or die "Cannot open $file: $!\n";
@theseLines = <$in>;
close($in);
chomp @theseLines;
if ( $opt_s ) {
    Word($opt_s);
}
foreach $line ( @theseLines ) {
    $count++;    # current line number
    $line = lc $line;
    $line =~ s/[.,:;?!]//g;
    # Capture each word with $1 rather than $&, which slows every match
    # in the program on older perls.
    while ( $line =~ /\b(\w+)\b/g ) {
        $word = $1;
        $Count{$word}++;
        if ( defined $Line{$word} ) {
            # Skip if this line number is already the last one recorded.
            next if $Line{$word} =~ /(\d+)$/ && $1 == $count;
            $Line{$word} .= ", $count";
        } else {
            $Line{$word} = $count;
        }
        # Alternative, storing an array ref per word:
        # push @{$Line{$word}}, $count unless exists $Line{$word} && $Line{$word}[-1] == $count;
    }
}
@theseWords = sort keys %Count;
foreach $word ( @theseWords ) {
    # Array-ref variant (the newlines stay outside the join):
    # print "$word ($Count{$word}): ", join( ', ', @{ $Line{$word} } ), "\n\n";
    print("$word ($Count{$word}): $Line{$word}\n\n");
}
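sub Word {
    my $word = shift;
    # Number lines by position rather than keying a hash on the line text,
    # so that identical lines each keep their own line number.
    foreach $line ( @theseLines ) {
        $LineNum++;
        # \Q...\E treats the search term literally; \b limits it to whole words.
        print("$LineNum: $line\n") if $line =~ /\b\Q$word\E\b/i;
    }
    exit;
}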
sub Usage {
    print <<END;
Concordance Generator v$VERSION
$0 [-h] [-s word] filename
    -h  Print this screen.
    -s  Perform a search for a specific word with immediate context.
END
    exit;
}
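The commented-out push line in the main loop hints at storing line numbers in an array per word instead of building a comma-separated string. Here is a rough, self-contained sketch of that variant, assuming the same lowercasing and punctuation stripping as the script above. The build_concordance helper and the two sample lines are made up for illustration and are not from the play.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a concordance from a list of lines.
# Returns two hash refs: word => count, and word => [line numbers].
sub build_concordance {
    my @lines = @_;
    my ( %count, %where );
    my $n = 0;
    for my $text (@lines) {
        $n++;
        # Work on a lowercased copy so the caller's lines are untouched.
        ( my $clean = lc $text ) =~ s/[.,:;?!]//g;
        while ( $clean =~ /\b(\w+)\b/g ) {
            my $w = $1;
            $count{$w}++;
            my $seen = $where{$w} ||= [];
            # Record each line number once, even if the word repeats on it.
            push @$seen, $n unless @$seen && $seen->[-1] == $n;
        }
    }
    return ( \%count, \%where );
}

# Made-up sample input.
my ( $count, $where ) = build_concordance(
    'A pearl, a pearl!',
    'The tribe and the pearl.',
);
for my $w ( sort keys %$count ) {
    print "$w ($count->{$w}): ", join( ', ', @{ $where->{$w} } ), "\n";
}
```

With the two sample lines, "pearl" is counted three times and listed on lines 1 and 2. Keeping the numbers in an array means the last recorded line is just $seen->[-1], so there is no need to parse it back out of a string with a regex.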