in reply to A copyeditor needs help to get started with a Perl project
Personally, I'd stick to dealing with the text as text. The format of Word documents is horribly complex, badly if at all documented, and subject to variation from version to version of word. It also contains a heap of stuff that are nothing to do with the content or even it's markup. (I should also add that I don't use word for anything. wordpad is as much of a word processor as I will ever need.)
As a first pass, I would
Present the terms (+flags +counts) as a scrolling list.
And here's some crude, incomplete code to do most of that. Fixing it up to use a Win32::Console or one of the GUI interfaces etc. is left as an exercise :)
#! perl -slw use strict; use List::Util qw[ sum first ]; use Term::ReadKey; use Win32::Clipboard; $| = 1; my $clip = Win32::Clipboard; our $WORDS ||= 5; our $DICT ||= 'words'; my %dict; open WORDS, '<', $DICT or die "$DICT : $!"; chomp and $dict{ $_ } = 1 while <WORDS>; close WORDS; my $re_word = qr[\S+\s+]; my $re_context = qr[ ( $re_word {1,$WORDS} ( [''""]? \b \S+ (?: [''-]? \S+ )? [''""]? \b ) [;:!?,.]*? \s+ + $re_word {1,$WORDS} ) ]ix; open TEXT, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!"; local $/ = ''; my %terms; while( <TEXT> ) { tr[\n\r][ ]; s[\s+][ ]g; while( m[$re_context]g ) { my( $context, $term ) = ( $1, $2 ); $context =~ s[\b\Q$term\E\b][<<$term>>]; push @{ $terms{ $term } }, $context; } } close TEXT; ReadMode 2; my @terms = sort{ lc($a) cmp lc($b) || $a cmp $b } keys %terms; #printf "%d total words with %d unique terms\n", # sum map{ scalar @{ $terms{ $_ } } } keys %terms, # scalar keys %terms; my( $redraw, $c ); { for( my $i=0; $i < @terms; $i++ ) { my $term = $terms[ $i ]; $redraw and do { system 'cls'; for( my $j = $i - $redraw; $j <= $i; $j++ ) { printf "%-20.20s %1s : %d\n", $terms[ $j ], exists $dict{ lc $terms[ $j ] } ? ' ' : '*', scalar @{ $terms{ $terms[ $j ] } }; } $redraw = 0; }; printf "%-20.20s %1s : %d\t", $term, exists $dict{ lc $term } ? ' ' : '*', scalar @{ $terms{ $term } }; 1 until defined( $c = ReadKey .5 ); $c eq 'q' and last; $c eq '?' and do{ system 'cls'; print <<'USAGE'; <BACKSPACE> to backup. [j]ump + (a..z) -- jump to words beginning with <letter>. [c]ontext -- displays the context(s) surrounding the current + word. The context will be set to the clipboard as it is displayed to allow easy searching within an editor (Word). [?] -- this text. [q] (twice) -- exit program [q] + [r] -- to restart the list. [esc] (sometimes twice) to exit context and help modes. '*' next to a term indicates not found in dictionary. nn indicates the frequency of the termin teh text. <<word>> indicates the position of the term in context. USAGE ; 1 until defined( $c = ReadKey .5 ) and ord( $c ) == 27; $redraw = 10; }; $c =~ 'j' and do{ $c = undef; $c = lc( ReadKey 0.5 ) until defined $c and $c =~ m[[a-z]] +; $i = 0; $i++ while $i < @terms and $terms[ $i ] !~ m[^$c]i; $i--; print $/; $redraw = 10; next; }; $c eq "\b" and do{ $i -= 2 ; $i = -1 if $i < -1; printf( "\r" ); next; }; $c eq 'c' and do{ system 'cls'; for my $context ( @{ $terms{ $term } } ) { printf "%20.20s : ... %s ...\n", $term, $context; $context =~ s[<<\Q$term\E>>][$term]; $clip->Set( $context ); 1 until defined( $c = ReadKey .5 ); last if ord( $c ) == 27; } 1 until defined( $c = ReadKey .5 ) and ord( $c ) == 27; $redraw = 10; }; print $/; } printf 'Eof list [R]edo or [Q]uit'; 1 until defined( $c = lc( ReadKey 0.5 ) ) and $c =~ m[r|q]; print $/; redo if $c eq 'r'; }
A few screen shots :)
The main view
Encomium : 1 encomiums : 1 end : 3 ends : 2 enemies : 1 enigmatic : 1 enjoy : 1 enjoyed : 1 enough : 4 entertained : 1 Enthymeme : 3 enthymeme : 2 Enthymemes * : 2 enthymemes * : 1 envy : 3 Epicharmus * : 1 epigrammatic : 1 epilogue : 2 Epilogue-it * : 1 epilogues : 1 Epimenides * : 1 epithets : 2 equal : 1
The context view for 'Enthymemes':
Enthymemes : ... the premisses or conclusions of <<Enthymeme +s>>, considered apart from the rest ... Enthymemes : ... We now come to the <<Enthymemes>>, and will + begin the subject ...
Like I said, crude, but a simple, effective?, possible approach.
Janitored by Arunbear - added readmore tags, as per Monastery guidelines
|
|---|