in reply to Pattern Matching With Regular Expressions
Next, you have a third foreach loop (nested within the second one), where you try to read the full content of FILE for each "$term" in "@inputs" -- but after reading the file for the first "$term", you don't "rewind" the file, which means there's no more data to be read until you close that file and open another one. That's why you're not getting as many matches as you expect.
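To see the end-of-file behavior in isolation, here is a minimal sketch. It assumes an in-memory filehandle and made-up sample data (neither is from your script), purely so it runs anywhere:

```perl
use strict;
use warnings;

# Hypothetical demo data: an in-memory "file" standing in for a real one.
my $data = "alpha beta\ngamma alpha\n";
open( my $fh, '<', \$data ) or die "open failed: $!";

my @first  = <$fh>;    # reads every line; the handle is now at end-of-file
my @second = <$fh>;    # nothing left to read, so this list comes back empty

seek( $fh, 0, 0 ) or die "seek failed: $!";    # "rewind" to the start
my @third = <$fh>;     # the lines are available again

printf "first: %d, second: %d, after seek: %d\n",
    scalar @first, scalar @second, scalar @third;
```

Rewinding with seek would make the nested loop "work", but it still re-reads the file once per term, which is the expensive part.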
To address the latter problem, rethink your logic -- in general, reading from a file is expensive, compared to looping over the elements of an array. So read through the file once, and for each line you read from the file, loop over the elements in @inputs to look for matches. (There are ways to combine a list of patterns into a single regex, by simply joining them together with "|", but we don't need to go there.)
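For the curious, the parenthetical about joining patterns with "|" might look roughly like this. It isn't needed for the approach described here; the terms and sample lines are invented for illustration, and quotemeta is my addition to keep odd characters in a term from being treated as regex syntax:

```perl
use strict;
use warnings;

# Hypothetical search terms, standing in for @inputs.
my @inputs = ( 'cat', 'dog' );

# Escape each term, then join them into one alternation: cat|dog
my $alternation = join '|', map { quotemeta($_) } @inputs;
my $any_term    = qr/\b($alternation)\b/;

# Hypothetical lines, standing in for what you'd read from the file.
my @lines = ( "the cat sat", "no match here", "hot dog stand" );
my @hits;
for my $line (@lines) {
    push @hits, $1 if $line =~ $any_term;    # $1 holds whichever term matched
}
print "matched: @hits\n";
```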
Another thing that will help is to use a text editor that makes it easy to do consistent indenting, and make your indentation consistent, to reflect looping and conditions.
There's a lot more that could be done to make the code easier to read, less bulky, and generally better. Here's one way to start:

    sub findtext {
        my ($fileargs, $inputs) = @_;    # pass references to arrays
        my @filenames;
        for my $arg ( @$fileargs ) {     # dereference this array
            push @filenames, grep /\w/, split( /\W+/, $arg );
        }
        for my $file ( @filenames ) {
            unless ( open( FILE, "/home/jroberts/$file.txt" ) ) {
                warn "open failed on $file: $!";
                next;
            }
            while (<FILE>) {
                for my $term ( @$inputs ) {    # dereference this array
                    next unless ( /\b$term\b/ );
                    # get here when there's a match...
                    # (not sure what you want to do here)
                }
            }
            close FILE;
        }
    }

A lot of your stuff with @before and @after is probably more complicated than it needs to be, but I didn't look at that part so closely... Maybe if you could describe in English (and/or with basic examples) what you're trying to accomplish, you'll figure out an easier way (and maybe the monks can help with that).
I don't really know where your @inputs is coming from, but I'd suggest that you pass it in as an array reference, to keep the subroutine "modular" (i.e. not dependent on a surrounding context of global variables -- this can be another good side-effect of "use strict"). Regarding this array, do be careful about metacharacters in the array elements -- things like ".&+@%^$" and brackets contained within $term will have their magical regex significance unless you put "\Q" and "\E" around the variable when doing the match.
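A small sketch of the metacharacter problem, using a made-up term and line (assumptions for illustration only):

```perl
use strict;
use warnings;

# Hypothetical term containing a regex metacharacter (the dot).
my $term = 'a.b';
my $line = 'axb is not the same as a.b';

# Unquoted: "." matches any character, so "axb" is picked up as well.
my @loose = $line =~ /\b($term)\b/g;

# Quoted: \Q...\E makes the dot literal, so only "a.b" matches.
my @exact = $line =~ /\b(\Q$term\E)\b/g;

print "unquoted: @loose\n";
print "quoted:   @exact\n";
```

In the subroutine, that would mean writing the test as /\b\Q$term\E\b/ if the terms are meant to be taken literally.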
Update: I get it now -- you're building a concordancer, that will produce a listing of "key words in context" (KWIC). This is a great exercise for honing your perl skills (even though there are numerous open-source and free-ware packages available on the web to do this already -- do a google search for "KWIC"). One suggestion: don't limit your context to individual lines of text -- line breaks are an arbitrary disruption of linguistic content, and it's better to just ignore them. Here's one way, assuming that your .txt files really are just plain text (without any markup or other noise):
    my %input;
    $input{$_} = undef for @$inputs;    # make this a hash
    $/ = undef;                         # look up $/ in perldoc perlvar
    unless ( open( FILE, ...)) {
        #blah
        next;
    }
    my $text = <FILE>;                  # $text holds entire file content.
    $text =~ s/\n\n+/ <P> /g;           # (optional: preserve paragraph boundaries)
    my @words = split( /\s+/, $text );
    for my $i (0..$#words) {
        next unless ( exists( $input{$words[$i]} ));
        # get here when there's a match, output matched
        # word along with "N" words of surrounding context
        # (left as an exercise...)
    }
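If you want a nudge on the exercise, here is one possible sketch, not the only way to do it. The sample text, the keyword, the context width $n, and the bracketed output format are all assumptions of mine:

```perl
use strict;
use warnings;

# Hypothetical sample text and keyword, standing in for the file
# content and @$inputs.
my $text  = "the quick brown fox jumps over the lazy dog near the fox den";
my %input = map { $_ => undef } ('fox');
my $n     = 2;    # words of context on each side (assumed value)

my @words = split /\s+/, $text;
my @kwic;
for my $i ( 0 .. $#words ) {
    next unless exists $input{ $words[$i] };

    # Clamp the context window to the bounds of the word list.
    my $start = $i - $n < 0       ? 0       : $i - $n;
    my $end   = $i + $n > $#words ? $#words : $i + $n;

    my $left  = join ' ', @words[ $start .. $i - 1 ];
    my $right = join ' ', @words[ $i + 1 .. $end ];
    push @kwic, sprintf( "%s [%s] %s", $left, $words[$i], $right );
}
print "$_\n" for @kwic;
```

Since the words sit in one array for the whole file, the context window crosses line breaks for free, which is the point of slurping the text.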