I think this sort of problem would be better served by the "index" and "substr" functions than by a regex. You need to present the target character (or string being searched for) together with some user-specified range of its context, but for a regex to do that, it has to capture the context as well as the target. What if that context contains the next occurrence of the target? For example, say the target is "e", the user wants to see 4 characters on either side, and the string is "many obese people". The following context for the first "e" ("se p") consumes the second "e", and the match ends right before the third "e", leaving that one with no preceding context. Once the first regex match has taken those characters, they are not available for the next match.
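To make that concrete, here is a small sketch. The `.{0,$num}` pattern and the variable values are my own assumptions about what such a context-capturing regex might look like, not anyone's actual code:

```perl
use strict;
use warnings;

my $str   = "many obese people";
my $graph = "e";    # target character (assumed for illustration)
my $num   = 4;      # context size (assumed for illustration)

# Count how many times the target really occurs in the string.
my $occurrences = () = $str =~ /\Q$graph\E/g;

# A context-capturing regex: each match consumes its trailing context,
# so a target inside that context is never matched in its own right.
my @regex_hits;
while ( $str =~ /(.{0,$num}\Q$graph\E.{0,$num})/g ) {
    push @regex_hits, $1;
}

print "occurrences: $occurrences\n";
print "regex hits:  ", scalar(@regex_hits), "\n";
print "  [$_]\n" for @regex_hits;
```

The regex reports fewer hits than there are occurrences of the target, which is the gap the index/substr approach is meant to close.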
If you actually want to store all the matches, here's a possible approach using index and substr (assuming your "$graph" and "$num" hold the target character(s) and the context size):
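    while (<>) {
        chomp;
        my $offset = 0;
        my $limit  = length();
        # index() returns -1 once there are no more occurrences
        while ( ( my $found = index( $_, $graph, $offset ) ) >= 0 ) {
            # clamp the context window to the ends of the line
            my $bgn = ( $found - $num > 0 ) ? $found - $num : 0;
            # use length($graph) so a multi-character target keeps its full trailing context
            my $end = $found + length($graph) + $num;
            $end = $limit if $end > $limit;
            push @{ $graph_contexts{ substr( $_, $bgn, $end - $bgn ) } }, $_;
            # resume one character past this hit, so overlapping contexts are kept
            $offset = $found + 1;
        }
    }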
(It takes a little practice to avoid "off-by-one" errors with this kind of approach, but once you get that right, it works fine. If the context size is, say, 4 characters before and after, but the target shows up as the 2nd or last character in the string, the target will still be captured, along with whatever shorter context is available.)
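As a quick check of the clamping behaviour, here is the same idea wrapped in a sub and run on the example string. The name `contexts_for` is hypothetical, and the sub adds length($graph) to the end offset so that multi-character targets keep their full trailing context; treat it as a sketch rather than a tested library routine:

```perl
use strict;
use warnings;

# Hypothetical helper: return every occurrence of $graph in $line,
# each with up to $num characters of context on either side.
sub contexts_for {
    my ( $line, $graph, $num ) = @_;
    return () unless length $graph;    # an empty target would never advance
    my @contexts;
    my $offset = 0;
    my $limit  = length $line;
    while ( ( my $found = index( $line, $graph, $offset ) ) >= 0 ) {
        # clamp the window to the start and end of the line
        my $bgn = $found - $num > 0 ? $found - $num : 0;
        my $end = $found + length($graph) + $num;
        $end = $limit if $end > $limit;
        push @contexts, substr( $line, $bgn, $end - $bgn );
        $offset = $found + 1;    # step one character so overlaps are kept
    }
    return @contexts;
}

print "[$_]\n" for contexts_for( "many obese people", "e", 4 );
```

For "many obese people" this prints four contexts, one per "e": [y obese p], [obese peo], [se people] and [eople]. The last one is shorter because the target is the final character of the string.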
A more "brute force" (effective but perhaps less efficient) approach would be to simply go through all the substrings of $num*2+1 characters, and keep the ones that have $graph in the center position:
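    my $sublen = $num * 2 + 1;    # $num before + the target character + $num after
    while (<INPUT>) {
        chomp;
        for my $ofs ( 0 .. length() - $sublen ) {
            my $ngram = substr( $_, $ofs, $sublen );
            # \Q...\E keeps any regex metacharacters in $graph literal
            next unless $ngram =~ /^.{$num}\Q$graph\E/;
            # store this ngram to your hash
        }
    }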
(This one will only do the matches that have full context ($num characters) before and after the character being matched.)
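For comparison, here is a sketch of the sliding-window idea as a sub, run on the same example string. The name `centered_ngrams` is hypothetical, it assumes $graph is a single character, and it checks the center position with substr and eq instead of a regex:

```perl
use strict;
use warnings;

# Hypothetical helper: slide a window of 2*$num+1 characters along $line
# and keep the windows whose center character is $graph.
# Assumes $graph is exactly one character long.
sub centered_ngrams {
    my ( $line, $graph, $num ) = @_;
    my $sublen = $num * 2 + 1;
    my @ngrams;
    # the range is empty when the line is shorter than one window
    for my $ofs ( 0 .. length($line) - $sublen ) {
        my $ngram = substr( $line, $ofs, $sublen );
        push @ngrams, $ngram if substr( $ngram, $num, 1 ) eq $graph;
    }
    return @ngrams;
}

print "[$_]\n" for centered_ngrams( "many obese people", "e", 4 );
```

On "many obese people" with a context size of 4 this prints [y obese p], [obese peo] and [se people]. The final "e" is skipped because it has no characters after it.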
Try throwing a g modifier at the end of that regex and see what happens.
------
We are the carpenters and bricklayers of the Information Age.
Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.