Performing a grep-like action multiple times on a single line.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse a word list that looks like this:

dog
epithet
mannerism
...

From this list, I would like to extract all unique grapheme contexts for a user-specified grapheme, with a user-specified context limit. The code below does this for the first occurrence of the grapheme, but I don't know how to repeat the action for later occurrences in the same word (e.g. the two 'e's in epithet). Here is the meat of what I have so far:

# $num = user-specified context maximum
# $graph = user-specified grapheme to search for
# note: %graph_contexts also holds the example words
# corresponding to each context.

while(<INPUT>) {

    chomp($_);
    if( /$graph/ ) {
        ($front, $back) = /(.{0,$num})$graph(.{0,$num})/;
        $con = $front.'_'.$back;
        push( @{$graph_contexts{$con}}, $_);
    } else {
        next;
    }
}


Replies are listed 'Best First'.
Re: Performing a grep-like action multiple times on a single line. by graff (Chancellor) on Mar 20, 2004 at 05:08 UTC
I think this sort of problem would be better served by the "index" and "substr" functions than by a regex. You need to present the target character (or string being searched for) together with some user-specified range of its context, but for a regex to do that, it has to capture the context as well as the target. What if the context contains the next occurrence of the target? For example, the target is "e", the user wants to see 4 characters on either side, and the string is "many obese people": the trailing context of the first target consumes the next two targets, so once the first regex match takes them, they are not available for the next match.

Assuming you actually want to store all the matches, here's a possible approach using index and substr (assuming your "$graph" and "$num" represent the target character(s) and the context size):

while (<>) {
    chomp;
    my $offset = 0;
    my $limit = length();
    while (( my $found = index( $_, $graph, $offset )) >= 0 ) {
        my $bgn = ( $found - $num > 0 ) ? $found - $num : 0;
        my $end = ( $found + $num + 1 < $limit ) ? $found + $num + 1 : $limit;
        push @{$graph_contexts{substr( $_, $bgn, $end - $bgn )}}, $_;
        $offset = $found + 1;
    }
}

(It takes a little practice to get around the "off-by-one" types of errors with this kind of approach, but once you solve that, it's fine. In this approach, if the context size is, say, 4 characters before and after, but the target shows up as the 2nd or last character in the string, the target will still be captured, and will include the shorter context.)

A more "brute force" (effective but perhaps less efficient) approach would be to simply go through all the substrings of $num*2+1 characters, and keep the ones that have $graph in the center position:

$sublen = $num * 2 + 1;
while (<INPUT>) {
    chomp;
    for my $ofs ( 0 .. length() - $sublen ) {
        my $ngram = substr( $_, $ofs, $sublen );
        next unless $ngram =~ /^.{$num}$graph/;
        # store this ngram to your hash
    }
}

(This one will only do the matches that have full context ($num characters) before and after the character being matched.)
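The index/substr idea can also be adapted to produce the "front_back" keys that the original question builds. What follows is only a minimal sketch under some assumptions: the helper name contexts_by_index and the inline word list are hypothetical (they do not appear in the thread), and an empty $graph is rejected up front because index would otherwise never stop finding it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns one "front_back" key
# for every occurrence of $graph in $word, with at most $num characters
# of context on each side.
sub contexts_by_index {
    my ($word, $graph, $num) = @_;
    return () unless length $graph;    # assumption: empty targets are skipped
    my @contexts;
    my $offset = 0;
    while ((my $found = index($word, $graph, $offset)) >= 0) {
        # Clamp the start of the leading context to the start of the word.
        my $bgn   = $found - $num > 0 ? $found - $num : 0;
        my $front = substr($word, $bgn, $found - $bgn);
        # substr returns fewer characters when the word ends early.
        my $back  = substr($word, $found + length($graph), $num);
        push @contexts, $front . '_' . $back;
        $offset = $found + 1;          # resume just past this occurrence
    }
    return @contexts;
}

# Illustrative word list, taken from the sample in the question.
my %graph_contexts;
for my $word (qw(dog epithet mannerism)) {
    push @{ $graph_contexts{$_} }, $word for contexts_by_index($word, 'e', 2);
}
print "$_: @{ $graph_contexts{$_} }\n" for sort keys %graph_contexts;
```

Because the search resumes one character past each hit, neighbouring targets that sit inside each other's context are still reported separately.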
Re: Performing a grep-like action multiple times on a single line. by dragonchild (Archbishop) on Mar 19, 2004 at 19:58 UTC
Try throwing a g modifier at the end of that regex and see what happens.

------
We are the carpenters and bricklayers of the Information Age.
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
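One way to try the g modifier without letting the context captures swallow later targets is to match only the grapheme and read the context from the match offsets. This is a rough sketch, not code from the thread: the helper name contexts_by_match is hypothetical, \Q...\E is added on the assumption that $graph is literal text and not a pattern, and an empty $graph is skipped because an empty pattern has special meaning in Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): walks every /g match of the
# literal target and builds "front_back" keys from the match offsets.
sub contexts_by_match {
    my ($word, $graph, $num) = @_;
    return () unless length $graph;    # assumption: empty targets are skipped
    my @contexts;
    # Only the grapheme is matched, so /g never consumes the context.
    while ($word =~ /\Q$graph\E/g) {
        my ($start, $end) = ($-[0], $+[0]);    # offsets of this match
        my $bgn   = $start > $num ? $start - $num : 0;
        my $front = substr($word, $bgn, $start - $bgn);
        my $back  = substr($word, $end, $num);
        push @contexts, $front . '_' . $back;
    }
    return @contexts;
}

print join(', ', contexts_by_match('epithet', 'e', 2)), "\n";
```

Unlike a search that restarts one character after each hit, a /g loop resumes after the end of the previous match, so occurrences of a multi-character target that overlap each other would not all be found this way.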