I think this sort of problem would be better served by the "index" and "substr" functions than by a regex. You need to present the target character (or the string being searched for) together with some user-specified range of its context, but for a regex to do that, it has to capture the context as well as the target. So what happens when the context contains the next occurrence of the target? For example, say the target is "e", the user wants to see 4 characters on either side, and the string is "many obese people": the following context for the first target ("se p") swallows the next target, so once the first regex match has taken it, it is no longer available to be matched as a target in its own right.
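To make the problem concrete, here is a small sketch of the capturing-regex approach on that same string. The pattern is only my guess at what such a regex might look like (it is not taken from the original question), and it treats the target as a literal string. Although the string contains four "e" characters, the loop reports only two matches: the second "e" is swallowed as trailing context of the first match, and the third is swallowed as leading context of the last one.

```perl
use strict;
use warnings;

my $str   = "many obese people";   # contains four "e" characters
my $graph = "e";                   # target, treated as a literal string
my $num   = 4;                     # context size on either side

my @hits;
# Each match consumes its own context, so the next match can only
# start after the trailing context of the previous one.
while ( $str =~ /(.{0,$num})\Q$graph\E(.{0,$num})/g ) {
    push @hits, "$1$graph$2";
}

print scalar(@hits), " matches\n";
print "[$_]\n" for @hits;
```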

Assuming you actually want to store all the matches, here's a possible approach using index and substr (assuming your "$graph" and "$num" represent the target character(s) and the context size):

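while (<>) {
    chomp;
    my $offset = 0;
    my $limit  = length $_;
    # index() returns -1 once there are no more occurrences
    while (( my $found = index( $_, $graph, $offset )) >= 0 ) {
        # clamp the window to the start and end of the line
        my $bgn = ( $found - $num > 0 ) ? $found - $num : 0;
        my $end = $found + length( $graph ) + $num;
        $end = $limit if $end > $limit;
        push @{$graph_contexts{substr( $_, $bgn, $end - $bgn )}}, $_;
        $offset = $found + 1;    # step by one so overlapping contexts are still found
    }
}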

(It takes a little practice to get around the "off-by-one" types of errors with this kind of approach, but once you solve that, it's fine. In this approach, if the context size is, say, 4 characters before and after, but the target shows up as the 2nd or last character in the string, the target will still be captured, and will include the shorter context.)
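As a quick check of the edge behaviour just described, here is a self-contained sketch of the same index/substr loop applied to the example string from above. The sub name context_windows is made up for illustration, and the sketch returns a plain list instead of filling a hash. With "e" as the target and a context size of 4, it should list all four occurrences, the last one with a shortened context because it sits at the end of the string.

```perl
use strict;
use warnings;

# context_windows() is a hypothetical helper name, not from the original post.
sub context_windows {
    my ( $line, $graph, $num ) = @_;
    my @windows;
    my $limit  = length $line;
    my $offset = 0;
    while ( ( my $found = index( $line, $graph, $offset ) ) >= 0 ) {
        # clamp the window to the start and end of the line
        my $bgn = $found - $num > 0 ? $found - $num : 0;
        my $end = $found + length($graph) + $num;
        $end = $limit if $end > $limit;
        push @windows, substr( $line, $bgn, $end - $bgn );
        $offset = $found + 1;    # step by one so overlapping contexts are kept
    }
    return @windows;
}

print "[$_]\n" for context_windows( "many obese people", "e", 4 );
```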

A more "brute force" (effective but perhaps less efficient) approach would be to simply go through all the substrings of $num*2+1 characters, and keep the ones that have $graph in the center position:

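my $sublen = $num * 2 + 1;

while (<INPUT>) {
    chomp;
    for my $ofs ( 0 .. length($_) - $sublen )
    {
        my $ngram = substr( $_, $ofs, $sublen );
        # keep only windows with $graph right after the first $num characters;
        # \Q...\E treats $graph as a literal string, not as a pattern
        next unless $ngram =~ /^.{$num}\Q$graph\E/;

        # store this ngram to your hash
    }
}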

(This one will only do the matches that have full context ($num characters) before and after the character being matched.)
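For comparison, here is a self-contained sketch of the brute-force version on the same example string. Again the sub name full_context_ngrams is made up for illustration, and it assumes a single-character target so that the target can sit exactly in the center. With "e" and a context size of 4 it should keep three windows and skip the final "e", which has no following context.

```perl
use strict;
use warnings;

# full_context_ngrams() is a hypothetical helper name, not from the original post.
sub full_context_ngrams {
    my ( $line, $graph, $num ) = @_;
    my $sublen = $num * 2 + 1;    # assumes a single-character target
    my @ngrams;
    # the range is empty when the line is shorter than $sublen
    for my $ofs ( 0 .. length($line) - $sublen ) {
        my $ngram = substr( $line, $ofs, $sublen );
        push @ngrams, $ngram if $ngram =~ /^.{$num}\Q$graph\E/;
    }
    return @ngrams;
}

print "[$_]\n" for full_context_ngrams( "many obese people", "e", 4 );
```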

In reply to Re: Performing a grep-like action multiple times on a single line. by graff
in thread Performing a grep-like action multiple times on a single line. by Anonymous Monk
