in reply to Printing ten characters succeeding a matching string

Another way... perhaps more suited as a tutorial than as production code:

    #!/usr/bin/perl
    use Modern::Perl;
    # 934221
    # find a string, then print the next ten chars

    my @content = qw/abcdABCD1234567890xyz abcd12345ABCD0ABCD ABCD1234ABC qwertyABCD1234567890/;

    for my $content(@content) {
        $content =~ /.+?ABCD(.{10}).*/;
        say "Current array element is: $content";
        if ($1) {
            say "\t Next 10 char after the match: $1";
            next;
        }
    }

And the result of executing this script is:

    Current array element is: abcdABCD1234567890xyz
             Next 10 char after the match: 1234567890
    Current array element is: abcd12345ABCD0ABCD
    Current array element is: ABCD1234ABC
    Current array element is: qwertyABCD1234567890
             Next 10 char after the match: 1234567890

The second and third array elements don't satisfy the regex because there are NOT 10 chars after the last instance of the sequence ABCD.

SOLVED, below. Now a question for wiser heads: immediately after creating the array, add another element to @content, namely $content[4], with this line: push @content, 'ABCD 123 456 789';. Run the code. This is the output from 5.012 on a win32 box:

    Current array element is: abcdABCD1234567890xyz
             Next 10 char after the match: 1234567890
    Current array element is: abcd12345ABCD0ABCD
    Current array element is: ABCD1234ABC
    Current array element is: qwertyABCD1234567890
             Next 10 char after the match: 1234567890
    Current array element is: ABCD 123 456 789

Why doesn't the regex see a match in $content[4]?

<UPDATE:> And WTH does this minor revision,

    my @content = qw/abcdABCD1234567890xyz abcd12345ABCD0ABCD ABCD1234ABC qwertyABCD1234567890/;
    push @content, 'ABCD 123 456 789';
    say "===> \$content[4]: $content[4] \n\n";

    for my $content(@content) {
        $content =~ /.+?ABCD(.{10}).*/;
        say "Current array element is: $content";
        if ($1) {
            say "\t Next 10 char after the match: $1";
        }else{
            say "No match on $content";
        }
    }

...produce this:

    ===> $content[4]: ABCD 123 456 789


    Current array element is: abcdABCD1234567890xyz
             Next 10 char after the match: 1234567890
    Current array element is: abcd12345ABCD0ABCD
             Next 10 char after the match: 1234567890
    Current array element is: ABCD1234ABC
             Next 10 char after the match: 1234567890
    Current array element is: qwertyABCD1234567890
             Next 10 char after the match: 1234567890
    Current array element is: ABCD 123 456 789
             Next 10 char after the match: 1234567890

Duh! The answer to the stricken question is that $1 remains unchanged unless a new match is found... so its content is unchanged from the initial (successful) match when the regex fails on $content[1] and $content[2], then gets replaced (with the exact same thing) in $content[3] and remains unchanged when the regex fails on $content[4].
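The behaviour described above can be reproduced in isolation. What follows is only a minimal sketch: stale_capture_demo() and next_ten() are hypothetical names that do not appear in the original script, and it uses core pragmas in place of Modern::Perl so that it runs without extra modules. The first sub shows $1 surviving a failed match; the second reads $1 only when the match it just attempted succeeded.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical demo: $1 keeps the value from the last SUCCESSFUL match.
sub stale_capture_demo {
    my $hit  = 'abcdABCD1234567890xyz' =~ /.+?ABCD(.{10})/;    # succeeds, sets $1
    my $miss = 'ABCD1234ABC'           =~ /.+?ABCD(.{10})/;    # fails, $1 is left alone
    my $seen = $1;                                             # copy the leftover capture
    return $seen;
}

# Hypothetical helper: hand back the capture only if THIS match succeeded.
sub next_ten {
    my ($text) = @_;
    return $1 if $text =~ /.+?ABCD(.{10})/;
    return;    # undef in scalar context
}

say 'Leftover $1 after a failed match: ', stale_capture_demo();

for my $text (qw/abcdABCD1234567890xyz ABCD1234ABC/) {
    my $ten = next_ten($text);
    say "$text => ", defined $ten ? $ten : 'no match';
}
```

Testing the result of the match itself, rather than the truth of $1, is what keeps a stale capture from leaking into the next iteration.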

Update2: For a discussion of the "defensive programming" required to avoid the dumb coding in the stricken material, see What happens with empty $1 in regular expressions? (was: Regular Expression Question). The following code uses that practice, as best I understand it with regard to numbered captures:

    for my $content(@content) {
        # my $match = '';    # explicit but verbose
        # my $match;         # still explicit and only slightly less verbose; same effect
        my ($match) = $content =~ /.+?ABCD(.{10}).*/;    # less code; same effect
        say "Current array element is: $content";
        if ($match) {
            say "\t Next 10 char after the match: $1";
        }else{
            say "No match on $content";
        }
    }

:-) (...and apologies to all the electrons inconvenienced by the posting of the stricken part of this node)


Update 3 (10/30): Moritz pointed out that the initial code (the version I first tested) failed because $1 is not reset unless there is a new match. His kind comments led me to realize that I had already solved that problem (as in Update 2, above, on or about 10/27), which got me off that track. Lo and behold, curing the tunnel vision led to a more open-minded review of the regex. Aha! The explanation appears in the note 'SOLVED!':

    my @content = qw/abcdABCD1234567890xyz abcd12345ABCD0ABCD ABCD1234ABC qwertyABCD12diff7890/;
    # push @content, 'ABCD 123 456 789';    # See note "SOLVED!" below
    push @content, 'x ABCD 123 456 789';    # afterthought addition

    for my $content(@content) {
        my ($match) = $content =~ /.+?ABCD(.{10}).*/;    # avoid probs w/non-reset of $1
        say "Current array element is: $content";
        if ( $match ) {
            say "\t MATCH! Next 10 char after the match: $1";
        } else {
            say "\t No match on $content";
        }
    }

    =head

    # SOLVED! why last array element failed to match: it initially began with 'ABCD...'
    # BUT the regex required something ( '.+?' ) before ' ABCD(.{10} '
    # And a better fix would be to write the regex as:
    #    '/.*?ABCD(.{10}).*/'
    # or as: '/ABCD(.{10}).*/'

    C:\>934221.pl
    Current array element is: abcdABCD1234567890xyz
             MATCH! Next 10 char after the match: 1234567890
    Current array element is: abcd12345ABCD0ABCD
             No match on abcd12345ABCD0ABCD
    Current array element is: ABCD1234ABC
             No match on ABCD1234ABC
    Current array element is: qwertyABCD12diff7890
             MATCH! Next 10 char after the match: 12diff7890
    Current array element is: x ABCD 123 456 789
             MATCH! Next 10 char after the match: 123 456 7

    C:\>

    =cut
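To make the 'SOLVED!' note concrete, here is a small side-by-side sketch. The names $strict, $loose and capture_with() are assumptions made for illustration, and core pragmas stand in for Modern::Perl. It runs the original pattern, which insists on at least one character before ABCD, against the unanchored form from the note, using the two test strings from this update.

```perl
use strict;
use warnings;
use feature 'say';

my $strict = qr/.+?ABCD(.{10})/;    # needs at least one character before ABCD
my $loose  = qr/ABCD(.{10})/;       # ABCD may also start the string

# Hypothetical helper: list-context match, so a failure yields undef
# instead of a capture left over from an earlier success.
sub capture_with {
    my ($re, $text) = @_;
    my ($match) = $text =~ $re;
    return $match;
}

for my $text ('ABCD 123 456 789', 'x ABCD 123 456 789') {
    for my $pair ( [ strict => $strict ], [ loose => $loose ] ) {
        my ($name, $re) = @$pair;
        my $got = capture_with($re, $text);
        say "$name on '$text': ", defined $got ? "'$got'" : 'no match';
    }
}
```

Only the strict pattern fails on the string that begins with ABCD; both patterns agree once anything precedes it.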

Replies are listed 'Best First'.
Re^2: Printing ten characters succeeding a matching string (+ followup question)
by bluray (Sexton) on Oct 27, 2011 at 21:23 UTC
    Hello ww,

    Thanks for your input. Though I was able to compile it using the previous replies, I still have to create a header in the output file. The input file doesn't have a header and it starts from the first line. All the input files I have worked with before had some headings, so it was easy to reference them or add a new column header.

    I am also thinking about getting the frequency of the ten characters. That is, if a ten-character string matches more than once, I would like to print it only once and, in the second column, the number of times it was found.

      If you want to count and calculate frequencies for the 10 characters after matches, I'd suggest you make a hash and populate it dynamically, with something like:

      if ($1) {
          say "\t Next 10 char after the match: $1";
          $hash{$1}++;
          next;
      }
      just a slight modification... ;)

      update: Corrected typo pointed out by roboticus. Thanks ;)
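One way the hash suggestion could be combined with the header row and count column asked about earlier in this thread is sketched below. It rests on several assumptions: count_followers() is a hypothetical helper, the column names 'Sequence' and 'Count' are invented, the sample data is the array from the original node, and the pattern is the unanchored /ABCD(.{10})/ rather than the original one.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: tally every ten-character run that follows 'ABCD'.
sub count_followers {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        # With /g in list context the match returns every capture on the line.
        $count{$_}++ for $line =~ /ABCD(.{10})/g;
    }
    return \%count;
}

my @content = qw/abcdABCD1234567890xyz abcd12345ABCD0ABCD ABCD1234ABC qwertyABCD1234567890/;
my $count   = count_followers(@content);

# Header row first (column names are assumptions), then one line per distinct run.
say join "\t", 'Sequence', 'Count';
for my $seq ( sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count ) {
    say join "\t", $seq, $count->{$seq};
}
```

Counting inside the hash and printing afterwards means each run appears once, with the most frequent first.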