bluray has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

I was trying to get the 10 characters that come after a matching string and write them to the output file. Although I was able to do the matching and print the match to the output, I found it hard to print only the 10 characters that follow the matching word.

#!/usr/bin/perl -w
use strict;
use warnings;

my @input_files=<*.txt>;
my $local_count=0;
foreach my $input_file(@input_files) {
    unless (open(INPUT, $input_file)) {
        print "Cannot open file \"$input_file\"\n\n";
        exit;
    }
    my $sequence='ABCD';
    while (my $line=<INPUT>) {
        if ($local_count==0){
            my $outfile=$input_file;
            $outfile=~ s/\.txt/\.tag\.txt/gi;
            unless (open (OUTPUT, ">$outfile")) {
                print "Cannot open file \"$outfile\"\n\n";
                exit;
            }
        }
        chomp $line;
        if($line=~m/$sequence/i){
            $local_count++;
            print OUTPUT"$sequence\n";
        }
    }
}

Replies are listed 'Best First'.
Re: Printing ten characters succeeding a matching string
by roboticus (Chancellor) on Oct 27, 2011 at 19:51 UTC

    bluray:

    Look at the "Capture Buffers" section of perldoc perlre and you'll see how to do it. Here's a quick example:

    $ cat foo.pl
    my $t='the quick red fox jumped over the lazy brown dog.';
    if ($t=~/fox(.{10})/) {
        print "The 10 characters after fox are '$1'\n";
    }
    $ perl foo.pl
    The 10 characters after fox are ' jumped ov'
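A minimal sketch of how the same capture-group idea might be applied to a search string held in a variable, as in the original question. The helper name `chars_after` and its parameters are hypothetical, and the case-insensitive match is carried over from the question's `/i` flag.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the $len characters that follow each
# occurrence of $sequence in $line (case-insensitive, as in the question).
sub chars_after {
    my ($line, $sequence, $len) = @_;
    # \Q...\E quotes any regex metacharacters in $sequence;
    # /g in list context returns every captured group.
    my @found = $line =~ /\Q$sequence\E(.{$len})/gi;
    return @found;
}

my @tags = chars_after('the quick red fox jumped over the lazy brown dog.', 'fox', 10);
print "'$_'\n" for @tags;   # prints ' jumped ov'
```

Inside the file loop, each returned string could be printed to the output handle in place of `$sequence`.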

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Printing ten characters succeeding a matching string
by hbm (Hermit) on Oct 27, 2011 at 20:00 UTC

    Unrelated to your question, dots are literal on the right side of s///; no need to escape them:

    #$outfile =~ s/\.txt/\.tag\.txt/gi;
    $outfile =~ s/\.txt/.tag.txt/gi;
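A short sketch of the point about the two sides of s///. The helper name `tag_name` is hypothetical, and anchoring the pattern with `$` is an added assumption (it limits the change to a trailing extension), not something the thread's code does.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the output file name from an input name.
sub tag_name {
    my ($name) = @_;
    # Left side is a regex: \. matches a literal dot, $ anchors at the end.
    # Right side is an ordinary interpolated string: dots need no escape.
    (my $out = $name) =~ s/\.txt$/.tag.txt/i;
    return $out;
}

print tag_name('sample.txt'), "\n";   # prints sample.tag.txt
```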
      Thanks hbm and roboticus for the suggestions.
Re: Printing ten characters succeeding a matching string (+ followup question)
by ww (Archbishop) on Oct 27, 2011 at 20:31 UTC

    Another way... perhaps more suited as a tutorial than as production code:

    #!/usr/bin/perl
    use Modern::Perl;

    # 934221
    # find a string, then print the next ten chars

    my @content = qw/abcdABCD1234567890xyz abcd12345ABCD0ABCD ABCD1234ABC qwertyABCD1234567890/;

    for my $content(@content) {
        $content =~ /.+?ABCD(.{10}).*/;
        say "Current array element is: $content";
        if ($1) {
            say "\t Next 10 char after the match: $1";
            next;
        }
    }

    And the result of executing this script is:

    Current array element is: abcdABCD1234567890xyz
             Next 10 char after the match: 1234567890
    Current array element is: abcd12345ABCD0ABCD
    Current array element is: ABCD1234ABC
    Current array element is: qwertyABCD1234567890
             Next 10 char after the match: 1234567890

    The second and third array elements don't satisfy the regex because there are NOT 10 chars after the last instance of the sequence ABCD.

    SOLVED, below. Now a question for wiser heads: immediately after creating the array, add another element to @content, namely $content[4], with this line: push @content, 'ABCD 123 456 789';. Run the code. This is the output from 5.012 on a win32 box:

    Current array element is: abcdABCD1234567890xyz
             Next 10 char after the match: 1234567890
    Current array element is: abcd12345ABCD0ABCD
    Current array element is: ABCD1234ABC
    Current array element is: qwertyABCD1234567890
             Next 10 char after the match: 1234567890
    Current array element is: ABCD 123 456 789

    Why doesn't the regex see a match in $content[4]?

    <UPDATE:> And WTH does this minor revision,

    my @content = qw/abcdABCD1234567890xyz abcd12345ABCD0ABCD ABCD1234ABC qwertyABCD1234567890/;
    push @content, 'ABCD 123 456 789';
    say "===> \$content[4]: $content[4] \n\n";

    for my $content(@content) {
        $content =~ /.+?ABCD(.{10}).*/;
        say "Current array element is: $content";
        if ($1) {
            say "\t Next 10 char after the match: $1";
        }else{
            say "No match on $content";
        }
    }
    ...produce this:
    ===> $content[4]: ABCD 123 456 789


    Current array element is: abcdABCD1234567890xyz
             Next 10 char after the match: 1234567890
    Current array element is: abcd12345ABCD0ABCD
             Next 10 char after the match: 1234567890
    Current array element is: ABCD1234ABC
             Next 10 char after the match: 1234567890
    Current array element is: qwertyABCD1234567890
             Next 10 char after the match: 1234567890
    Current array element is: ABCD 123 456 789
             Next 10 char after the match: 1234567890

    Duh! The answer to the stricken question is that $1 remains unchanged unless a new match is found... so its content is unchanged from the initial (successful) match when the regex fails on $content[1] and $content[2], then gets replaced (with the exact same thing) in $content[3] and remains unchanged when the regex fails on $content[4].
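One way to sidestep a stale $1, sketched here under assumptions: read $1 only inside the branch guarded by the match itself. The helper name `next_ten` is hypothetical, and the regex drops the leading `.+?` for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 10 characters after 'ABCD',
# or undef when the pattern does not match.
sub next_ten {
    my ($text) = @_;
    # $1 is read only when this particular match succeeded,
    # so a value left over from an earlier match cannot leak in.
    if ($text =~ /ABCD(.{10})/) {
        return $1;
    }
    return undef;
}

for my $text (qw/abcdABCD1234567890xyz abcd12345ABCD0ABCD/) {
    my $ten = next_ten($text);
    print defined $ten ? "$text: $ten\n" : "$text: no match\n";
}
```

The first element prints 1234567890; the second prints "no match" because fewer than 10 characters follow either uppercase ABCD.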

    Update2: For a discussion of the "defensive programming" required to avoid the dumb coding in the stricken material, see What happens with empty $1 in regular expressions? (was: Regular Expression Question). The following code uses that practice, as best I understand it with regard to numbered captures:

    for my $content(@content) {
        # my $match = '';   # explicit but verbose
        # my $match;        # still explicit and only slightly less verbose; same effect
        my ($match) = $content =~ /.+?ABCD(.{10}).*/;   # less code; same effect
        say "Current array element is: $content";
        if ($match) {
            say "\t Next 10 char after the match: $1";
        }else{
            say "No match on $content";
        }
    }

    :-) (...and apologies to all the electrons inconvenienced by the posting of the stricken part of this node)


    Update 3 (10/30): Moritz pointed out that the initial code -- that which I initially tested -- failed because $1 is not reset unless there is a new match. His kind comments led me to discover that I had solved that (as in Update 2, above, o/a 10/27) and thus got me off that track. Lo and behold, curing the tunnel vision led to a more open-minded review of the regex. Aha! The explanation appears in the note 'SOLVED!':

    my @content = qw/abcdABCD1234567890xyz abcd12345ABCD0ABCD ABCD1234ABC qwertyABCD12diff7890/;
    # push @content, 'ABCD 123 456 789';   # See note "SOLVED!" below
    push @content, 'x ABCD 123 456 789';   # afterthought addition

    for my $content(@content) {
        my ($match) = $content =~ /.+?ABCD(.{10}).*/;   # avoid probs w/non-reset of $1
        say "Current array element is: $content";
        if ( $match ) {
            say "\t MATCH! Next 10 char after the match: $1";
        } else {
            say "\t No match on $content";
        }
    }

    =head

    # SOLVED! why last array element failed to match: it initially began with 'ABCD...'
    #    BUT the regex required something ( '.+?' ) before ' ABCD(.{10} '
    #    And a better fix would be to write the regex as:
    #           '/.*?ABCD(.{10}).*/'
    #    or as: '/ABCD(.{10}).*/'

    C:\>934221.pl
    Current array element is: abcdABCD1234567890xyz
             MATCH! Next 10 char after the match: 1234567890
    Current array element is: abcd12345ABCD0ABCD
             No match on abcd12345ABCD0ABCD
    Current array element is: ABCD1234ABC
             No match on ABCD1234ABC
    Current array element is: qwertyABCD12diff7890
             MATCH! Next 10 char after the match: 12diff7890
    Current array element is: x ABCD 123 456 789
             MATCH! Next 10 char after the match:  123 456 7

    C:\>

    =cut
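The effect of the leading `.+?` can be checked in isolation. This is a small sketch, with `after_abcd` as a hypothetical helper that takes the regex as a parameter so both variants run against the string that originally failed.

```perl
use strict;
use warnings;

# Hypothetical helper: apply $regex to $text and return its first
# capture, or undef when there is no match.
sub after_abcd {
    my ($text, $regex) = @_;
    my ($match) = $text =~ $regex;
    return $match;
}

my $text = 'ABCD 123 456 789';

# The first pattern demands at least one character before ABCD;
# the second lets the regex engine find ABCD at any position.
for my $regex (qr/.+?ABCD(.{10})/, qr/ABCD(.{10})/) {
    my $match = after_abcd($text, $regex);
    print "$regex: ", defined $match ? "'$match'" : 'no match', "\n";
}
```

Because an unanchored pattern is already tried at every starting position, the leading `.+?` adds nothing except the requirement that ABCD not be at the very start of the string.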
      Hello ww,

      Thanks for your input. Although I was able to put it together using the previous replies, I still have to create a header in the output file. The input file doesn't have a header; the data starts on the first line. All the input files I have worked with before had headings, so it was easy to reference them or add a new column header.

      I am also thinking about getting the frequency of each ten-character string. That is, if the same ten-character string is matched more than once, I would like to print it only once, with the number of times it was found in the second column.

        If you want to count frequencies for the 10 characters after matches, I'd suggest you make a hash and populate it dynamically, with something like:

        if ($1) {
            say "\t Next 10 char after the match: $1";
            $hash{$1}++;
            next;
        }
        just a slight modification... ;)

        update: Corrected typo pointed out by roboticus. Thanks ;)
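Putting the hash suggestion together with the header and two-column output asked for above, a rough sketch might look like this. The helper name `count_tags`, the sample lines, and the header labels `Tag` and `Count` are all assumptions; the thread does not specify them.

```perl
use strict;
use warnings;

# Hypothetical helper: count each 10-character string that follows
# $sequence across the given lines; returns a hash reference.
sub count_tags {
    my ($sequence, @lines) = @_;
    my %count;
    for my $line (@lines) {
        # /g in a while loop visits every occurrence on the line.
        while ($line =~ /\Q$sequence\E(.{10})/gi) {
            $count{$1}++;
        }
    }
    return \%count;
}

# Sample data standing in for lines read from an input file.
my @lines = ('xxABCD1234567890', 'ABCD1234567890yy', 'zzABCDqwertyuiop');
my $count = count_tags('ABCD', @lines);

# Header line first, then one row per distinct string, most frequent first.
print join("\t", 'Tag', 'Count'), "\n";
for my $tag (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print join("\t", $tag, $count->{$tag}), "\n";
}
```

With the sample lines, 1234567890 is listed with a count of 2 and qwertyuiop with a count of 1. Writing to a file instead would only need the two print statements pointed at an output handle.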