in reply to some efficiency, please

Here's another way to approach the problem. It is, frankly, over-engineered, but I want to try to illustrate some general ideas I have found useful. Among them are:

This script runs correctly against the example input/output files posted here (and fixed here!) (update: but see Update below about use of  /o modifier in the  s/// substitution).
File scrub_ref_1.pl:

use warnings; use strict; use autodie; use Data::Dump qw(dd); use constant TRIGGER => qr{ \b bar \b }xms; # arms a reference number MAIN: { # all lexical variables within this scope are isolated die "usage: $0 <input filename>" unless @ARGV == 1; # slurp entire file to memory. my $filename = $ARGV[0]; my $allfile = do { local $/; open my $fh_in, '<', $filename; <$fh_in>; }; # capture list of ref object numbers to delete. my @object_numbers = do { # all these substrings appear alone on a line. my $rx_blk_start = qr{ ^ begin $ }xms; # block start my $rx_ref_n = qr{ ^ \d+ $ }xms; # ref number my $rx_blk_end = qr{ ^ end $ }xms; # block end # this matches any stuff before end of block. my $rx_not_blk_end = qr{ (?! $rx_blk_end) . }xms; # return list of captures. $allfile =~ m{ $rx_blk_start \n ($rx_ref_n) \n # capture valid ref number for deletion $rx_not_blk_end* # any stuff before block end ${ \TRIGGER } # must appear at least once in block $rx_not_blk_end* # any stuff before block end $rx_blk_end }xmsg; }; # dd 'object_numbers', \@object_numbers; # FOR DEBUG # build regex of ref object numbers to delete. my ($rx_del_ref_n) = map qr{ \b (?: $_) \b }xms, join '|', reverse sort @object_numbers ; # print 'delete ref n regex', $rx_del_ref_n, "\n"; # FOR DEBUG # delete all ref n objects from text. $allfile =~ s{ ^ (?: [ ]* foo)? [ ]+ ref [ ] $rx_del_ref_n \n } {}xmsgo; # print "edited allfile [[$allfile]] \n"; # FOR DEBUG # save processed file to new file. my $out_filename = "$filename.removed"; open my $fh_out, '>', $out_filename; print $fh_out $allfile; close $fh_out; exit; } # end MAIN block die "unexpected exit from MAIN";
Output:
c:\@Work\Perl\monks\Anonymous Monk\1232492>perl scrub_ref_1.pl text.in c:\@Work\Perl\monks\Anonymous Monk\1232492>fc /b text.in.removed text. +in.removed.au Comparing files text.in.removed and TEXT.IN.REMOVED.AU FC: no differences encountered

Update: For some inexplicable reason, I used the  /o modifier with the  s/// substitution in the code above. This modifier (see Regexp Quote-Like Operators in perlop) forces the regex to be compiled once and only once during execution of the script. The substitution should "properly" be
    $allfile =~ s{ ^ (?: [ ]* foo)? [ ]+ ref [ ] $rx_del_ref_n \n }
                 {}xmsg;

As the script stands, processing only one file per invocation, the  /o modifier does no harm, but confers no benefit; the  s/// match regex is compiled and executed only once in any case. A problem arises if this code, which appears to work perfectly well, is recycled into another script that processes multiple files per invocation, a natural extension. In this case, the  $rx_del_ref_n dynamic regex compiled for the first file processed will be used for all subsequent files because the  s/// into which it is interpolated will never be re-compiled. Depending on the data being processed, this bug may be very difficult to spot!


Give a man a fish:  <%-{-{-{-<