Here's another way to approach the problem. It is, frankly, over-engineered, but I want to try to illustrate some general ideas I have found useful. Among them are:
This script runs correctly against the example input/output files posted here (and fixed here!) (update: but see Update below about use of /o modifier in the s/// substitution).
File scrub_ref_1.pl:
Output:use warnings; use strict; use autodie; use Data::Dump qw(dd); use constant TRIGGER => qr{ \b bar \b }xms; # arms a reference number MAIN: { # all lexical variables within this scope are isolated die "usage: $0 <input filename>" unless @ARGV == 1; # slurp entire file to memory. my $filename = $ARGV[0]; my $allfile = do { local $/; open my $fh_in, '<', $filename; <$fh_in>; }; # capture list of ref object numbers to delete. my @object_numbers = do { # all these substrings appear alone on a line. my $rx_blk_start = qr{ ^ begin $ }xms; # block start my $rx_ref_n = qr{ ^ \d+ $ }xms; # ref number my $rx_blk_end = qr{ ^ end $ }xms; # block end # this matches any stuff before end of block. my $rx_not_blk_end = qr{ (?! $rx_blk_end) . }xms; # return list of captures. $allfile =~ m{ $rx_blk_start \n ($rx_ref_n) \n # capture valid ref number for deletion $rx_not_blk_end* # any stuff before block end ${ \TRIGGER } # must appear at least once in block $rx_not_blk_end* # any stuff before block end $rx_blk_end }xmsg; }; # dd 'object_numbers', \@object_numbers; # FOR DEBUG # build regex of ref object numbers to delete. my ($rx_del_ref_n) = map qr{ \b (?: $_) \b }xms, join '|', reverse sort @object_numbers ; # print 'delete ref n regex', $rx_del_ref_n, "\n"; # FOR DEBUG # delete all ref n objects from text. $allfile =~ s{ ^ (?: [ ]* foo)? [ ]+ ref [ ] $rx_del_ref_n \n } {}xmsgo; # print "edited allfile [[$allfile]] \n"; # FOR DEBUG # save processed file to new file. my $out_filename = "$filename.removed"; open my $fh_out, '>', $out_filename; print $fh_out $allfile; close $fh_out; exit; } # end MAIN block die "unexpected exit from MAIN";
c:\@Work\Perl\monks\Anonymous Monk\1232492>perl scrub_ref_1.pl text.in c:\@Work\Perl\monks\Anonymous Monk\1232492>fc /b text.in.removed text. +in.removed.au Comparing files text.in.removed and TEXT.IN.REMOVED.AU FC: no differences encountered
Update: For some inexplicable reason, I used the /o modifier with the s/// substitution in the code above. This modifier (see Regexp Quote-Like Operators in perlop) forces the regex to be compiled once and only once during execution of the script. The substitution should "properly" be
$allfile =~ s{ ^ (?: [ ]* foo)? [ ]+ ref [ ] $rx_del_ref_n \n }
{}xmsg;
As the script stands, processing only one file per invocation, the /o modifier does no harm, but confers no benefit; the s/// match regex is compiled and executed only once in any case. A problem arises if this code, which appears to work perfectly well, is recycled into another script that processes multiple files per invocation, a natural extension. In this case, the $rx_del_ref_n dynamic regex compiled for the first file processed will be used for all subsequent files because the s/// into which it is interpolated will never be re-compiled. Depending on the data being processed, this bug may be very difficult to spot!
Give a man a fish: <%-{-{-{-<
In reply to Re: some efficiency, please (updated)
by AnomalousMonk
in thread some efficiency, please
by Anonymous Monk
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |