
Here's another way to approach the problem. It is, frankly, over-engineered, but I want to try to illustrate some general ideas I have found useful. Among them are:

The MAIN block isolates all lexical variables; there are no global variables in the script.
Factoring regular expressions (to a rather extreme degree in this case!).
Dynamically building a regex from substrings extracted from the data; see haukex's article Building Regex Alternations Dynamically.

This script runs correctly against the example input/output files posted here (and fixed here!) (update: but see Update below about use of /o modifier in the s/// substitution).
File scrub_ref_1.pl:

use warnings;
use strict;
use autodie;

use Data::Dump qw(dd);

use constant TRIGGER => qr{ \b bar \b }xms;  # arms a reference number

MAIN: {  # all lexical variables within this scope are isolated

die "usage: $0 <input filename>" unless @ARGV == 1;

# slurp entire file to memory.

my $filename = $ARGV[0];

my $allfile = do {
    local $/;
    open my $fh_in, '<', $filename;
    <$fh_in>;
    };

# capture list of ref object numbers to delete.

my @object_numbers = do {

    # all these substrings appear alone on a line.
    my $rx_blk_start = qr{ ^ begin $ }xms;  # block start
    my $rx_ref_n     = qr{ ^ \d+   $ }xms;  # ref number
    my $rx_blk_end   = qr{ ^ end   $ }xms;  # block end

    # this matches any stuff before end of block.
    my $rx_not_blk_end = qr{ (?! $rx_blk_end) . }xms;

    # return list of captures.
    $allfile =~ m{
        $rx_blk_start \n
        ($rx_ref_n)   \n  # capture valid ref number for deletion
        $rx_not_blk_end*  # any stuff before block end
        ${ \TRIGGER }     # must appear at least once in block
        $rx_not_blk_end*  # any stuff before block end
        $rx_blk_end
        }xmsg;

    };
# dd 'object_numbers', \@object_numbers;  # FOR DEBUG

# build regex of ref object numbers to delete.

my ($rx_del_ref_n) =
    map qr{ \b (?: $_) \b }xms,
    join '|',
    reverse sort
    @object_numbers
    ;
# print 'delete ref n regex', $rx_del_ref_n, "\n";  # FOR DEBUG

# delete all ref n objects from text.

$allfile =~ s{ ^ (?: [ ]* foo)? [ ]+ ref [ ] $rx_del_ref_n \n }
             {}xmsgo;
# print "edited allfile [[$allfile]] \n";  # FOR DEBUG

# save processed file to new file.

my $out_filename = "$filename.removed";

open my $fh_out, '>', $out_filename;
print $fh_out $allfile;
close $fh_out;

exit;

}  # end MAIN block

die "unexpected exit from MAIN";
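For anyone who wants to play with the block-capture step without the original files, here is a minimal sketch of just that part. The sample text is made up for illustration (it is not the OP's data), and I've used a plain qr// in $rx_trigger in place of the TRIGGER constant to keep it short.

```perl
use strict;
use warnings;

# HYPOTHETICAL sample text, invented for this sketch; not the OP's input.
my $text = <<'END_TEXT';
begin
101
foo bar
end
begin
102
baz
end
begin
103
qux
bar quux
end
END_TEXT

# Factored sub-patterns; each of the first three appears alone on a line.
my $rx_blk_start = qr{ ^ begin $ }xms;  # block start
my $rx_ref_n     = qr{ ^ \d+   $ }xms;  # ref number
my $rx_blk_end   = qr{ ^ end   $ }xms;  # block end
my $rx_trigger   = qr{ \b bar \b }xms;  # stands in for TRIGGER

# Any single character that does not begin a block-end line.
my $rx_not_blk_end = qr{ (?! $rx_blk_end) . }xms;

# List context plus /g returns every captured ref number.
my @object_numbers = $text =~ m{
    $rx_blk_start \n
    ($rx_ref_n)   \n   # capture ref number
    $rx_not_blk_end*   # stuff before trigger, never crossing 'end'
    $rx_trigger        # must appear at least once in block
    $rx_not_blk_end*   # stuff after trigger, never crossing 'end'
    $rx_blk_end
    }xmsg;

print "@object_numbers\n";  # prints: 101 103
```

Block 102 is skipped because the negative lookahead stops the match from running past its own end line to find a trigger in a later block.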

Output:

c:\@Work\Perl\monks\Anonymous Monk\1232492>perl scrub_ref_1.pl text.in

c:\@Work\Perl\monks\Anonymous Monk\1232492>fc /b text.in.removed text.in.removed.au
Comparing files text.in.removed and TEXT.IN.REMOVED.AU
FC: no differences encountered
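On the dynamic alternation: the script gets away with a plain reverse sort because the alternation is wrapped in \b anchors and contains only digits. A more general sketch, assuming nothing about the strings, might escape each one with quotemeta and put the longest alternatives first. The numbers and text below are invented for illustration.

```perl
use strict;
use warnings;

# HYPOTHETICAL ref numbers; note that '12' is a prefix of '123'.
my @numbers = qw(12 123 7);

# Longest alternatives first, each escaped in case the strings
# ever contain regex metacharacters.
my $alternation =
    join '|',
    map { quotemeta }
    sort { length $b <=> length $a or $a cmp $b }
    @numbers;

my $rx_unanchored = qr{    (?: $alternation)    }xms;
my $rx_anchored   = qr{ \b (?: $alternation) \b }xms;

my $text = 'ref 123 ref 12 ref 1234 ref 7';

my @loose  = $text =~ m{ ref [ ] ($rx_unanchored) }xmsg;
my @strict = $text =~ m{ ref [ ] ($rx_anchored)   }xmsg;

print "alternation: $alternation\n";  # prints: alternation: 123|12|7
print "loose:  @loose\n";             # prints: loose:  123 12 123 7
print "strict: @strict\n";            # prints: strict: 123 12 7
```

Without the \b anchors, 'ref 1234' is wrongly matched as 123; with them it is left alone, which is the behaviour the deletion step needs.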

Update: For some inexplicable reason, I used the /o modifier with the s/// substitution in the code above. This modifier (see Regexp Quote-Like Operators in perlop) forces the regex to be compiled once and only once during execution of the script, so any later change to an interpolated variable is ignored. The substitution should "properly" be:
$allfile =~ s{ ^ (?: [ ]* foo)? [ ]+ ref [ ] $rx_del_ref_n \n }
{}xmsg;

As the script stands, processing only one file per invocation, the /o modifier does no harm, but confers no benefit; the s/// match regex is compiled and executed only once in any case. A problem arises if this code, which appears to work perfectly well, is recycled into another script that processes multiple files per invocation, a natural extension. In this case, the $rx_del_ref_n dynamic regex compiled for the first file processed will be used for all subsequent files because the s/// into which it is interpolated will never be re-compiled. Depending on the data being processed, this bug may be very difficult to spot!
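Here is a small sketch of that failure mode. The two "files" are just in-memory strings I made up, and scrub_with_o() and scrub_without_o() are hypothetical helpers that differ only in the /o modifier; the substitution pattern is simplified from the one in the script.

```perl
use strict;
use warnings;

# HYPOTHETICAL stand-ins for two input files: the ref numbers to
# delete from each, and the text to delete them from.
my @jobs = (
    { numbers => [ 11, 12 ], text => "ref 11\nref 12\nref 21\n" },
    { numbers => [ 21 ],     text => "ref 11\nref 12\nref 21\n" },
);

# Buggy when called more than once: /o freezes the regex as it was
# compiled on the first call.
sub scrub_with_o {
    my ($rx, $text) = @_;
    $text =~ s{ ^ ref [ ] $rx \n }{}xmsgo;
    return $text;
}

# Correct: $rx is interpolated afresh on every call.
sub scrub_without_o {
    my ($rx, $text) = @_;
    $text =~ s{ ^ ref [ ] $rx \n }{}xmsg;
    return $text;
}

my (@with_o, @without_o);
for my $job (@jobs) {
    my ($rx) = map qr{ \b (?: $_) \b }xms, join '|', @{ $job->{numbers} };
    push @with_o,    scrub_with_o($rx, $job->{text});
    push @without_o, scrub_without_o($rx, $job->{text});
}

for my $i (0 .. $#jobs) {
    (my $w = $with_o[$i])    =~ tr/\n/,/;  # commas for one-line display
    (my $n = $without_o[$i]) =~ tr/\n/,/;
    print "job $i  with /o: $w  without /o: $n\n";
}
```

For the second job only ref 21 should be deleted, leaving refs 11 and 12. The /o version instead deletes 11 and 12 again and leaves ref 21, because it is still using the regex built for the first job.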

Give a man a fish: <%-{-{-{-<

In reply to Re: some efficiency, please (updated) by AnomalousMonk
in thread some efficiency, please by Anonymous Monk
