in reply to some efficiency, please

Is there a good reason why you need to read the entire file into memory at once? If you're doing the removal process when you read the files, then you might want to do so while reading the file line-by-line. By the way, I'm not sure how your regexes line up with the data you showed: for example, you say "foo ref n", but the regex seems to allow spaces before the "foo". Please show an SSCCE that includes short but representative sample input data and the expected output for that input.

open my $fh, '<', $filename or die "$filename: $!";
while (<$fh>) {
    next if /^(?:foo )?ref \d+\b/;
    chomp;
    # process the line, for example:
    push @lines, $_;
}
close $fh;

Re^2: some efficiency, please
by Anonymous Monk on Apr 12, 2019 at 15:37 UTC
    Sorry, I am trying to simplify things, and likely making them more complicated. :(

    I think I can provide some sample data easier than I can change the code to make it work on the sample.

    This is what the input would look like:

    ref 1
    ref 2
    ref 3
    ref 4
    begin
    1
    end
    begin
    2
    bar
    end
    begin
    3
    end
    begin
    4
    bar
    end
    ref 5
    foo ref 6
    ref 7
    begin
    5
    end
    begin
    6
    bar
    begin
    7
    bar
    end

    So I am trying to remove only the "ref n" lines (not the n lines themselves), and only for paragraphs where "bar" appears in the paragraph. The output should look like this:

    ref 1
    ref 3
    begin
    1
    end
    begin
    2
    bar
    end
    begin
    3
    end
    begin
    4
    bar
    end
    ref 5
    begin
    5
    end
    begin
    6
    bar
    begin
    7
    bar
    end

    So I do (think I) need to pass through the file twice - once to find the references I want to remove, and once to actually remove them.

      So I do (think I) need to pass through the file twice - once to find the references I want to remove, and once to actually remove them.

      I took that as a challenge ;-) This only needs a single pass by reversing both the input and output by piping it through tac, and produces your desired output:

      use warnings;
      use strict;

      die "Usage: $0 INFILE\n" unless @ARGV==1;
      my $INFILE = shift @ARGV;

      open my $ofh, '|-', 'tac' or die "tac (out): $!";
      open my $ifh, '-|', 'tac', $INFILE or die "tac $INFILE: $!";

      my ($aminblock,$prevnum,$foundstr);
      my %found;
      while (<$ifh>) {
          chomp;
          my $out=1;
          if (!$aminblock) {
              if (/^end$/) { undef $foundstr; $aminblock=1 }
              elsif (/^\s*(?:foo\s+)?ref\s+(\d+)\s*$/) {
                  die "ref $1 without block?" unless exists $found{$1};
                  $out = !$found{$1};
              }
              else { die "unexpected outside of a block: $_" }
          }
          else {
              if (/^\s*(\d+)\s*$/) { $prevnum=$1 }
              elsif (/^begin$/) {
                  die "block ended without number?" unless defined $prevnum;
                  $found{$prevnum} = $foundstr;
                  undef $prevnum;
                  $aminblock=0;
              }
              else { undef $prevnum; if (/bar/) { $foundstr=1 } }
          }
          print {$ofh} $_, "\n" if $out;
      }
      close $ifh or die "tac $INFILE: ".($!||"\$?=$?");
      close $ofh or die "tac (out): ".($!||"\$?=$?");

      Although the two passes through tac might actually make that less efficient for large files. Here's a two-pass version:

      use warnings;
      use strict;

      die "Usage: $0 INFILE\n" unless @ARGV==1;
      my $INFILE = shift @ARGV;

      use constant { STATE_IDLE=>0, STATE_BEGIN=>1, STATE_INBLOCK=>2 };

      open my $fh, '<', $INFILE or die "$INFILE: $!";
      my %found;
      my $state = STATE_IDLE;
      my $curnum;
      for my $pass (1..2) {
          while (<$fh>) {
              chomp;
              my $out = 1;
              if ($state==STATE_IDLE) {
                  if (/^\s*(?:foo\s+)?ref\s+(\d+)\s*$/) { $out=!$found{$1} }
                  elsif (/^begin$/) { $state=STATE_BEGIN }
                  else { die "unexpected in state $state: $_" }
              }
              elsif ($state==STATE_BEGIN) {
                  if (/^\s*(\d+)\s*$/) { $curnum=$1; $state=STATE_INBLOCK }
                  else { die "unexpected in state $state: $_" }
              }
              elsif ($state==STATE_INBLOCK) {
                  if (/^end$/) { $state=STATE_IDLE }
                  elsif (/bar/) { $found{$curnum}=1 }
              }
              else { die "bad state $state" }
              print $_, "\n" if $pass==2 && $out;
          }
          die "unexpected state at eof: $state" unless $state==STATE_IDLE;
          seek $fh, 0, 0 or die "seek $INFILE: $!";
      }
      close $fh;

      Update: Note that these solutions don't remove ref N lines if they appear inside begin...end blocks; this was an assumption I made, but it's actually unclear what the desired behavior is in that case?

      Oops, left out an end line in the example data.

      begin
      6
      bar

      SHOULD BE:

      begin
      6
      bar
      end
Re^2: some efficiency, please
by Anonymous Monk on Apr 12, 2019 at 15:51 UTC
    This should better reflect what I am actually trying to do (assuming I didn't make any errors).

    #!/usr/bin/perl -w
    use strict;

    local $/=undef;
    my @objects;

    # check for basic syntax
    if ($#ARGV < 0) {
        die "Usage: program.pl file.text\n";
    }

    my $rgxpar = qr{(^begin\n(\d+)\n.*?^end$)}mos;

    open (FILNAM, '<', $ARGV[0]) or die "Can't open $ARGV[0] for reading.\n";
    my $allfile = <FILNAM>;
    close FILNAM or die "Can't close $ARGV[0] for reading.\n";

    while ($allfile =~ /$rgxpar/g) {
        my $objectref = 'ref' . $2;
        if ($1 =~ /bar/ ) {
            push (@objects, $objectref);
        }
    }

    for (@objects) {
        $allfile =~ s/^ *(foo)? +$_\n//mn;
    }

    open ( OUTFIL, '>', "$ARGV[0].removed") or die "Can't open $ARGV[0].removed for writing.\n";
    print OUTFIL $allfile;
    close OUTFIL or die "Can't close $ARGV[0].removed for writing.\n";
      If the files are very large, you'll spend more time disk swapping than actually reading/writing.

      Make 2 passes: Record all of the "ref" numbers you want to delete in the first pass (use a hash), then reread the file, printing it out according to whether a ref value is in the hash.

      But to do this well, with multiline data, you'll have to tell us what a "paragraph" is, because it's not clear to me from your description.

      It might look something like this:

      my %ignore;

      # First pass
      while (<FH>) {
          $ignore{$1} = 1 if some_condition($_);
      }

      # Second pass
      # reset the file to the beginning
      seek FH, 0, 0;
      while (<FH>) {
          if (m/matches interesting string with (capture)/) {
              if (exists($ignore{$1})) {
                  next; # don't print this line
              }
          }
          print;
      }

      The trick, of course, is some_condition.

      If it's hard to put a single paragraph into a regex, just note the signposts with flags. Something like this for the first pass:

      my $in_paragraph;
      my $bar;
      my %ignore;
      while (<FH>) {
          if (m/start of paragraph/) {
              $in_paragraph = 1;
              $bar = 0;
              next;
          }
          if (m/end of paragraph/) {
              $in_paragraph = 0;
              next;
          }
          if (m/line with bar/) {
              $bar = 1;
              next;
          }
          if (m/line with ref (\d+)/) {
              if ($in_paragraph and $bar) {
                  $ignore{$1} = 1;
              }
              next;
          }
      }

      And then something very similar to that in the 2nd pass, except printing or not printing based on your logic. (If you were very clever, you could reuse that code, with a tweak, passing a parameter for the pass number. But don't get clever until it works.)
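For what it's worth, the "reuse the code with a pass parameter" idea could look roughly like the sketch below. It is only an illustration: the sub name scan, its arguments, and the tiny in-memory sample are made up here, and it assumes the begin / N / ... / end layout from the sample data posted earlier in the thread, with the ref lines outside the blocks.

```perl
use strict;
use warnings;

# Hypothetical helper: the same sub serves both passes.
# Pass 1 only fills %$ignore; pass 2 also collects the lines to keep.
# Assumes each block is: "begin", a number line, body lines, "end".
sub scan {
    my ($fh, $pass, $ignore) = @_;
    my ($in_block, $num, @keep);
    while (my $line = <$fh>) {
        chomp $line;
        my $skip = 0;
        if (!$in_block) {
            if ($line eq 'begin') { $in_block = 1; undef $num }
            elsif ($line =~ /^\s*(?:foo\s+)?ref\s+(\d+)\s*$/) {
                # Only meaningful in pass 2, once %$ignore is filled.
                $skip = 1 if $ignore->{$1};
            }
        }
        elsif ($line eq 'end')                            { $in_block = 0 }
        elsif (!defined $num && $line =~ /^\s*(\d+)\s*$/) { $num = $1 }
        elsif (defined $num && $line =~ /bar/)            { $ignore->{$num} = 1 }
        push @keep, $line if $pass == 2 && !$skip;
    }
    return @keep;
}

# Demo on an in-memory filehandle; a real script would open a file instead.
my $data = "ref 1\nref 2\nbegin\n1\nend\nbegin\n2\nbar\nend\n";
open my $fh, '<', \$data or die "in-memory open: $!";

my %ignore;
scan($fh, 1, \%ignore);              # first pass: find blocks containing "bar"
seek $fh, 0, 0 or die "seek: $!";    # rewind for the second pass
my @out = scan($fh, 2, \%ignore);    # second pass: drop the matching ref lines
close $fh;

print "$_\n" for @out;
```

With that sample, only the "ref 2" line is dropped, since block 2 is the only one containing "bar". Collecting the kept lines in an array is just to keep the demo short; for large files the second pass would print each line as it goes.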

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of