Re: some efficiency, please

Re^2: some efficiency, please by Anonymous Monk on Apr 12, 2019 at 15:37 UTC
Sorry, I am trying to simplify things, and am likely making them more complicated. :( I think I can provide some sample data more easily than I can change the code to make it work on the sample. This is what the input would look like:

    ref 1
    ref 2
    ref 3
    ref 4
    begin
    1
    end
    begin
    2
    bar
    end
    begin
    3
    end
    begin
    4
    bar
    end
    ref 5
    foo ref 6
    ref 7
    begin
    5
    end
    begin
    6
    bar
    begin
    7
    bar
    end

So I am trying to remove only the "ref n" lines (not the n lines themselves), and only for paragraphs where "bar" appears in the paragraph. The output should look like this:

    ref 1
    ref 3
    begin
    1
    end
    begin
    2
    bar
    end
    begin
    3
    end
    begin
    4
    bar
    end
    ref 5
    begin
    5
    end
    begin
    6
    bar
    begin
    7
    bar
    end

So I do (think I) need to pass through the file twice - once to find the references I want to remove, and once to actually remove them.
Re^3: some efficiency, please (updated) by haukex (Archbishop) on Apr 12, 2019 at 16:19 UTC
"So I do (think I) need to pass through the file twice - once to find the references I want to remove, and once to actually remove them."

I took that as a challenge `;-)` This needs only a single pass: it reverses both the input and the output by piping them through `tac`, and it produces your desired output:

    use warnings;
    use strict;

    die "Usage: $0 INFILE\n" unless @ARGV==1;
    my $INFILE = shift @ARGV;

    # Read the file backwards and write the output backwards again
    open my $ofh, '|-', 'tac' or die "tac (out): $!";
    open my $ifh, '-|', 'tac', $INFILE or die "tac $INFILE: $!";

    my ($aminblock,$prevnum,$foundstr);
    my %found;
    while (<$ifh>) {
        chomp;
        my $out=1;
        if (!$aminblock) {
            # In reversed input, "end" opens a block
            if (/^end$/) { undef $foundstr; $aminblock=1 }
            elsif (/^\s*(?:foo\s+)?ref\s+(\d+)\s*$/) {
                die "ref $1 without block?" unless exists $found{$1};
                $out = !$found{$1};
            }
            else { die "unexpected outside of a block: $_" }
        }
        else {
            if (/^\s*(\d+)\s*$/) { $prevnum=$1 }
            elsif (/^begin$/) {
                # The line seen just before "begin" must be the block number
                die "block ended without number?" unless defined $prevnum;
                $found{$prevnum} = $foundstr;
                undef $prevnum;
                $aminblock=0;
            }
            else { undef $prevnum; if (/bar/) { $foundstr=1 } }
        }
        print {$ofh} $_, "\n" if $out;
    }
    close $ifh or die "tac $INFILE: ".($!||"\$?=$?");
    close $ofh or die "tac (out): ".($!||"\$?=$?");

Although the two passes through `tac` might actually make that less efficient for large files.
Here's a two-pass version:

    use warnings;
    use strict;

    die "Usage: $0 INFILE\n" unless @ARGV==1;
    my $INFILE = shift @ARGV;

    use constant { STATE_IDLE=>0, STATE_BEGIN=>1, STATE_INBLOCK=>2 };

    open my $fh, '<', $INFILE or die "$INFILE: $!";
    my %found;
    my $state = STATE_IDLE;
    my $curnum;
    # Pass 1 fills %found, pass 2 prints
    for my $pass (1..2) {
        while (<$fh>) {
            chomp;
            my $out = 1;
            if ($state==STATE_IDLE) {
                if (/^\s*(?:foo\s+)?ref\s+(\d+)\s*$/) { $out=!$found{$1} }
                elsif (/^begin$/) { $state=STATE_BEGIN }
                else { die "unexpected in state $state: $_" }
            }
            elsif ($state==STATE_BEGIN) {
                if (/^\s*(\d+)\s*$/) { $curnum=$1; $state=STATE_INBLOCK }
                else { die "unexpected in state $state: $_" }
            }
            elsif ($state==STATE_INBLOCK) {
                if (/^end$/) { $state=STATE_IDLE }
                elsif (/bar/) { $found{$curnum}=1 }
            }
            else { die "bad state $state" }
            print $_, "\n" if $pass==2 && $out;
        }
        die "unexpected state at eof: $state" unless $state==STATE_IDLE;
        # Rewind for the second pass
        seek $fh, 0, 0 or die "seek $INFILE: $!";
    }
    close $fh;

Update: Note that these solutions don't remove `ref N` lines if they appear inside `begin...end` blocks; this was an assumption I made, but it's actually unclear what the desired behavior is in that case.
Re^3: some efficiency, please by Anonymous Monk on Apr 12, 2019 at 15:53 UTC
Oops, left out an end line in the example data. `begin 6 bar` SHOULD BE: `begin 6 bar end`
Re^4: some efficiency, please by haukex (Archbishop) on Apr 12, 2019 at 16:21 UTC
"Oops, left out an end line in the example data."

If you register an account, you can edit your posts!
Re^2: some efficiency, please by Anonymous Monk on Apr 12, 2019 at 15:51 UTC
This should better reflect what I am actually trying to do (assuming I didn't make any errors).

    #!/usr/bin/perl -w
    use strict;

    local $/=undef;   # slurp mode: read the whole file at once
    my @objects;

    # check for basic syntax
    if ($#ARGV < 0) {
        die "Usage: program.pl file.text\n";
    }

    # one begin...end paragraph; $2 is the number on the line after "begin"
    my $rgxpar = qr{(^begin\n(\d+)\n.*?^end$)}mos;

    open (FILNAM, '<', $ARGV[0]) or die "Can't open $ARGV[0] for reading.\n";
    my $allfile = <FILNAM>;
    close FILNAM or die "Can't close $ARGV[0] for reading.\n";

    # collect the "ref n" strings of all paragraphs that contain "bar"
    while ($allfile =~ /$rgxpar/g) {
        my $objectref = 'ref ' . $2;
        if ($1 =~ /bar/ ) {
            push (@objects, $objectref);
        }
    }

    # remove each collected reference line, with or without a leading "foo"
    for (@objects) {
        $allfile =~ s/^ *(foo +)?$_\n//mn;
    }

    open ( OUTFIL, '>', "$ARGV[0].removed") or die "Can't open $ARGV[0].removed for writing.\n";
    print OUTFIL $allfile;
    close OUTFIL or die "Can't close $ARGV[0].removed for writing.\n";
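One thing that may matter for efficiency here: the loop over `@objects` rescans the whole string once per reference. A rough sketch of an alternative is to collect the numbers in a hash and run a single substitution that decides line by line. This assumes the layout of the sample data ("begin", then the number on its own line, "end" on its own line, and reference lines of the form "ref N" or "foo ref N"); `strip_refs` is a made-up helper name, not something from the thread, and unlike a state machine it would also drop a matching reference line that sits inside a block.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole file as one string and returns it
# with the "ref N" lines removed for every block N that contains "bar".
sub strip_refs {
    my ($text) = @_;
    my %drop;

    # Find each begin...end block; remember its number if the body has "bar".
    while ($text =~ /^begin\n(\d+)\n(.*?)^end$/msg) {
        # Copy the captures first: the inner match below would reset $1.
        my ($num, $body) = ($1, $2);
        $drop{$num} = 1 if $body =~ /bar/;
    }

    # Single substitution pass: $1 is the whole reference line, $2 its number.
    $text =~ s/^([ ]*(?:foo[ ]+)?ref[ ]+(\d+)\n)/$drop{$2} ? '' : $1/meg;
    return $text;
}

# Small demo: block 2 contains "bar", so only the "ref 2" line disappears.
print strip_refs("ref 1\nref 2\nbegin\n1\nend\nbegin\n2\nbar\nend\n");
```

The hash lookup keeps the cost of the removal step independent of how many references are collected, though the whole file still has to fit in memory.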
Re^3: some efficiency, please by QM (Parson) on Apr 12, 2019 at 16:19 UTC
If the files are very large, you'll spend more time disk swapping than actually reading/writing. Make 2 passes: record all of the "ref" numbers you want to delete in the first pass (use a hash), then reread the file, printing each line according to whether its ref value is in the hash. But to do this well, with multiline data, you'll have to tell us what a "paragraph" is, because it's not clear to me from your description.

It might look something like this:

    my %ignore;

    # First pass
    # some_condition() stands in for a match that captures the ref number
    while (<FH>) {
        $ignore{$1} = 1 if some_condition($_);
    }

    # Second pass
    # reset the file to the beginning
    seek FH, 0, 0;
    while (<FH>) {
        if (m/matches interesting string with (capture)/) {
            next if exists $ignore{$1};   # don't print this line
        }
        print;
    }

The trick, of course, is `some_condition`. If it's hard to put a single paragraph into a regex, just note the signposts with flags. Something like this for the first pass:

    my $in_paragraph;
    my $bar;
    my %ignore;

    while (<FH>) {
        if (m/start of paragraph/) {
            $in_paragraph = 1;
            $bar = 0;
            next;
        }
        if (m/end of paragraph/) {
            $in_paragraph = 0;
            next;
        }
        if (m/line with bar/) {
            $bar = 1;
            next;
        }
        if (m/line with ref (\d+)/) {
            # only count a ref seen inside a paragraph that has "bar"
            if ($in_paragraph and $bar) {
                $ignore{$1} = 1;
            }
            next;
        }
    }

And then something very similar to that in the 2nd pass, except printing or not printing based on your logic. (If you were very clever, you could reuse that code, with a tweak, passing a parameter for the pass number. But don't get clever until it works.)

-QM
--
Quantum Mechanics: The dreams stuff is made of
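For illustration, the "reuse that code, passing a parameter for the pass number" idea could be sketched roughly like this. It assumes the layout from the sample data posted earlier in the thread (reference lines outside the blocks, "begin" followed by the number on its own line, "end" on its own line); `scan` and `filter_refs` are made-up names, and the in-memory filehandle is only there so the example runs without a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: one scanning routine shared by both passes.
# Pass 1 only fills %$ignore; pass 2 also returns the lines to keep.
sub scan {
    my ($fh, $pass, $ignore) = @_;
    my ($in_block, $num, @kept);
    while (my $line = <$fh>) {
        chomp $line;
        my $print = 1;
        if (!$in_block) {
            if ($line eq 'begin') { $in_block = 1; undef $num }
            elsif ($line =~ /^\s*(?:foo\s+)?ref\s+(\d+)\s*$/) {
                $print = 0 if $ignore->{$1};
            }
        }
        elsif ($line eq 'end') { $in_block = 0 }
        # First numeric line after "begin" is taken as the block number.
        elsif (!defined $num && $line =~ /^\s*(\d+)\s*$/) { $num = $1 }
        elsif (defined $num && $line =~ /bar/) { $ignore->{$num} = 1 }
        push @kept, $line if $pass == 2 && $print;
    }
    return @kept;
}

# Hypothetical driver: runs both passes over the same handle.
sub filter_refs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "in-memory open: $!";
    my %ignore;
    scan($fh, 1, \%ignore);                 # first pass: collect numbers
    seek $fh, 0, 0 or die "seek: $!";       # rewind for the second pass
    my @kept = scan($fh, 2, \%ignore);      # second pass: filter
    close $fh;
    return join '', map { "$_\n" } @kept;
}

# Small demo: block 2 contains "bar", so only the "ref 2" line disappears.
print filter_refs("ref 1\nref 2\nbegin\n1\nend\nbegin\n2\nbar\nend\n");
```

With a real file you would open it once and pass the same handle to both calls; only the hash of numbers has to stay in memory between the passes.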