Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Immediately writing the results of search-and-replace

by haukex (Archbishop)
on Aug 05, 2022 at 18:29 UTC ( #11145972=note: print w/replies, xml ) Need Help??


in reply to Immediately writing the results of search-and-replace

I was under the impression that autoflushing on the currently open filehandle can be set using the $| variable, but if that's the case, why doesn't this have any affect?

No, not quite. The documentation of $| says: "If set to nonzero, forces a flush right away and after every write or print on the currently selected output channel. ... See select on how to select the output channel." So in your code, it's likely just changing the autoflush of the default output channel, STDOUT. Unless you're on an ancient Perl, the easiest way to turn on autoflushing for a handle is $filehandle->autoflush(1); (for the nitpickers out there: the "1" isn't strictly necessary but it makes the statement look more like the other IO::Handle methods).

However, from your problem description, I doubt your issue has to do with autoflushing at all!

For example, if my script finds that a file contains the search string 'IP whitelist' on a given line, even if I confirm that I want to replace this with 'IP access list', it will then prompt me to replace just 'whitelist' with one of its various candidates. What I would expect to happen in that case is that the replacement of 'IP whitelist' with 'IP access list' happens before the search for the 'whitelist' key is initiated, preventing it from finding a match there. ... Is refactoring this subroutine so that it loops over the rows of the tables first, and then over each file in the inner loop a better solution? It seems like a lot more IO to be opening and closing every file over and over for each term, but I'm not a real programmer by any stretch, so I could be way off the mark here.

Unfortunately, you haven't shown us a full example, so for example we can't see how @table is built or what its contents are - see Short, Self-Contained, Correct Example.

Since you're prompting the user for input, my tutorial Building Regex Alternations Dynamically isn't 100% applicable to your situation. However, I still suggest you read that node as it contains important advice applicable to this situation, for example regarding word boundaries (\b) or sorting your matches to match longer strings before shorter ones, which could be the cause of the issue you describe above. You might consider building a regex from all the potential search strings, and then building a hash where the search strings are the keys and the values are the possible replacements.

Another thing your code doesn't show is what edit_file is (again, SSCCE), but if it's opening the files and rewriting the files on every call, then yes, that's not particularly efficient. If the files you are editing will always be small enough to fit comfortably in memory, you may want to consider simply loading them as strings into memory and then editing them in memory, writing them back out when done.

If you were to show your efforts as a complete, runnable example with representative sample input and expected output for that input, I'm sure we could help more.

Replies are listed 'Best First'.
Re^2: Immediately writing the results of search-and-replace
by elemkeh (Novice) on Aug 05, 2022 at 19:00 UTC
    Hi haukex; I've created an account so that I can edit things going forward, but since I cannot edit the original post, please forgive posting the full code in a reply:
    use strict; use warnings; use diagnostics; use Text::CSV_XS; use File::Find; use File::Slurp qw(edit_file); use List::Util qw(uniq); my $spreadsheet = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }); # Consume the first argument (the spreadsheet input) and push its # contents to an array of arrays. open my $fh, "<", $ARGV[0]; my @table; while (my $row = $spreadsheet->getline($fh)) { push @table, $row; } close $fh; # Sort the array-of-arrays by the length of the # first element of each subarray (our 'keys') such # that the longest key comes first, and the # shortest comes last. Then sort alphabetically. @table = sort { length $b->[0] <=> length $a->[0] || $a->[0] cmp $b->[ +0] } @table; # CSVs are received from another department with # three columns, where the second denotes part of # speech. This isn't needed for this operation, so # discard it. for my $row (@table) { splice(@$row,1,1); $row->[0] = quotemeta($row->[0]); } my @rows_to_cull = (); # Pairwise comparison of rows to detect duplicate keys. # Sort having been performed earlier, only pairwise comparisons # of adjacent rows will result in matches. for my $row_a (0..$#table) { my $key = $table[$row_a]->[0]; my @values = $table[$row_a]->[1]; for my $row_b ($row_a+1..$#table) { if ( $table[$row_a]->[0] eq $table[$row_b]->[0] ) { push @values, $table[$row_b]->[1]; $table[$row_a]->[1] = \@values; push @rows_to_cull, $row_b; } else { last; } } } # Convert the array to a hash to ensure uniqueness, # then back to an array for traversal, sorting the # array in descending numeric order so removal of # earlier rows doesn't interfere with indexing for # later rows. my %cull_hash = map { $_ => 1 } @rows_to_cull; my @cull_array = keys %cull_hash; @cull_array = sort { $b <=> $a } @cull_array; for (@cull_array) { splice(@table,$_,1); } # Loop which ensures the uniqueness of elements in the # replacement candidate arrays. for my $row (0..$#table) { my $replacement = $table[$row]->[1]; if (ref($replacement) eq 'ARRAY' && length $replacement > 1) { my @unique = uniq(@$replacement); $table[$row]->[1] = \@unique; } } # The following takes the second argument from the # command line as the directory to operate on. It # also identifies a filter subroutine to exclude # non-documentation files from the checks, and a # subroutine that will actually do the search-and-replace. find({ preprocess => \&filter, wanted => \&search_and_replace }, $ARGV[1] ); sub filter { return grep { -d or (-f and ( /\.txt$/ or /\.rst$/ or /\.yaml$/))} + @_; } # Main loop sub search_and_replace { open my $target_file, "<", $_; $| = 1; my $filename = $_; while (my $target_string = <$target_file>) { for my $row (@table) { my $search = $row->[0]; my $replacement = $row->[1]; if ((lc $target_string) =~ (lc $search)) { print "Found $search in $filename in the following context +:\n"; print "$target_string\n"; if (ref($replacement) eq 'ARRAY' && length $replacement > +1) { print "Choose a replacement candidate by typing the appr +opriate number, or else type '0' to skip.\n"; my $count = 1; for my $value (@$replacement) { print "$count\) $value\n"; $count++; } my $choice = <STDIN>; if ($choice >= 1 && $choice <= (length $replacement)) { my $replace_choice = $replacement->[$choice]; edit_file { s/$search/$replace_choice/gi} $filename; } else { print "Skipping this occurrence without replacing.\n"; } } else { print "Confirm replacement with $replacement by typing y +, or skip by typing any other key.\n"; my $choice = <STDIN>; if ($choice eq 'y' || 'Y') { edit_file { s/$search/$replacement/gi } $filename; } else { print "Skipping this occurrence without replacing.\n"; } } } } } close $target_file; }
    As for the input, here are some example rows from the CSV file itself, pre-sorted:
    whitelist entries,Noun,access list entries IP whitelist,Noun,IP access list IP whitelist entries,Noun,IP access list entries whitelist,Adjective,allow whitelist,Noun,access list whitelist,Noun,access-list whitelist,Verb,allow your whitelist,Noun,your access list
    As far as output goes, the sort should (and currently does) result in an entry like 'IP whitelist' being checked for before 'whitelist' itself, and 'IP whitelist entries' before 'IP whitelist; so I would expect that after I confirm wanting to replace 'IP whitelist entries' with 'IP access list entries', I would NOT be subsequently prompted to replace 'IP whitelist' within that instance of 'IP whitelist entries', because 'whitelist' would have already been replaced there. Similarly, I wouldn't want to be prompted to replace 'whitelist' alone in that spot for the same reason. Please let me know what more might be needed. I have started looking at the tutorial you linked and it seems promising.

    EDIT: Corrected formatting errors due to copy-pasting code block with long lines.

      Thanks for providing additional context.

      if my script finds that a file contains the search string 'IP whitelist' on a given line, even if I confirm that I want to replace this with 'IP access list', it will then prompt me to replace just 'whitelist' with one of its various candidates. What I would expect to happen in that case is that the replacement of 'IP whitelist' with 'IP access list' happens before the search for the 'whitelist' key is initiated, preventing it from finding a match there.

      What is actually going on (and what I didn't realize on my first reading) is that you're reading a line from the input file into the $target_string variable, but once that line is in that variable, it never changes, regardless of whether you edit the file it came from, since the line was copied from the file into memory. That's why the regex engine continues to find matches in $target_string even after you've made the edit in the file.

      The common pattern to use in Perl instead is to write an output file with the modified lines while reading the input file. Since in your code, you seem to want to do in-place edits of the file, see my node on that here. The slurp-then-spew approach is fairly easy if your files will always fit comfortably into memory; for line-by-line editing I might recommend my own module File::Replace (for examples of line-by-line editing, see the module's documentation instead of the aforementioned node).

      In other words, instead of your edit_file calls, you'd do something like $target_string =~ s/$search/$replace_choice/gi, and then build your output file by writing out each $target_string line. However, there is still more trickyness here: If you have a line such as "Edit the whitelist by pressing the edit whitelist button." or something like that, then because of the /g modifier on that regex, you'd only be presented with the choice for what term to replace "whilelist" with once, and that replacement would be carried out for the entire line. I kind of doubt that's a desired behavior for you.

      BTW, are you sure your search terms will always be on one line, or could strings like "whitelist entries" be split on two lines? That would also require an advanced approach.

      These are examples for reasons why I asked for representative sample input along with the expected output for that input, and in this case representative also means that it should include all of such "interesting" cases. While I'd like to provide sample code or at least more specific advice on how to fix your code, without knowing the answers to questions like these, I run the risk of writing something that ends up not working on your actual data. Extensive test cases are very useful!

      What I am currently thinking is that it should be possible to do something like s{ ($search) }{ get_replacement($1) }xeg where sub get_replacement does the prompting of the user and returns the replacement term (or the original string if the user doesn't want to replace). $search can even be a regex of all of the search terms, built dynamically.

      A few more things I noticed in your code:

      • Check your opens for errors - "open" Best Practices
      • Even though you're using File::Find, for better modularity I would use regular argument passing to sub search_and_replace, and write e.g. wanted => sub { search_and_replace(\@table, $_) }
      • It seems to me like your @table could be replaced by a hash instead of all the duplicate key detection code you currently have. Keys in Perl hashes are unique, and such tables are easy to build with Perl's autovivification - if I treat a nonexistent hash entry as if it were an array reference, it will automagically become an array reference. Note that since you want to do case-insenstive matching I'm using fc, which is better than lc in case of Unicode. Building a hash of arrays using your sample data:
        use warnings; use strict; use feature qw/fc/; use Text::CSV_XS; use List::Util qw/uniq/; use Data::Dump; my $csvfile = $ARGV[0]; my %table; open my $fh, '<', $csvfile or die "$csvfile: $!"; my $csv = Text::CSV_XS->new({binary=>1, auto_diag=>2}); while ( my $row = $csv->getline($fh) ) { push @{$table{ fc($row->[0]) }}, $row->[2]; } $csv->eof or $csv->error_diag; close $fh; $_ = [uniq @$_] for values %table; dd \%table; __END__ { "ip whitelist" => ["IP access list"], "ip whitelist entries" => ["IP access list entries"], "whitelist" => ["allow", "access list", "access-list"], "whitelist entries" => ["access list entries"], "your whitelist" => ["your access list"], }
      • As before, instead of looping over @tables (or %tables), building a single regex should be more efficient.
      • You may wish to consider modules for prompting instead of rolling your own.
      • (It's Friday evening here so I didn't scan the rest of your code in detail, so other Monks may have further tips.)
        Just wanted to make sure I took the time to thank you for your help. You've given me plenty to chew on and improve my Perl with. Cheers!

      G'day elemkeh,

      Welcome to the Monastery.

      The way you've handled transitioning from anonymous to real user is fine: no need to apologise. There is, however, a small problem.

      Long lines within <code>...</code> tags are wrapped. The continuation is highlighted with a red (default) '+'. You've copied code which has four wrapped lines and these now appear as actual code:

      • +:\n";
      • +1) {
      • +opriate number, or else type '0' to skip.\n";
      • +, or skip by typing any other key.\n";

      For future reference, follow the [download] link to a plain text version; then copy and paste from there.

      As you rightly state, being logged in, you can edit your post and make four minor changes to fix this. Do leave a note indicating that you've made a change — "How do I change/delete my post?" has more about that.

      — Ken

      Some general observations on the code:

      • if (ref($replacement) eq 'ARRAY' && length $replacement > 1) { ... }

        length always operates on a string or stringized expression. A stringized reference looks something like "ARRAY(0x123abc)", so the expression length $replacement > 1 will always be true if $replacement is a reference.

        Win8 Strawberry 5.8.9.5 (32) Fri 08/05/2022 17:16:08 C:\@Work\Perl\monks >perl use strict; use warnings; my $arrayref = [ ]; # ref. to zero length array print "$arrayref \n"; print "length non-zero \n" if length $arrayref > 1; ^Z ARRAY(0x613824) length non-zero
        The number of elements of an array is found by evaluating the array in scalar context (update: see Context and subsequent topics in perldata) or, in newer Perls (5.12+), by the keys built-in evaluated in scalar context (check your documentation).
        Win8 Strawberry 5.30.3.1 (64) Fri 08/05/2022 17:39:54 C:\@Work\Perl\monks >perl use strict; use warnings; my $arrayref = [ 9, 8, ]; # ref. to NON-zero length array printf "number of elements of referenced array: %d \n", scalar @$ +arrayref; printf "number of elements of referenced array: %d \n", 0 + @$ +arrayref; printf "number of elements of referenced array: %d \n", scalar keys @$ +arrayref; print "number of elements > 1 \n" if @$arrayref > 1; print "number of elements > 1 \n" if keys @$arrayref > 1; ^Z number of elements of referenced array: 2 number of elements of referenced array: 2 number of elements of referenced array: 2 number of elements > 1 number of elements > 1
        (Update: Just to be clear, a proper re-statement of the condition expression would be
            if (ref($replacement) eq 'ARRAY' && @$replacement > 1) { ... }
        The > numeric comparison imposes scalar context.)
      • if ($choice eq 'y' || 'Y') { ... }

        This condition evaluates as "if $choice is 'y' or if the string 'Y' is true", i.e., it's always true because 'Y' is always true. See What is true and false in Perl?

        An effective case-insensitive conditional test would be
            if (lc($choice) eq 'y') { ... }


      Give a man a fish:  <%-{-{-{-<

        Thank you for the feedback. I will do my best to internalize it and write better Perl going forward.

      Ok, I'm confused. Do you want to rewrite the entire file after each change (of course with the changed text) and then reload the file to do the next replace?
      Or do you just want to do a replace at a time (holding the text in memory) and only rewrite the entire file when you are done?

        What I am chiefly concerned with is that after matching on and replacing an instance of, say, 'IP whitelist', the user is not prompted to replace that same instance of the word 'whitelist' because it hasn't actually been edited yet. The impression I get from reading some of what's written here is that this might require write-and-reload, but I must admit that I don't know enough about Perl yet to feel confident in my interpretations. Sorry I can't clarify more than that.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11145972]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (2)
As of 2022-09-25 21:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    I prefer my indexes to start at:




    Results (116 votes). Check out past polls.

    Notices?