In place replacement from reference list

Misstre has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: In place replacement from reference list by hv (Prior) on Sep 06, 2022 at 17:45 UTC
So it would read the folder paths into an array, check each line of the full file and replace the value with the reference value if found and write it back to the file. That sounds sensible, as long as the list of folder paths isn't too big. An important gotcha to consider: is any valid path a prefix of another valid path? If both "/foo" and "/foob" are valid, then "/foobar" could be either "/foo" + "bar" or "/foob" + "ar", and you will need some other heuristic to help decide between those interpretations. (A particularly likely case is that both "/foo" and "/foo/baz" are valid, but that this case can still be decided since the filename cannot contain "/".) If neither of those are problems, your example code looks close to something that would work; however you probably need to remove the "\b" at either end of the match - you need to match 'CMD="/unix/path/to/folder/is/here' followed by 'file_name.sh"'. If you can include the additional "CMD=" prefix as part of the pattern, that would also help both to ensure it doesn't change anything it shouldn't and to let the regexp run faster: `$regex = qr{ CMD=" (?: $references ) }x`. I would also recommend showing the input line in a warning if you have a CMD attribute but the regexp does not match it: if the program misses something, it's usually a lot easier to correct the program and run it again over the original file than to attempt multiple successive fixups.	[reply] [d/l]
Re^2: In place replacement from reference list by Polyglot (Chaplain) on Sep 07, 2022 at 05:44 UTC
To solve something like this, in addition to the error-checking and informing of cases where ambiguity may exist, I would gravitate toward reverse sorting the directory list by length, then parsing/matching for the longest possible strings first. This would make the directory tree match to its fullest extent possible. If there are no ambiguities possible, given the nomenclature of the directories and files, then this may be an unnecessary step. Blessings, ~Polyglot~	[reply]
Re: In place replacement from reference list by GrandFather (Saint) on Sep 06, 2022 at 21:34 UTC
Do the files exist in the given paths on the machine running the code? If so, you can validate each "edit". I'd alter the regex creation slightly: `use strict; use warnings; my $paths = <<PATHS; /unix/path/to/folder/is/here /unix/path/to/folder/is/somewhere /unix/path/to/folder/is/herewith PATHS open my $file_paths, '<', \$paths; my $references = join '\|', map {chomp; qr/\Q$_\E/} sort {length($b) <=> length($a)} <$file_paths>; print join "\n", split ('\\|', $references), ''; for my $test ( "/unix/path/to/folder/is/herewitha_file.txt", "/unix/path/to/folder/is/herea_file.txt" ) { $test =~ /($references)/; print "$1\n"; }` [download] Prints: `(?^:\/unix\/path\/to\/folder\/is\/somewhere) (?^:\/unix\/path\/to\/folder\/is\/herewith) (?^:\/unix\/path\/to\/folder\/is\/here) /unix/path/to/folder/is/herewith /unix/path/to/folder/is/here` [download] which ensures longer matches are made first. In the example code commenting out the sort results in: `(?^:\/unix\/path\/to\/folder\/is\/here) (?^:\/unix\/path\/to\/folder\/is\/somewhere) (?^:\/unix\/path\/to\/folder\/is\/herewith) /unix/path/to/folder/is/here /unix/path/to/folder/is/here` [download] Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond	[reply] [d/l] [select]
Re: In place replacement from reference list (updated) by AnomalousMonk (Archbishop) on Sep 06, 2022 at 21:50 UTC
Here's another example based on hv's suggestion: Win8 Strawberry 5.30.3.1 (64) Tue 09/06/2022 17:27:14 C:\@Work\Perl\monks >perl use 5.010; # need \K regex extension use strict; use warnings; use autodie; use Data::Dump qw(dd); # for debug my $file_paths = <<'EOFILEPATHS'; # known file paths /unix/path/to/folder/is/here /unix/path/to/folder/also /unix/path/to/another/folder EOFILEPATHS open my $fh_reference_paths, '<', \$file_paths; my ($rx_reference) = map qr{ $_ }xms, join ' \| ', map quotemeta, reverse sort map { chomp; $_; } <$fh_reference_paths> ; # dd '$rx_reference', $rx_reference; # for debug close $fh_reference_paths; my $b0rken = <<'EOBORKEN'; # paths with appended files, some borked CMD="/unix/path/to/folder/is/herefile_name.sh" #CMD="/unix/path/to/folder/is/herefile_name.sh" XYZ="/unix/path/to/folder/is/herefile_name.sh" XYZ="/unix/path/to/folder/is/here/file_name.ok" CMD = "/unix/path/to/another/folder/some_file.ok" CMD="/unix/path/to/folder/is/here/file_name.ok" CMD = "/unix/path/to/folder/alsosome_file.xy" EOBORKEN open my $fh_b0rken, '<', \$b0rken; my $add = '//'; # make addition evident for development BORKEN: while (<$fh_b0rken>) { $_ =~ s{ \A \s* CMD \s* = \s* " $rx_reference (?! /) \K }{$add}xms +; print $_; } close $fh_b0rken; ^Z CMD="/unix/path/to/folder/is/here//file_name.sh" #CMD="/unix/path/to/folder/is/herefile_name.sh" XYZ="/unix/path/to/folder/is/herefile_name.sh" XYZ="/unix/path/to/folder/is/here/file_name.ok" CMD = "/unix/path/to/another/folder/some_file.ok" CMD="/unix/path/to/folder/is/here/file_name.ok" CMD = "/unix/path/to/folder/also//some_file.xy" [download] This still has the problem noted by hv of mishandling ambiguous path-filenames. Given paths `/unix/path/foo` `/unix/path/foob` how does one properly "correct" `/unix/path/foobar.sh`? The regex is designed to match the longest path (`/unix/path/foob`) in this case. See Building Regex Alternations Dynamically. (Of course, this problem goes away completely if one also knows all possible filenames that might be involved.) Update: One approach to the ambiguous path-filename problem would be to capture all matches of the `$rx_reference` regex and report (warn or die) if there's more than one possibility. (If there's just one possibility, just fix it.) (This code has an interaction between regex `(?{ CODE })` and a lexical variable, so it must use Perl version 5.18+. ~~If your Perl is earlier, let me know and I can supply a simple fix.~~ I'll just put the earlier version in as a comment. Also, use `(?!)` vice `(FAIL)` pre-5.10.) my @ambiguous; BORKEN: while (<$fh_b0rken>) { my @b0rken; # can use with perl version 5.18+ # local our @b0rken; use re 'eval'; # pre-5.18 version $_ =~ m{ \A (\s CMD \s* = \s* " $rx_reference) ((?! /) [^\n]) \n? \z (?{ push @b0rken, [ $1, $2 ] }) (FAIL) # (?!) pre-5.10 }xms; if (@b0rken > 1) { # more than 1 possible breakage: report ambigu +ity chomp; push @ambiguous, join "\n", "'$_' ambiguous:", map " '$_->[0]'?'$_->[1]'", @b0rken; } elsif (@b0rken == 1) { # just 1 breakage: fix and output print "$b0rken[0]->[0]$add$b0rken[0]->[1]\n"; } else { # path looks ok: just output print; } } # end while BORKEN ... warn $_ for @ambiguous; # report ambiguities [download] Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: In place replacement from reference list by kcott (Archbishop) on Sep 07, 2022 at 12:11 UTC
G'day Misstre, Welcome to the Monastery. I see your own tentative solution, and all that follow, use regexes. Perl's string handling functions are typically faster than regexes. Depending on how many "thousands of these mistakes" there are, this might make a difference. Here's a solution that doesn't use any regexes. #!/usr/bin/env perl use strict; use warnings; use autodie; use File::Copy; my $ref_file = 'ref.txt'; my $full_file = 'full.txt'; my $bu_file = "$full_file.BU"; #--------------------------------------------- # TODO - for demo only; remove for production copy('original_full.txt', $full_file); #--------------------------------------------- copy($full_file, $bu_file); my %ref_paths; _get_ref_paths($ref_file, \%ref_paths); { open my $ifh, '<', $bu_file; open my $ofh, '>', $full_file; while (<$ifh>) { chomp; my $cmd = substr $_, 5, -1; my @possibles = @{_assess_full_path($cmd, \%ref_paths)}; if (@possibles == 1) { $ofh->print(qq{CMD="$possibles[0]"\n}); } elsif (@possibles > 1) { $ofh->print(qq{QRY($.)="$_"\n}) for @possibles; } else { $ofh->print(qq{WTF($.)="$cmd"\n}); } } } #--------------------------------------------- # TODO - for demo only; remove for production print "\n* ref file: '$ref_file'\n"; system cat => $ref_file; print "\n* bu file: '$bu_file'\n"; system cat => $bu_file; print "\n* full file: '$full_file'\n"; system cat => $full_file; #--------------------------------------------- sub _assess_full_path { my ($cmd, $ref_paths) = @_; my $possibles = []; my $pos = 1 + rindex $cmd, '/'; my $start = substr $cmd, 0, $pos; my $end = substr $cmd, $pos; my $max = substr $cmd, 0, rindex($cmd, '.') - 1; if (exists $ref_paths->{$start}) { for my $key (keys %{$ref_paths->{$start}}) { my $dir = "$start$key"; if (0 == index $max, $dir) { my $full_path = join '/', $dir, substr $cmd, length $dir; $full_path =~ y{/}{/}s; push @$possibles, $full_path; } } } return $possibles; } sub _get_ref_paths { my ($ref_file, $ref_paths) = @_; open my $fh, '<', $ref_file; while (<$fh>) { chomp; my $end = substr $_, rindex($_, '/') + 1; substr $_, rindex($_, '/') + 1, length($_), ''; $ref_paths->{$_}{$end} = 1; $ref_paths->{"$_$end/"}{''} = 1; } return; } [download] I dummied up some files to test this. Here's a sample run's output: * ref file: 'ref.txt' /a /a/b /a/b/c /b /b/c /c /ab /abc /abcd * bu file: 'full.txt.BU' CMD="/a/a.sh" CMD="/aa.sh" CMD="/ab.sh" CMD="/abc.sh" CMD="/a/bc.sh" CMD="/a/b/c.sh" CMD="/a/b/c/.sh" CMD="/a/b/cd.sh" CMD="/a/b/c/d.sh" CMD="/x/y.z" CMD="/a/xyz.sh" CMD="/abcd.sh" CMD="/a/very 'special' command.exe" * full file: 'full.txt' CMD="/a/a.sh" CMD="/a/a.sh" CMD="/a/b.sh" QRY(4)="/a/bc.sh" QRY(4)="/ab/c.sh" QRY(5)="/a/b/c.sh" QRY(5)="/a/bc.sh" CMD="/a/b/c.sh" WTF(7)="/a/b/c/.sh" QRY(8)="/a/b/cd.sh" QRY(8)="/a/b/c/d.sh" CMD="/a/b/c/d.sh" WTF(10)="/x/y.z" CMD="/a/xyz.sh" QRY(12)="/a/bcd.sh" QRY(12)="/abc/d.sh" QRY(12)="/ab/cd.sh" CMD="/a/very 'special' command.exe" [download] Notes: You asked about "In place replacement". This is possible using, for example, the core Tie::File module. However, I recommend that you make a backup copy: that not only acts as a safety net, but also can be used as a readonly source from which to create your "fixed" full file. This is what I did. My '`original_full.txt`' is identical to the '`full.txt.BU`' shown above; it allows multiple demo and test runs, using the same data, without needing any manual intervention (do note the comment: "remove for production"). Your programs should always use the strict and warnings pragmata. The autodie pragma saves you a lot of tedious and error-prone work; let Perl handle I/O exceptions for you — I use this a lot and highly recommend it. File::Copy is a core module: you'll have it already; no installation from CPAN required. It's pretty straightforward; I've only used its `copy()` function. You talked about a "reference file" that you had but didn't show an example. I just dummied one up (`ref.txt`) for my use; replace with your version. The output to '`full.txt`' has three types: `CMD` (Command) - an unambiguous solution was found and is written in the original `CMD="..."` format. `QRY` (Query) - more than one potential solution was found; the line number of the original file is shown; requires a decision and manual intervention. `WTF` (What's This File) - some form of bogus input was detected; the line number of the original file is shown. You may want to delete this, update the reference file, or take some other action; regardless, manual intervention is required here also. Overall, the code is very straightforward and should run on whatever version of Perl you have. (For future reference, if you tell us what version you're running, there may be better solutions using more up-to-date features.) All of the functions and operators that I've used can be found in the "Perl Online Documentation"; if you get stumped on anything, just ask. — Ken	[reply] [d/l] [select]
Re^2: In place replacement from reference list by LanX (Saint) on Sep 07, 2022 at 12:59 UTC
> Perl's string handling functions are typically faster than regexes. but in this case you can `or` all possible paths in a regex, `=~ m/^$path1\|$path2\|...etc/` and because of automatic Trie optimization this will be significantly faster than checking in a loop. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^3: In place replacement from reference list by AnomalousMonk (Archbishop) on Sep 07, 2022 at 20:01 UTC
`=~ m/^$path1\|$path2\|...etc/` I know LanX and kcott understand this, but here's a general side note. In a regex expression like the one quoted above, the `^` anchor is associated only with the first alternation, i.e., `^$path1`. None of the other alternations are anchored. The "precedence" of Perl ordered alternation is very low. This applies generally, so in `$str =~ m/ a b c \| d \| e \| f g h /x;` the regex pattern "atoms" `a b c` comprise the first possible alternation, then `d` if the first alternation cannot match, then `e`, then the `f g h` sequence. Use grouping, typically non-capturing, to disambiguate precedence. E.g., in `$str =~ m/ a b (?: c \| d e \| ... \| etc) f g /x;` the sequence `a b` is required for a match, then the first of `c` or `d e` or `...` or `etc`, then the required `f g` sequence. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^4: In place replacement from reference list by LanX (Saint) on Sep 07, 2022 at 21:00 UTC
Re^3: In place replacement from reference list by kcott (Archbishop) on Sep 07, 2022 at 17:29 UTC
"... in this case ... `=~ m/^$path1\|$path2\|...etc/` ... significantly faster than checking in a loop." That's a use of alternation with which I'm unfamiliar. Please enlighten me as to how `=~ m/^$path1\|$path2\|...etc/` could be used to generate groups of multiple `QRY...` lines, for the same input line, without a loop. E.g. `QRY(4)="/a/bc.sh" QRY(4)="/ab/c.sh"` [download] — Ken	[reply] [d/l] [select]
Re^4: In place replacement from reference list by LanX (Saint) on Sep 09, 2022 at 01:27 UTC
Re: In place replacement from reference list by LanX (Saint) on Sep 06, 2022 at 23:24 UTC
this can't be solved exactly if you have ambiguous paths, like `/a/b/cd/e.sh /a/b/c/de.sh` [download] both will produce `/a/b/cde.sh` But which one is correct? You will need to check all possible included paths and then if the "remaining" file exists. Once you have multiple hits, you'll need to use heuristics. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re: In place replacement from reference list by Anonymous Monk on Sep 08, 2022 at 07:45 UTC
Hi, You have been given great advice by much cleverer people than me. My two penneth is to make two copies of the orlriginal file and keep them in different safe places until the task that you need the corrected file for has been completed and forgotten. I am a stupid man, made slightly wiser by bitter experience, it's best to be prepared for another try. J.C.	[reply]