Re: In place replacement from reference list

Welcome to the Monastery.

I see your own tentative solution, and all that follow, use regexes. Perl's string handling functions are typically faster than regexes. Depending on how many "thousands of these mistakes" there are, this might make a difference. Here's a solution that doesn't use any regexes.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use File::Copy;

my $ref_file = 'ref.txt';
my $full_file = 'full.txt';
my $bu_file = "$full_file.BU";
#---------------------------------------------
# TODO - for demo only; remove for production
copy('original_full.txt', $full_file);
#---------------------------------------------
copy($full_file, $bu_file);

my %ref_paths;

_get_ref_paths($ref_file, \%ref_paths);

{
    open my $ifh, '<', $bu_file;
    open my $ofh, '>', $full_file;

    while (<$ifh>) {
        chomp;
        my $cmd = substr $_, 5, -1;
        my @possibles
            = @{_assess_full_path($cmd, \%ref_paths)};

        if (@possibles == 1) {
            $ofh->print(qq{CMD="$possibles[0]"\n});
        }
        elsif (@possibles > 1) {
            $ofh->print(qq{QRY($.)="$_"\n}) for @possibles;
        }
        else {
            $ofh->print(qq{WTF($.)="$cmd"\n});
        }
    }
}

#---------------------------------------------
# TODO - for demo only; remove for production
print "\n*** ref file: '$ref_file'\n";
system cat => $ref_file;
print "\n*** bu file: '$bu_file'\n";
system cat => $bu_file;
print "\n*** full file: '$full_file'\n";
system cat => $full_file;
#---------------------------------------------

sub _assess_full_path {
    my ($cmd, $ref_paths) = @_;

    my $possibles = [];

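    # Split at the last '/': $start is the directory part, trailing '/' included.
    my $pos = 1 + rindex $cmd, '/';
    my $start = substr $cmd, 0, $pos;
    # $max stops one character short of the last '.', so any candidate
    # directory must leave at least one character of file name before it.
    my $max = substr $cmd, 0, rindex($cmd, '.') - 1;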

    if (exists $ref_paths->{$start}) {
        for my $key (keys %{$ref_paths->{$start}}) {
            my $dir = "$start$key";

            if (0 == index $max, $dir) {
                my $full_path
                    = join '/', $dir, substr $cmd, length $dir;
                $full_path =~ y{/}{/}s;
                push @$possibles, $full_path;
            }
        }
    }

    return $possibles;
}

sub _get_ref_paths {
    my ($ref_file, $ref_paths) = @_;

    open my $fh, '<', $ref_file;

    while (<$fh>) {
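        chomp;
        # Split e.g. '/a/b' into parent '/a/' (left in $_) and last component 'b'.
        my $end = substr $_, rindex($_, '/') + 1;
        substr $_, rindex($_, '/') + 1, length($_), '';
        $ref_paths->{$_}{$end} = 1;
        # The directory itself, with a trailing '/', is also a valid start.
        $ref_paths->{"$_$end/"}{''} = 1;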
    }

    return;
}
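
To make the lookup structure concrete, here is a small sketch that builds the same kind of hash from an in-memory list rather than from 'ref.txt'. The helper name build_ref_paths and the three sample directories are my own inventions for illustration; the splitting logic is meant to mirror _get_ref_paths, using only rindex and substr.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (not in the script above): same idea as _get_ref_paths,
# but fed from a list so the resulting structure is easy to inspect.
sub build_ref_paths {
    my @dirs = @_;
    my %ref_paths;

    for my $dir (@dirs) {
        my $pos    = 1 + rindex $dir, '/';
        my $parent = substr $dir, 0, $pos;   # e.g. '/a/' for '/a/b'
        my $leaf   = substr $dir, $pos;      # e.g. 'b'   for '/a/b'

        $ref_paths{$parent}{$leaf} = 1;      # parent directory contains leaf
        $ref_paths{"$dir/"}{''}    = 1;      # the directory itself, with '/'
    }

    return \%ref_paths;
}

# Sample directories, assumed for the demo only.
my $ref_paths = build_ref_paths(qw{/a /a/b /ab});

for my $parent (sort keys %$ref_paths) {
    my @leaves = map { "'$_'" } sort keys %{ $ref_paths->{$parent} };
    print "$parent => @leaves\n";
}
```

Each key is a directory prefix ending in '/', and each inner key is a string that may legitimately follow it; the empty string marks the prefix as a complete directory in its own right.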

I dummied up some files to test this. Here's a sample run's output:

*** ref file: 'ref.txt'
/a
/a/b
/a/b/c
/b
/b/c
/c
/ab
/abc
/abcd

*** bu file: 'full.txt.BU'
CMD="/a/a.sh"
CMD="/aa.sh"
CMD="/ab.sh"
CMD="/abc.sh"
CMD="/a/bc.sh"
CMD="/a/b/c.sh"
CMD="/a/b/c/.sh"
CMD="/a/b/cd.sh"
CMD="/a/b/c/d.sh"
CMD="/x/y.z"
CMD="/a/xyz.sh"
CMD="/abcd.sh"
CMD="/a/very 'special' command.exe"

*** full file: 'full.txt'
CMD="/a/a.sh"
CMD="/a/a.sh"
CMD="/a/b.sh"
QRY(4)="/a/bc.sh"
QRY(4)="/ab/c.sh"
QRY(5)="/a/b/c.sh"
QRY(5)="/a/bc.sh"
CMD="/a/b/c.sh"
WTF(7)="/a/b/c/.sh"
QRY(8)="/a/b/cd.sh"
QRY(8)="/a/b/c/d.sh"
CMD="/a/b/c/d.sh"
WTF(10)="/x/y.z"
CMD="/a/xyz.sh"
QRY(12)="/a/bcd.sh"
QRY(12)="/abc/d.sh"
QRY(12)="/ab/cd.sh"
CMD="/a/very 'special' command.exe"
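
Once the rewritten file contains a mix of CMD, QRY and WTF lines, you will probably want a quick summary of what still needs attention. The following is only a sketch under that assumption: the four sample lines are copied from the output above, and in practice you would read them from 'full.txt'. It sticks to substr, in keeping with the no-regex approach.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Sample lines taken from the demo output; assumed here for illustration.
my @lines = (
    'CMD="/a/a.sh"',
    'QRY(4)="/a/bc.sh"',
    'QRY(4)="/ab/c.sh"',
    'WTF(7)="/a/b/c/.sh"',
);

my (%count, @todo);

for my $line (@lines) {
    my $tag = substr $line, 0, 3;          # 'CMD', 'QRY' or 'WTF'
    $count{$tag}++;
    push @todo, $line if $tag ne 'CMD';    # needs manual intervention
}

print "$_: $count{$_}\n" for sort keys %count;
print "To review:\n";
print "  $_\n" for @todo;
```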

Notes:

- You asked about "In place replacement". This is possible using, for example, the core Tie::File module. However, I recommend that you make a backup copy: it not only acts as a safety net, but can also be used as a read-only source from which to create your "fixed" full file. This is what I did.
- My 'original_full.txt' is identical to the 'full.txt.BU' shown above; it allows multiple demo and test runs, using the same data, without needing any manual intervention (do note the comment: "remove for production").
- Your programs should always use the strict and warnings pragmata. The autodie pragma saves you a lot of tedious and error-prone work by letting Perl handle I/O exceptions for you. I use this a lot and highly recommend it.
- File::Copy is a core module: you'll have it already; no installation from CPAN required. It's pretty straightforward; I've only used its copy() function.
- You talked about a "reference file" that you had but didn't show an example. I just dummied one up (ref.txt) for my use; replace it with your version.
- The output to 'full.txt' has three types:
  - CMD (Command) - an unambiguous solution was found and is written in the original CMD="..." format.
  - QRY (Query) - more than one potential solution was found; the line number of the original file is shown; requires a decision and manual intervention.
  - WTF (What's This File) - some form of bogus input was detected; the line number of the original file is shown. You may want to delete this, update the reference file, or take some other action; regardless, manual intervention is required here also.
- Overall, the code is very straightforward and should run on whatever version of Perl you have. (For future reference, if you tell us what version you're running, there may be better solutions using more up-to-date features.) All of the functions and operators that I've used can be found in the "Perl Online Documentation"; if you get stumped on anything, just ask.
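
For completeness, here is roughly what the Tie::File route mentioned in the notes could look like. Treat it as a sketch only: the temp file, its two lines and the hard-coded replacement are all invented for the demo, and it deliberately skips the backup step I recommend. Tie::File and File::Temp are both core modules.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use File::Temp qw{tempfile};
use Tie::File;

# Demo data only: write a small throwaway file to edit in place.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} qq{CMD="/aa.sh"\n}, qq{CMD="/a/xyz.sh"\n};
close $fh or die "Can't close '$filename': $!";

# Each element of @lines is one line of the file, without its newline.
tie my @lines, 'Tie::File', $filename
    or die "Can't tie '$filename': $!";

# Assigning to an element rewrites that line in the file itself.
for my $line (@lines) {
    $line = 'CMD="/a/a.sh"' if $line eq 'CMD="/aa.sh"';
}

untie @lines;

# Read the file back to show that it was changed on disk.
open my $check, '<', $filename or die "Can't open '$filename': $!";
my @result = <$check>;
close $check;

print @result;
```

There is no second copy of the data here, so a mistake in the replacement logic changes the only file you have; that is the main reason I prefer the backup-and-rewrite approach.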

— Ken


Re^2: In place replacement from reference list by LanX (Saint) on Sep 07, 2022 at 12:59 UTC
> Perl's string handling functions are typically faster than regexes.

but in this case you can `or` all possible paths in a regex, `=~ m/^$path1|$path2|...etc/`, and because of automatic Trie optimization this will be significantly faster than checking in a loop.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Re^3: In place replacement from reference list by AnomalousMonk (Archbishop) on Sep 07, 2022 at 20:01 UTC
`=~ m/^$path1|$path2|...etc/`

I know LanX and kcott understand this, but here's a general side note. In a regex expression like the one quoted above, the `^` anchor is associated only with the first alternation, i.e., `^$path1`. None of the other alternations are anchored. The "precedence" of Perl ordered alternation is very low.

This applies generally, so in `$str =~ m/ a b c | d | e | f g h /x;` the regex pattern "atoms" `a b c` comprise the first possible alternation, then `d` if the first alternation cannot match, then `e`, then the `f g h` sequence.

Use grouping, typically non-capturing, to disambiguate precedence. E.g., in `$str =~ m/ a b (?: c | d e | ... | etc) f g /x;` the sequence `a b` is required for a match, then the first of `c` or `d e` or `...` or `etc`, then the required `f g` sequence.

Give a man a fish: `<%-{-{-{-<`

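
A quick way to see the precedence point in action is to compare an ungrouped and a grouped pattern against a string where only an unanchored alternative can match. This is just an illustrative sketch; the two directories and the test string are made up.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

my @paths = ('/a', '/ab');                          # sample directories
my $alt   = join '|', map { quotemeta $_ } @paths;  # escape metacharacters

my $ungrouped = qr/^$alt/;       # '^' binds to the first alternative only
my $grouped   = qr/^(?:$alt)/;   # '^' applies to every alternative

my $cmd = '/x/ab.sh';

my $ungrouped_hit = $cmd =~ $ungrouped ? 1 : 0;
my $grouped_hit   = $cmd =~ $grouped   ? 1 : 0;

print "ungrouped: ", ($ungrouped_hit ? 'match' : 'no match'), "\n";
print "grouped:   ", ($grouped_hit   ? 'match' : 'no match'), "\n";
```

The ungrouped pattern matches because `/ab` is found in the middle of the string; the grouped one does not, because neither directory appears at the start.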
Re^4: In place replacement from reference list by LanX (Saint) on Sep 07, 2022 at 21:00 UTC
I didn't think about that; it wasn't meant to be production code, just an illustration.

Thanks for pointing that out! :)

Cheers Rolf

Re^3: In place replacement from reference list by kcott (Archbishop) on Sep 07, 2022 at 17:29 UTC
"... in this case ... `=~ m/^$path1|$path2|...etc/` ... significantly faster than checking in a loop."

That's a use of alternation with which I'm unfamiliar. Please enlighten me as to how `=~ m/^$path1|$path2|...etc/` could be used to generate groups of multiple `QRY...` lines, for the same input line, without a loop. E.g.

QRY(4)="/a/bc.sh"
QRY(4)="/ab/c.sh"

— Ken

Re^4: In place replacement from reference list by LanX (Saint) on Sep 09, 2022 at 01:27 UTC
> ... could be used to generate groups of multiple QRY... lines

One way to do it is the `(?{ collect() })(*FAIL)` trick:

use v5.12; # https://perlmonks.org/?node_id=11146744
use warnings;
use Data::Dump qw/pp dd/;

#pp
my ($paths,$cmds) = data();

my $re = join "|", map {"\Q$_\E" } @$paths;

for my $cmd (@$cmds) {
    my @matches;
    $cmd =~ m{
        ^CMD="
        ($re)
        #/?        # final / is missing
        (?!\.)     # no empty name before .extension
        ([^/]+)
        "$
        (?{push @matches,[$1,$2]})
        (*FAIL)
    }x;
    pp {$cmd => \@matches};
}

sub data {
    return
      [ qw(
        /a
        /a/b
        /a/b/c
        /b
        /b/c
        /c
        /ab
        /abc
        /abcd
      )] ,
      [ qw(
        CMD="/a/a.sh"
        CMD="/aa.sh"
        CMD="/ab.sh"
        CMD="/abc.sh"
        CMD="/a/bc.sh"
        CMD="/a/b/c.sh"
        CMD="/a/b/c/.sh"
        CMD="/a/b/cd.sh"
        CMD="/a/b/c/d.sh"
        CMD="/x/y.z"
        CMD="/a/xyz.sh"
        CMD="/abcd.sh"
        ),
        q(CMD="/a/very 'special' command.exe")
      ]
}

But while my results are in sync with

> QRY(4)="/a/bc.sh"
> QRY(4)="/ab/c.sh"

they differ significantly because my understanding is that the OP said that all CMDs have a missing final slash. I also disallowed files starting with a dot like `.sh`

{ "CMD=\"/a/a.sh\"" => [] }
{ "CMD=\"/aa.sh\"" => [["/a", "a.sh"]] }
{ "CMD=\"/ab.sh\"" => [["/a", "b.sh"]] }
{ "CMD=\"/abc.sh\"" => [["/a", "bc.sh"], ["/ab", "c.sh"]] }
{ "CMD=\"/a/bc.sh\"" => [["/a/b", "c.sh"]] }
{ "CMD=\"/a/b/c.sh\"" => [] }
{ "CMD=\"/a/b/c/.sh\"" => [] }
{ "CMD=\"/a/b/cd.sh\"" => [["/a/b/c", "d.sh"]] }
{ "CMD=\"/a/b/c/d.sh\"" => [] }
{ "CMD=\"/x/y.z\"" => [] }
{ "CMD=\"/a/xyz.sh\"" => [] }
{
  "CMD=\"/abcd.sh\"" => [["/a", "bcd.sh"], ["/ab", "cd.sh"], ["/abc", "d.sh"]],
}
{ "CMD=\"/a/very 'special' command.exe\"" => [] }

YMMV, but other interpretations of the OP are easily implemented by (un)commenting the two documented lines in the regex.

Update: output using

/?         # final / is missing
#(?!\.)    # no empty name before .extension

[output hidden behind a spoiler in the original thread]

Cheers Rolf
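
Stripped of the sample data, the collect-then-fail idea can be shown in a few lines. This is a minimal sketch, not a replacement for the code above: the three prefixes and the test string are made up, and it assumes Perl 5.18 or later, where a literal `(?{ ... })` block may sit next to an interpolated variable without `use re 'eval'`. `(*FAIL)` is the backtracking verb documented in perlre.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Made-up candidate directories; all three are prefixes of $cmd below.
my @prefixes = ('/a', '/ab', '/abc');
my $alt      = join '|', map { quotemeta $_ } @prefixes;

my $cmd = '/abcd.sh';
my @found;

# The overall match always fails; the useful work happens in the code block,
# which runs once for every alternative that matched at the anchor.
$cmd =~ m{
    ^ ($alt)                  # try one candidate directory
    (?{ push @found, $1 })    # record the candidate that matched
    (*FAIL)                   # reject it, forcing a retry with the next one
}x;

print "$_\n" for sort @found;
```

The regex engine does the looping here, which is why no explicit loop over the candidate paths appears in the code.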
