in reply to In place replacement from reference list

G'day Misstre,

Welcome to the Monastery.

I see your own tentative solution, and all that follow, use regexes. Perl's string handling functions are typically faster than regexes. Depending on how many "thousands of these mistakes" there are, this might make a difference. Here's a solution that doesn't use any regexes.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    use File::Copy;

    my $ref_file = 'ref.txt';
    my $full_file = 'full.txt';
    my $bu_file = "$full_file.BU";

    #---------------------------------------------
    # TODO - for demo only; remove for production
    copy('original_full.txt', $full_file);
    #---------------------------------------------

    copy($full_file, $bu_file);

    my %ref_paths;
    _get_ref_paths($ref_file, \%ref_paths);

    {
        open my $ifh, '<', $bu_file;
        open my $ofh, '>', $full_file;

        while (<$ifh>) {
            chomp;
            my $cmd = substr $_, 5, -1;
            my @possibles = @{_assess_full_path($cmd, \%ref_paths)};

            if (@possibles == 1) {
                $ofh->print(qq{CMD="$possibles[0]"\n});
            }
            elsif (@possibles > 1) {
                $ofh->print(qq{QRY($.)="$_"\n}) for @possibles;
            }
            else {
                $ofh->print(qq{WTF($.)="$cmd"\n});
            }
        }
    }

    #---------------------------------------------
    # TODO - for demo only; remove for production
    print "\n*** ref file: '$ref_file'\n";
    system cat => $ref_file;
    print "\n*** bu file: '$bu_file'\n";
    system cat => $bu_file;
    print "\n*** full file: '$full_file'\n";
    system cat => $full_file;
    #---------------------------------------------

    sub _assess_full_path {
        my ($cmd, $ref_paths) = @_;

        my $possibles = [];

        my $pos = 1 + rindex $cmd, '/';
        my $start = substr $cmd, 0, $pos;
        my $end = substr $cmd, $pos;
        my $max = substr $cmd, 0, rindex($cmd, '.') - 1;

        if (exists $ref_paths->{$start}) {
            for my $key (keys %{$ref_paths->{$start}}) {
                my $dir = "$start$key";
                if (0 == index $max, $dir) {
                    my $full_path = join '/', $dir, substr $cmd, length $dir;
                    $full_path =~ y{/}{/}s;
                    push @$possibles, $full_path;
                }
            }
        }

        return $possibles;
    }

    sub _get_ref_paths {
        my ($ref_file, $ref_paths) = @_;

        open my $fh, '<', $ref_file;

        while (<$fh>) {
            chomp;
            my $end = substr $_, rindex($_, '/') + 1;
            substr $_, rindex($_, '/') + 1, length($_), '';
            $ref_paths->{$_}{$end} = 1;
            $ref_paths->{"$_$end/"}{''} = 1;
        }

        return;
    }
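For anyone unfamiliar with the string functions used in place of regexes, here is a minimal sketch of the two core idioms: splitting a path at its last slash with rindex and substr, and testing for a prefix with index. The sample values are made up for illustration and are not taken from the OP's data.

```perl
use strict;
use warnings;

# Hypothetical sample values, chosen only for illustration.
my $cmd = '/a/b/cd.sh';

# Split at the last slash: everything up to and including it, then the rest.
my $pos   = 1 + rindex $cmd, '/';
my $start = substr $cmd, 0, $pos;
my $end   = substr $cmd, $pos;

# Prefix test without a regex: index returns 0 only when $dir
# occurs at the very beginning of $cmd.
my $dir       = '/a/b/c';
my $is_prefix = (0 == index $cmd, $dir);

print "start=$start end=$end prefix=", ($is_prefix ? 'yes' : 'no'), "\n";
```

The prefix test is roughly what `$cmd =~ /^\Q$dir\E/` would do, without compiling a pattern.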

I dummied up some files to test this. Here's a sample run's output:

    *** ref file: 'ref.txt'
    /a
    /a/b
    /a/b/c
    /b
    /b/c
    /c
    /ab
    /abc
    /abcd

    *** bu file: 'full.txt.BU'
    CMD="/a/a.sh"
    CMD="/aa.sh"
    CMD="/ab.sh"
    CMD="/abc.sh"
    CMD="/a/bc.sh"
    CMD="/a/b/c.sh"
    CMD="/a/b/c/.sh"
    CMD="/a/b/cd.sh"
    CMD="/a/b/c/d.sh"
    CMD="/x/y.z"
    CMD="/a/xyz.sh"
    CMD="/abcd.sh"
    CMD="/a/very 'special' command.exe"

    *** full file: 'full.txt'
    CMD="/a/a.sh"
    CMD="/a/a.sh"
    CMD="/a/b.sh"
    QRY(4)="/a/bc.sh"
    QRY(4)="/ab/c.sh"
    QRY(5)="/a/b/c.sh"
    QRY(5)="/a/bc.sh"
    CMD="/a/b/c.sh"
    WTF(7)="/a/b/c/.sh"
    QRY(8)="/a/b/cd.sh"
    QRY(8)="/a/b/c/d.sh"
    CMD="/a/b/c/d.sh"
    WTF(10)="/x/y.z"
    CMD="/a/xyz.sh"
    QRY(12)="/a/bcd.sh"
    QRY(12)="/abc/d.sh"
    QRY(12)="/ab/cd.sh"
    CMD="/a/very 'special' command.exe"

Notes:

— Ken

Re^2: In place replacement from reference list
by LanX (Saint) on Sep 07, 2022 at 12:59 UTC
    > Perl's string handling functions are typically faster than regexes.

    but in this case you can OR together all possible paths in a single regex,

    =~ m/^$path1|$path2|...etc/

    and because of automatic Trie optimization this will be significantly faster than checking in a loop.
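A minimal sketch of how such an alternation might be built from a list of paths. The path list and the helper name `longest_dir` are hypothetical, and the alternatives are wrapped in a group so that the ^ anchor applies to all of them. Each path goes through quotemeta so that dots and other metacharacters match literally. This only finds a single directory prefix per command; it makes no claim about speed and does not reproduce the multi-candidate QRY output.

```perl
use strict;
use warnings;

# Hypothetical reference paths, for illustration only.
my @paths = qw( /a /a/b /a/b/c /ab );

# Longest first, so the ordered alternation prefers the longest directory.
my $alt = join '|',
          map  { quotemeta($_) }
          sort { length($b) <=> length($a) } @paths;

# The capture group also makes ^ apply to every alternative.
my $re = qr/^($alt)/;

# Hypothetical helper: returns the matched directory prefix, or undef.
sub longest_dir {
    my ($cmd) = @_;
    return $cmd =~ $re ? $1 : undef;
}

for my $cmd (qw( /a/b/cd.sh /abc.sh /x/y.z )) {
    my $dir = longest_dir($cmd);
    print defined $dir
        ? "$cmd starts with $dir\n"
        : "$cmd matches no reference path\n";
}
```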

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      =~ m/^$path1|$path2|...etc/

      I know LanX and kcott understand this, but here's a general side note. In a regex like the one quoted above, the ^ anchor is associated only with the first alternative, i.e., ^$path1. None of the other alternatives are anchored.

      The "precedence" of Perl ordered alternation is very low. This applies generally, so in
          $str =~ m/ a b c | d | e | f g h /x;
      the regex pattern "atoms" a b c comprise the first possible alternation, then d if the first alternation cannot match, then e, then the f g h sequence.

      Use grouping, typically non-capturing, to disambiguate precedence. E.g., in
          $str =~ m/ a b (?: c | d e | ... | etc) f g /x;
      the sequence a b is required for a match, then the first of c or d e or ... or etc, then the required f g sequence.
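A small sketch of the difference, using a made-up test string: the ungrouped pattern matches because its second alternative is free to match anywhere, while the grouped pattern requires either alternative at the start of the string.

```perl
use strict;
use warnings;

# Hypothetical test string: "/b" occurs, but not at the start.
my $str = 'xx/b/c';

# Ungrouped: ^ binds only to "/a", so "/b" may match anywhere.
my $loose = $str =~ m{^/a|/b} ? 1 : 0;

# Grouped: ^ applies to both alternatives.
my $strict = $str =~ m{^(?:/a|/b)} ? 1 : 0;

print "loose=$loose strict=$strict\n";
```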


      Give a man a fish:  <%-{-{-{-<

        I didn't think about that; it wasn't meant to be production code, just an illustration.

        Thanks for pointing that out! :)

        Cheers Rolf

      "... in this case ... =~ m/^$path1|$path2|...etc/ ... significantly faster than checking in a loop."

      That's a use of alternation with which I'm unfamiliar.

      Please enlighten me as to how =~ m/^$path1|$path2|...etc/ could be used to generate groups of multiple QRY... lines, for the same input line, without a loop. E.g.

          QRY(4)="/a/bc.sh"
          QRY(4)="/ab/c.sh"

      — Ken

        > ... could be used to generate groups of multiple QRY... lines

        One way to do it is the (?{ collect() })(*FAIL) trick

            use v5.12;    # https://perlmonks.org/?node_id=11146744
            use warnings;
            use Data::Dump qw/pp dd/;

            #pp
            my ($paths,$cmds) = data();

            my $re = join "|", map {"\Q$_\E" } @$paths;

            for my $cmd (@$cmds) {
                my @matches;
                $cmd =~ m{
                             ^CMD="
                             ($re)
                             #/?          # final / is missing
                             (?!\.)       # no empty name before .extension
                             ([^/]+)
                             "$
                             (?{push @matches,[$1,$2]})
                             (*FAIL)
                     }x;
                pp {$cmd => \@matches};
            }

            sub data {
                return
                  [ qw(
                          /a
                          /a/b
                          /a/b/c
                          /b
                          /b/c
                          /c
                          /ab
                          /abc
                          /abcd
                  )]
                  ,
                  [ qw(
                          CMD="/a/a.sh"
                          CMD="/aa.sh"
                          CMD="/ab.sh"
                          CMD="/abc.sh"
                          CMD="/a/bc.sh"
                          CMD="/a/b/c.sh"
                          CMD="/a/b/c/.sh"
                          CMD="/a/b/cd.sh"
                          CMD="/a/b/c/d.sh"
                          CMD="/x/y.z"
                          CMD="/a/xyz.sh"
                          CMD="/abcd.sh"
                     ),
                    q(CMD="/a/very 'special' command.exe")
                  ]
            }

        But while my results agree with yours for

        > QRY(4)="/a/bc.sh"
        > QRY(4)="/ab/c.sh"

        they differ significantly elsewhere, because my understanding is that the OP said every CMD is missing its final slash. I also disallowed file names that start with a dot, like .sh

            { "CMD=\"/a/a.sh\"" => [] }
            { "CMD=\"/aa.sh\"" => [["/a", "a.sh"]] }
            { "CMD=\"/ab.sh\"" => [["/a", "b.sh"]] }
            { "CMD=\"/abc.sh\"" => [["/a", "bc.sh"], ["/ab", "c.sh"]] }
            { "CMD=\"/a/bc.sh\"" => [["/a/b", "c.sh"]] }
            { "CMD=\"/a/b/c.sh\"" => [] }
            { "CMD=\"/a/b/c/.sh\"" => [] }
            { "CMD=\"/a/b/cd.sh\"" => [["/a/b/c", "d.sh"]] }
            { "CMD=\"/a/b/c/d.sh\"" => [] }
            { "CMD=\"/x/y.z\"" => [] }
            { "CMD=\"/a/xyz.sh\"" => [] }
            {
              "CMD=\"/abcd.sh\"" => [["/a", "bcd.sh"], ["/ab", "cd.sh"], ["/abc", "d.sh"]],
            }
            { "CMD=\"/a/very 'special' command.exe\"" => [] }

        YMMV, but other interpretations of the OP are easily implemented by (un)commenting the two documented lines in the regex.

        update
        Output using

            /?           # final / is missing
            #(?!\.)      # no empty name before .extension

        Cheers Rolf