Misstre has asked for the wisdom of the Perl Monks concerning the following question:

Apologies in advance if I'm posting this wrong, I kind of fell into a new role that needs more Perl knowledge that I'm used to, but trying to get through it.

So there's a file (will call it full file) that was created wrong by my processor, it can't easily be regenerated so I need to try and correct it. Perl is really the only scripting language I'm familiar with to even attempt at modifying this type of data, if there's something easier just let me know.

For starters the file was built from multiple data fields, and one of the important ones was joined without a separator, an example is below.

path: /unix/path/to/folder/is/here
file: file_name.sh
joined: /unix/path/to/folder/is/herefile_name.sh

As you can see this example isn't too bad, it could be manually updated to joined: /unix/path/to/folder/is/here/file_name.sh
The problem is there's thousands of these mistakes in an attribute called CMD, so it ends up looking like CMD="/unix/path/to/folder/is/herefile_name.sh" along with other attributes for each job like name, system, etc.

My thought was to load the file I need to update into an array and then try to parse the array one line at a time. I have a list of all the folder paths in a .txt file that I was hoping to use as a reference file somehow. So it would read the folder paths into an array, check each line of the full file and replace the value with the reference value if found and write it back to the file.
$references = join '|', map {chomp; qr/\Q$_\E/} <$file_paths>; $regex = qr|\b($references)\b|; while (<>) { $_ =~ s/($regex)(.*)/$1\/$2/; print; }
Essentially I have a bunch of lines scatted throughout the file that have paths that need to be updated, I think updating the file from a reference list with regex is my best option from researching, the concept should work in theroy I just can't really get my head around the implementation. Any help is appreciated.

Replies are listed 'Best First'.
Re: In place replacement from reference list
by hv (Prior) on Sep 06, 2022 at 17:45 UTC

    So it would read the folder paths into an array, check each line of the full file and replace the value with the reference value if found and write it back to the file.

    That sounds sensible, as long as the list of folder paths isn't too big.

    An important gotcha to consider: is any valid path a prefix of another valid path? If both "/foo" and "/foob" are valid, then "/foobar" could be either "/foo" + "bar" or "/foob" + "ar", and you will need some other heuristic to help decide between those interpretations. (A particularly likely case is that both "/foo" and "/foo/baz" are valid, but that this case can still be decided since the filename cannot contain "/".)

    If neither of those are problems, your example code looks close to something that would work; however you probably need to remove the "\b" at either end of the match - you need to match 'CMD="/unix/path/to/folder/is/here' followed by 'file_name.sh"'. If you can include the additional "CMD=" prefix as part of the pattern, that would also help both to ensure it doesn't change anything it shouldn't and to let the regexp run faster: $regex = qr{ CMD=" (?: $references ) }x.

    I would also recommend showing the input line in a warning if you have a CMD attribute but the regexp does not match it: if the program misses something, it's usually a lot easier to correct the program and run it again over the original file than to attempt multiple successive fixups.

      To solve something like this, in addition to the error-checking and informing of cases where ambiguity may exist, I would gravitate toward reverse sorting the directory list by length, then parsing/matching for the longest possible strings first. This would make the directory tree match to its fullest extent possible.

      If there are no ambiguities possible, given the nomenclature of the directories and files, then this may be an unnecessary step.

      Blessings,

      ~Polyglot~

Re: In place replacement from reference list
by GrandFather (Saint) on Sep 06, 2022 at 21:34 UTC

    Do the files exist in the given paths on the machine running the code? If so, you can validate each "edit".

    I'd alter the regex creation slightly:

    use strict; use warnings; my $paths = <<PATHS; /unix/path/to/folder/is/here /unix/path/to/folder/is/somewhere /unix/path/to/folder/is/herewith PATHS open my $file_paths, '<', \$paths; my $references = join '|', map {chomp; qr/\Q$_\E/} sort {length($b) <=> length($a)} <$file_paths>; print join "\n", split ('\|', $references), ''; for my $test ( "/unix/path/to/folder/is/herewitha_file.txt", "/unix/path/to/folder/is/herea_file.txt" ) { $test =~ /($references)/; print "$1\n"; }

    Prints:

    (?^:\/unix\/path\/to\/folder\/is\/somewhere) (?^:\/unix\/path\/to\/folder\/is\/herewith) (?^:\/unix\/path\/to\/folder\/is\/here) /unix/path/to/folder/is/herewith /unix/path/to/folder/is/here

    which ensures longer matches are made first. In the example code commenting out the sort results in:

    (?^:\/unix\/path\/to\/folder\/is\/here) (?^:\/unix\/path\/to\/folder\/is\/somewhere) (?^:\/unix\/path\/to\/folder\/is\/herewith) /unix/path/to/folder/is/here /unix/path/to/folder/is/here
    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: In place replacement from reference list (updated)
by AnomalousMonk (Archbishop) on Sep 06, 2022 at 21:50 UTC

    Here's another example based on hv's suggestion:

    Win8 Strawberry 5.30.3.1 (64) Tue 09/06/2022 17:27:14 C:\@Work\Perl\monks >perl use 5.010; # need \K regex extension use strict; use warnings; use autodie; use Data::Dump qw(dd); # for debug my $file_paths = <<'EOFILEPATHS'; # known file paths /unix/path/to/folder/is/here /unix/path/to/folder/also /unix/path/to/another/folder EOFILEPATHS open my $fh_reference_paths, '<', \$file_paths; my ($rx_reference) = map qr{ $_ }xms, join ' | ', map quotemeta, reverse sort map { chomp; $_; } <$fh_reference_paths> ; # dd '$rx_reference', $rx_reference; # for debug close $fh_reference_paths; my $b0rken = <<'EOBORKEN'; # paths with appended files, some borked CMD="/unix/path/to/folder/is/herefile_name.sh" #CMD="/unix/path/to/folder/is/herefile_name.sh" XYZ="/unix/path/to/folder/is/herefile_name.sh" XYZ="/unix/path/to/folder/is/here/file_name.ok" CMD = "/unix/path/to/another/folder/some_file.ok" CMD="/unix/path/to/folder/is/here/file_name.ok" CMD = "/unix/path/to/folder/alsosome_file.xy" EOBORKEN open my $fh_b0rken, '<', \$b0rken; my $add = '//'; # make addition evident for development BORKEN: while (<$fh_b0rken>) { $_ =~ s{ \A \s* CMD \s* = \s* " $rx_reference (?! /) \K }{$add}xms +; print $_; } close $fh_b0rken; ^Z CMD="/unix/path/to/folder/is/here//file_name.sh" #CMD="/unix/path/to/folder/is/herefile_name.sh" XYZ="/unix/path/to/folder/is/herefile_name.sh" XYZ="/unix/path/to/folder/is/here/file_name.ok" CMD = "/unix/path/to/another/folder/some_file.ok" CMD="/unix/path/to/folder/is/here/file_name.ok" CMD = "/unix/path/to/folder/also//some_file.xy"
    This still has the problem noted by hv of mishandling ambiguous path-filenames. Given paths
        /unix/path/foo
        /unix/path/foob
    how does one properly "correct" /unix/path/foobar.sh? The regex is designed to match the longest path (/unix/path/foob) in this case. See Building Regex Alternations Dynamically. (Of course, this problem goes away completely if one also knows all possible filenames that might be involved.)

    Update: One approach to the ambiguous path-filename problem would be to capture all matches of the $rx_reference regex and report (warn or die) if there's more than one possibility. (If there's just one possibility, just fix it.) (This code has an interaction between regex (?{ CODE }) and a lexical variable, so it must use Perl version 5.18+. If your Perl is earlier, let me know and I can supply a simple fix. I'll just put the earlier version in as a comment. Also, use (?!) vice (*FAIL) pre-5.10.)

    my @ambiguous; BORKEN: while (<$fh_b0rken>) { my @b0rken; # can use with perl version 5.18+ # local our @b0rken; use re 'eval'; # pre-5.18 version $_ =~ m{ \A (\s* CMD \s* = \s* " $rx_reference) ((?! /) [^\n]*) \n? \z (?{ push @b0rken, [ $1, $2 ] }) (*FAIL) # (?!) pre-5.10 }xms; if (@b0rken > 1) { # more than 1 possible breakage: report ambigu +ity chomp; push @ambiguous, join "\n", "'$_' ambiguous:", map " '$_->[0]'?'$_->[1]'", @b0rken; } elsif (@b0rken == 1) { # just 1 breakage: fix and output print "$b0rken[0]->[0]$add$b0rken[0]->[1]\n"; } else { # path looks ok: just output print; } } # end while BORKEN ... warn $_ for @ambiguous; # report ambiguities


    Give a man a fish:  <%-{-{-{-<

Re: In place replacement from reference list
by kcott (Archbishop) on Sep 07, 2022 at 12:11 UTC

    G'day Misstre,

    Welcome to the Monastery.

    I see your own tentative solution, and all that follow, use regexes. Perl's string handling functions are typically faster than regexes. Depending on how many "thousands of these mistakes" there are, this might make a difference. Here's a solution that doesn't use any regexes.

    #!/usr/bin/env perl use strict; use warnings; use autodie; use File::Copy; my $ref_file = 'ref.txt'; my $full_file = 'full.txt'; my $bu_file = "$full_file.BU"; #--------------------------------------------- # TODO - for demo only; remove for production copy('original_full.txt', $full_file); #--------------------------------------------- copy($full_file, $bu_file); my %ref_paths; _get_ref_paths($ref_file, \%ref_paths); { open my $ifh, '<', $bu_file; open my $ofh, '>', $full_file; while (<$ifh>) { chomp; my $cmd = substr $_, 5, -1; my @possibles = @{_assess_full_path($cmd, \%ref_paths)}; if (@possibles == 1) { $ofh->print(qq{CMD="$possibles[0]"\n}); } elsif (@possibles > 1) { $ofh->print(qq{QRY($.)="$_"\n}) for @possibles; } else { $ofh->print(qq{WTF($.)="$cmd"\n}); } } } #--------------------------------------------- # TODO - for demo only; remove for production print "\n*** ref file: '$ref_file'\n"; system cat => $ref_file; print "\n*** bu file: '$bu_file'\n"; system cat => $bu_file; print "\n*** full file: '$full_file'\n"; system cat => $full_file; #--------------------------------------------- sub _assess_full_path { my ($cmd, $ref_paths) = @_; my $possibles = []; my $pos = 1 + rindex $cmd, '/'; my $start = substr $cmd, 0, $pos; my $end = substr $cmd, $pos; my $max = substr $cmd, 0, rindex($cmd, '.') - 1; if (exists $ref_paths->{$start}) { for my $key (keys %{$ref_paths->{$start}}) { my $dir = "$start$key"; if (0 == index $max, $dir) { my $full_path = join '/', $dir, substr $cmd, length $dir; $full_path =~ y{/}{/}s; push @$possibles, $full_path; } } } return $possibles; } sub _get_ref_paths { my ($ref_file, $ref_paths) = @_; open my $fh, '<', $ref_file; while (<$fh>) { chomp; my $end = substr $_, rindex($_, '/') + 1; substr $_, rindex($_, '/') + 1, length($_), ''; $ref_paths->{$_}{$end} = 1; $ref_paths->{"$_$end/"}{''} = 1; } return; }

    I dummied up some files to test this. Here's a sample run's output:

    *** ref file: 'ref.txt' /a /a/b /a/b/c /b /b/c /c /ab /abc /abcd *** bu file: 'full.txt.BU' CMD="/a/a.sh" CMD="/aa.sh" CMD="/ab.sh" CMD="/abc.sh" CMD="/a/bc.sh" CMD="/a/b/c.sh" CMD="/a/b/c/.sh" CMD="/a/b/cd.sh" CMD="/a/b/c/d.sh" CMD="/x/y.z" CMD="/a/xyz.sh" CMD="/abcd.sh" CMD="/a/very 'special' command.exe" *** full file: 'full.txt' CMD="/a/a.sh" CMD="/a/a.sh" CMD="/a/b.sh" QRY(4)="/a/bc.sh" QRY(4)="/ab/c.sh" QRY(5)="/a/b/c.sh" QRY(5)="/a/bc.sh" CMD="/a/b/c.sh" WTF(7)="/a/b/c/.sh" QRY(8)="/a/b/cd.sh" QRY(8)="/a/b/c/d.sh" CMD="/a/b/c/d.sh" WTF(10)="/x/y.z" CMD="/a/xyz.sh" QRY(12)="/a/bcd.sh" QRY(12)="/abc/d.sh" QRY(12)="/ab/cd.sh" CMD="/a/very 'special' command.exe"

    Notes:

    • You asked about "In place replacement". This is possible using, for example, the core Tie::File module. However, I recommend that you make a backup copy: that not only acts as a safety net, but also can be used as a readonly source from which to create your "fixed" full file. This is what I did.
    • My 'original_full.txt' is identical to the 'full.txt.BU' shown above; it allows multiple demo and test runs, using the same data, without needing any manual intervention (do note the comment: "remove for production").
    • Your programs should always use the strict and warnings pragmata. The autodie pragma saves you a lot of tedious and error-prone work; let Perl handle I/O exceptions for you — I use this a lot and highly recommend it.
    • File::Copy is a core module: you'll have it already; no installation from CPAN required. It's pretty straightforward; I've only used its copy() function.
    • You talked about a "reference file" that you had but didn't show an example. I just dummied one up (ref.txt) for my use; replace with your version.
    • The output to 'full.txt' has three types:
      • CMD (Command) - an unambiguous solution was found and is written in the original CMD="..." format.
      • QRY (Query) - more than one potential solution was found; the line number of the original file is shown; requires a decision and manual intervention.
      • WTF (What's This File) - some form of bogus input was detected; the line number of the original file is shown. You may want to delete this, update the reference file, or take some other action; regardless, manual intervention is required here also.
    • Overall, the code is very straightforward and should run on whatever version of Perl you have. (For future reference, if you tell us what version you're running, there may be better solutions using more up-to-date features.) All of the functions and operators that I've used can be found in the "Perl Online Documentation"; if you get stumped on anything, just ask.

    — Ken

      > Perl's string handling functions are typically faster than regexes.

      but in this case you can or all possible paths in a regex,

      =~ m/^$path1|$path2|...etc/

      and because of automatic Trie optimization this will be significantly faster than checking in a loop.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        =~ m/^$path1|$path2|...etc/

        I know LanX and kcott understand this, but here's a general side note. In a regex expression like the one quoted above, the ^ anchor is associated only with the first alternation, i.e., ^$path1. None of the other alternations are anchored.

        The "precedence" of Perl ordered alternation is very low. This applies generally, so in
            $str =~ m/ a b c | d | e | f g h /x;
        the regex pattern "atoms" a b c comprise the first possible alternation, then d if the first alternation cannot match, then e, then the f g h sequence.

        Use grouping, typically non-capturing, to disambiguate precedence. E.g., in
            $str =~ m/ a b (?: c | d e | ... | etc) f g /x;
        the sequence a b is required for a match, then the first of c or d e or ... or etc, then the required f g sequence.


        Give a man a fish:  <%-{-{-{-<

        "... in this case ... =~ m/^$path1|$path2|...etc/ ... significantly faster than checking in a loop."

        That's a use of alternation with which I'm unfamiliar.

        Please enlighten me as to how =~ m/^$path1|$path2|...etc/ could be used to generate groups of multiple QRY... lines, for the same input line, without a loop. E.g.

        QRY(4)="/a/bc.sh" QRY(4)="/ab/c.sh"

        — Ken

Re: In place replacement from reference list
by LanX (Saint) on Sep 06, 2022 at 23:24 UTC
    this can't be solved exactly if you have ambiguous paths, like
    /a/b/cd/e.sh /a/b/c/de.sh

    both will produce /a/b/cde.sh

    But which one is correct?

    You will need to check all possible included paths and then if the "remaining" file exists.

    Once you have multiple hits, you'll need to use heuristics.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: In place replacement from reference list
by Anonymous Monk on Sep 08, 2022 at 07:45 UTC
    Hi,

    You have been given great advice by much cleverer people than me. My two penneth is to make two copies of the orlriginal file and keep them in different safe places until the task that you need the corrected file for has been completed and forgotten.

    I am a stupid man, made slightly wiser by bitter experience, it's best to be prepared for another try.

    J.C.