So it would read the folder paths into an array, check each line of the full file and replace the value with the reference value if found and write it back to the file.
That sounds sensible, as long as the list of folder paths isn't too big.
An important gotcha to consider: is any valid path a prefix of another valid path? If both "/foo" and "/foob" are valid, then "/foobar" could be either "/foo" + "bar" or "/foob" + "ar", and you will need some other heuristic to help decide between those interpretations. (A particularly likely case is that both "/foo" and "/foo/baz" are valid, but that this case can still be decided since the filename cannot contain "/".)
If neither of those are problems, your example code looks close to something that would work; however you probably need to remove the "\b" at either end of the match - you need to match 'CMD="/unix/path/to/folder/is/here' followed by 'file_name.sh"'. If you can include the additional "CMD=" prefix as part of the pattern, that would also help both to ensure it doesn't change anything it shouldn't and to let the regexp run faster: $regex = qr{ CMD=" (?: $references ) }x.
I would also recommend showing the input line in a warning if you have a CMD attribute but the regexp does not match it: if the program misses something, it's usually a lot easier to correct the program and run it again over the original file than to attempt multiple successive fixups.
| [reply] [d/l] |
To solve something like this, in addition to the error-checking and informing of cases where ambiguity may exist, I would gravitate toward reverse sorting the directory list by length, then parsing/matching for the longest possible strings first. This would make the directory tree match to its fullest extent possible.
If there are no ambiguities possible, given the nomenclature of the directories and files, then this may be an unnecessary step.
| [reply] |
Do the files exist in the given paths on the machine running the code? If so, you can validate each "edit".
I'd alter the regex creation slightly:
use strict;
use warnings;
my $paths = <<PATHS;
/unix/path/to/folder/is/here
/unix/path/to/folder/is/somewhere
/unix/path/to/folder/is/herewith
PATHS
open my $file_paths, '<', \$paths;
my $references = join '|', map {chomp; qr/\Q$_\E/}
sort {length($b) <=> length($a)}
<$file_paths>;
print join "\n", split ('\|', $references), '';
for my $test (
"/unix/path/to/folder/is/herewitha_file.txt",
"/unix/path/to/folder/is/herea_file.txt"
) {
$test =~ /($references)/;
print "$1\n";
}
Prints:
(?^:\/unix\/path\/to\/folder\/is\/somewhere)
(?^:\/unix\/path\/to\/folder\/is\/herewith)
(?^:\/unix\/path\/to\/folder\/is\/here)
/unix/path/to/folder/is/herewith
/unix/path/to/folder/is/here
which ensures longer matches are made first. In the example code commenting out the sort results in:
(?^:\/unix\/path\/to\/folder\/is\/here)
(?^:\/unix\/path\/to\/folder\/is\/somewhere)
(?^:\/unix\/path\/to\/folder\/is\/herewith)
/unix/path/to/folder/is/here
/unix/path/to/folder/is/here
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
| [reply] [d/l] [select] |
Win8 Strawberry 5.30.3.1 (64) Tue 09/06/2022 17:27:14
C:\@Work\Perl\monks
>perl
use 5.010; # need \K regex extension
use strict;
use warnings;
use autodie;
use Data::Dump qw(dd); # for debug
my $file_paths = <<'EOFILEPATHS'; # known file paths
/unix/path/to/folder/is/here
/unix/path/to/folder/also
/unix/path/to/another/folder
EOFILEPATHS
open my $fh_reference_paths, '<', \$file_paths;
my ($rx_reference) =
map qr{ $_ }xms,
join ' | ',
map quotemeta,
reverse sort
map { chomp; $_; }
<$fh_reference_paths>
;
# dd '$rx_reference', $rx_reference; # for debug
close $fh_reference_paths;
my $b0rken = <<'EOBORKEN'; # paths with appended files, some borked
CMD="/unix/path/to/folder/is/herefile_name.sh"
#CMD="/unix/path/to/folder/is/herefile_name.sh"
XYZ="/unix/path/to/folder/is/herefile_name.sh"
XYZ="/unix/path/to/folder/is/here/file_name.ok"
CMD = "/unix/path/to/another/folder/some_file.ok"
CMD="/unix/path/to/folder/is/here/file_name.ok"
CMD = "/unix/path/to/folder/alsosome_file.xy"
EOBORKEN
open my $fh_b0rken, '<', \$b0rken;
my $add = '//'; # make addition evident for development
BORKEN:
while (<$fh_b0rken>) {
$_ =~ s{ \A \s* CMD \s* = \s* " $rx_reference (?! /) \K }{$add}xms
+;
print $_;
}
close $fh_b0rken;
^Z
CMD="/unix/path/to/folder/is/here//file_name.sh"
#CMD="/unix/path/to/folder/is/herefile_name.sh"
XYZ="/unix/path/to/folder/is/herefile_name.sh"
XYZ="/unix/path/to/folder/is/here/file_name.ok"
CMD = "/unix/path/to/another/folder/some_file.ok"
CMD="/unix/path/to/folder/is/here/file_name.ok"
CMD = "/unix/path/to/folder/also//some_file.xy"
This still has the problem noted by hv of mishandling ambiguous path-filenames. Given paths
/unix/path/foo
/unix/path/foob
how does one properly "correct" /unix/path/foobar.sh? The regex is designed to match the longest path (/unix/path/foob) in this case. See Building Regex Alternations Dynamically. (Of course, this problem goes away completely if one also knows all possible filenames
that might be involved.)
Update: One approach to the ambiguous path-filename problem would be to capture all matches of the $rx_reference regex and report (warn or die) if there's more than one possibility. (If there's just one possibility, just fix it.) (This code has an interaction between regex (?{ CODE }) and a lexical variable, so it must use Perl version 5.18+. If your Perl is earlier, let me know and I can supply a simple fix. I'll just put the earlier version in as a comment. Also, use
(?!) vice (*FAIL) pre-5.10.)
my @ambiguous;
BORKEN:
while (<$fh_b0rken>) {
my @b0rken; # can use with perl version 5.18+
# local our @b0rken; use re 'eval'; # pre-5.18 version
$_ =~ m{
\A (\s* CMD \s* = \s* " $rx_reference) ((?! /) [^\n]*) \n? \z
(?{ push @b0rken, [ $1, $2 ] }) (*FAIL) # (?!) pre-5.10
}xms;
if (@b0rken > 1) { # more than 1 possible breakage: report ambigu
+ity
chomp;
push @ambiguous, join "\n",
"'$_' ambiguous:", map " '$_->[0]'?'$_->[1]'", @b0rken;
}
elsif (@b0rken == 1) { # just 1 breakage: fix and output
print "$b0rken[0]->[0]$add$b0rken[0]->[1]\n";
}
else { # path looks ok: just output
print;
}
} # end while BORKEN
...
warn $_ for @ambiguous; # report ambiguities
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
G'day Misstre,
Welcome to the Monastery.
I see your own tentative solution, and all that follow, use regexes.
Perl's string handling functions are typically faster than regexes.
Depending on how many "thousands of these mistakes" there are, this might make a difference.
Here's a solution that doesn't use any regexes.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use File::Copy;
my $ref_file = 'ref.txt';
my $full_file = 'full.txt';
my $bu_file = "$full_file.BU";
#---------------------------------------------
# TODO - for demo only; remove for production
copy('original_full.txt', $full_file);
#---------------------------------------------
copy($full_file, $bu_file);
my %ref_paths;
_get_ref_paths($ref_file, \%ref_paths);
{
open my $ifh, '<', $bu_file;
open my $ofh, '>', $full_file;
while (<$ifh>) {
chomp;
my $cmd = substr $_, 5, -1;
my @possibles
= @{_assess_full_path($cmd, \%ref_paths)};
if (@possibles == 1) {
$ofh->print(qq{CMD="$possibles[0]"\n});
}
elsif (@possibles > 1) {
$ofh->print(qq{QRY($.)="$_"\n}) for @possibles;
}
else {
$ofh->print(qq{WTF($.)="$cmd"\n});
}
}
}
#---------------------------------------------
# TODO - for demo only; remove for production
print "\n*** ref file: '$ref_file'\n";
system cat => $ref_file;
print "\n*** bu file: '$bu_file'\n";
system cat => $bu_file;
print "\n*** full file: '$full_file'\n";
system cat => $full_file;
#---------------------------------------------
sub _assess_full_path {
my ($cmd, $ref_paths) = @_;
my $possibles = [];
my $pos = 1 + rindex $cmd, '/';
my $start = substr $cmd, 0, $pos;
my $end = substr $cmd, $pos;
my $max = substr $cmd, 0, rindex($cmd, '.') - 1;
if (exists $ref_paths->{$start}) {
for my $key (keys %{$ref_paths->{$start}}) {
my $dir = "$start$key";
if (0 == index $max, $dir) {
my $full_path
= join '/', $dir, substr $cmd, length $dir;
$full_path =~ y{/}{/}s;
push @$possibles, $full_path;
}
}
}
return $possibles;
}
sub _get_ref_paths {
my ($ref_file, $ref_paths) = @_;
open my $fh, '<', $ref_file;
while (<$fh>) {
chomp;
my $end = substr $_, rindex($_, '/') + 1;
substr $_, rindex($_, '/') + 1, length($_), '';
$ref_paths->{$_}{$end} = 1;
$ref_paths->{"$_$end/"}{''} = 1;
}
return;
}
I dummied up some files to test this.
Here's a sample run's output:
*** ref file: 'ref.txt'
/a
/a/b
/a/b/c
/b
/b/c
/c
/ab
/abc
/abcd
*** bu file: 'full.txt.BU'
CMD="/a/a.sh"
CMD="/aa.sh"
CMD="/ab.sh"
CMD="/abc.sh"
CMD="/a/bc.sh"
CMD="/a/b/c.sh"
CMD="/a/b/c/.sh"
CMD="/a/b/cd.sh"
CMD="/a/b/c/d.sh"
CMD="/x/y.z"
CMD="/a/xyz.sh"
CMD="/abcd.sh"
CMD="/a/very 'special' command.exe"
*** full file: 'full.txt'
CMD="/a/a.sh"
CMD="/a/a.sh"
CMD="/a/b.sh"
QRY(4)="/a/bc.sh"
QRY(4)="/ab/c.sh"
QRY(5)="/a/b/c.sh"
QRY(5)="/a/bc.sh"
CMD="/a/b/c.sh"
WTF(7)="/a/b/c/.sh"
QRY(8)="/a/b/cd.sh"
QRY(8)="/a/b/c/d.sh"
CMD="/a/b/c/d.sh"
WTF(10)="/x/y.z"
CMD="/a/xyz.sh"
QRY(12)="/a/bcd.sh"
QRY(12)="/abc/d.sh"
QRY(12)="/ab/cd.sh"
CMD="/a/very 'special' command.exe"
Notes:
-
You asked about "In place replacement".
This is possible using, for example, the core Tie::File module.
However, I recommend that you make a backup copy: that not only acts as a safety net,
but also can be used as a readonly source from which to create your "fixed" full file.
This is what I did.
-
My 'original_full.txt' is identical to the 'full.txt.BU' shown above;
it allows multiple demo and test runs, using the same data, without needing any manual intervention
(do note the comment: "remove for production").
-
Your programs should always use the strict
and warnings pragmata.
The autodie pragma saves you a lot of tedious and error-prone work;
let Perl handle I/O exceptions for you — I use this a lot and highly recommend it.
-
File::Copy is a core module: you'll have it already; no installation from CPAN required.
It's pretty straightforward; I've only used its copy() function.
-
You talked about a "reference file" that you had but didn't show an example.
I just dummied one up (ref.txt) for my use; replace with your version.
-
The output to 'full.txt' has three types:
-
CMD (Command) - an unambiguous solution was found and is written in the original CMD="..." format.
-
QRY (Query) - more than one potential solution was found;
the line number of the original file is shown; requires a decision and manual intervention.
-
WTF (What's This File) - some form of bogus input was detected;
the line number of the original file is shown.
You may want to delete this, update the reference file, or take some other action;
regardless, manual intervention is required here also.
-
Overall, the code is very straightforward and should run on whatever version of Perl you have.
(For future reference, if you tell us what version you're running,
there may be better solutions using more up-to-date features.)
All of the functions and operators that I've used can be found in the
"Perl Online Documentation";
if you get stumped on anything, just ask.
| [reply] [d/l] [select] |
> Perl's string handling functions are typically faster than regexes.
but in this case you can or all possible paths in a regex,
=~ m/^$path1|$path2|...etc/
and because of automatic Trie optimization this will be significantly faster than checking in a loop.
| [reply] [d/l] [select] |
=~ m/^$path1|$path2|...etc/
I know LanX and kcott understand this, but here's a general side note. In a regex expression like the one quoted above, the ^ anchor is associated only with the first alternation, i.e., ^$path1. None of the other alternations are anchored.
The "precedence" of Perl ordered alternation is very low. This applies generally, so in
$str =~ m/ a b c | d | e | f g h /x;
the regex pattern "atoms" a b c comprise the first possible alternation, then d if the first alternation cannot match, then e, then the f g h sequence.
Use grouping, typically non-capturing, to disambiguate precedence. E.g., in
$str =~ m/ a b (?: c | d e | ... | etc) f g /x;
the sequence a b is required for a match, then the first of c or d e or ... or etc, then the required f g sequence.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
"... in this case ... =~ m/^$path1|$path2|...etc/ ... significantly faster than checking in a loop."
That's a use of alternation with which I'm unfamiliar.
Please enlighten me as to how =~ m/^$path1|$path2|...etc/ could be used to generate
groups of multiple QRY... lines, for the same input line, without a loop. E.g.
QRY(4)="/a/bc.sh"
QRY(4)="/ab/c.sh"
| [reply] [d/l] [select] |
this can't be solved exactly if you have ambiguous paths, like
/a/b/cd/e.sh
/a/b/c/de.sh
both will produce
/a/b/cde.sh
But which one is correct?
You will need to check all possible included paths and then if the "remaining" file exists.
Once you have multiple hits, you'll need to use heuristics.
| [reply] [d/l] [select] |
Hi,
You have been given great advice by much cleverer people than me. My two penneth is to make two copies of the orlriginal file and keep them in different safe places until the task that you need the corrected file for has been completed and forgotten.
I am a stupid man, made slightly wiser by bitter experience, it's best to be prepared for another try.
J.C. | [reply] |