in reply to Re: Getting all subpaths from a path
in thread Getting all subpaths from a path

Hi!
You are probably right, it has become an XY problem. I will do my best to explain what I'm trying to do and what led me to open the current question. As I explained in the previous topic (https://www.perlmonks.org/?node_id=11130389), I'm trying to create a bash script "on the fly" from an array of paths. The script does the following three stages:
1. Create the same directory hierarchy.
2. Copy the files.
3. Create the same links.
For that I can do:
1. I can use mkdir -p to create the full hierarchy based on the path.
2. I can use scp/rsync for copying (as it's inside container).
3. I can use ln -s to create the links.
So I wanted to build a structure that will contain all the information (links, directories, files). I came up with the following structure:
{ "/": { "type": "dir", "files": [
  { "usr": { "type": "dir", "files": [
    { "vsa": { "type": "link-dir", "source": "/root/site/tools/gauv" } }
  ] } },
  { "root": { "type": "dir", "files": [
    { "site": { "type": "dir", "files": [
      { "tools": { "type": "dir", "files": [
        { "gauv": { "type": "dir", "files": [
          { "pkgs": { "type": "dir", "files": [
            { "python3": { "type": "dir", "files": [
              { "3.6.3a": { "type": "dir-link", "source": "/usr/vsa/pkgs/python3/3.6.3" } },
              { "3.6.3": { "type": "dir", "links": [
                { "lib": { "type": "dir", "links": [] } },
                { "bin": { "type": "dir", "links": [] } }
              ] } }
            ] } }
          ] } }
        ] } }
      ] } }
    ] } }
  ] } }
] } }
This contains only the path /usr/vsa/pkgs/python3/3.6.3a/bin/python3.6 with its links (as I described in the question). So I will parse each path and build this structure. Once I have this structure, I can extract all of the directories, files, and links (dir links and file links) into arrays and use them to build the bash script (write bash commands based on those paths into a file). That's the purpose of the whole idea.
So my strategy was:
1. Parse each path (by getting all subpaths and links) and insert them into an array of all paths.
2. Check the type of each path in the array (link, file, directory) and insert it into the structure.
3. Extract an array of dir paths, an array of dir links, an array of files, and an array of file links.
4. Iterate over each array and create the bash script.
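
Step 3 of this strategy could be sketched as a recursive walk over the nested structure. This is only an illustration: the sub name collect is hypothetical, it assumes every node has the shape { name => { type => ..., files => [...] } }, and it assumes the type strings used in the code further down ("dir", "file", "dir-link", "link-file").

```perl
use strict;
use warnings;

# Walk the nested structure and sort every entry into a bucket by type.
# Assumption: each node is { name => { type => ..., files => [ ... ] } }.
sub collect {
    my ($node, $prefix, $out) = @_;
    for my $name (sort keys %$node) {
        my $info = $node->{$name};
        my $path = $name eq '/'   ? '/'
                 : $prefix eq '/' ? "/$name"
                 :                  "$prefix/$name";
        # the root itself never needs to be created
        push @{ $out->{ $info->{type} } }, $path unless $path eq '/';
        # descend into children, if there are any
        collect($_, $path, $out) for @{ $info->{files} || [] };
    }
    return $out;
}
```

Calling collect($structure, '', {}) would return a hash of arrays keyed by type, which step 4 can then iterate over.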

I'm having trouble with steps 1-2. In the current node I ask about step 1. I wanted to parse each path and split it into subpaths, then check whether each subpath is a link. If it is, I insert the target of the link into the array and change all of the following subpaths (for example, if I have (/a,/a/b/,/a/b/d) and /a/b->/e/f, then it should become (/a,/a/b/,/e/f,/e/f/d)). I should also handle two special cases here:
1. Relative links - I'm not sure how to handle them currently. I have tried to handle only local relative links like /a/b/c -> /a/b/d, but it's getting complicated.
2. Recursive links (for example /a/b -> /c/d -> e/f ...). That's why I used while(1).
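
Both special cases could be handled by one loop that records every hop of a link chain. The following is a rough sketch, not tested against real data: link_chain and its $reader hook are hypothetical names, and the hook exists only so the logic can be tried without touching the filesystem. It follows links on the last path component only, and relative targets are joined lexically to the link's directory (any ".." stays in the path).

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec::Functions qw(file_name_is_absolute catfile);

# Follow a chain of symlinks starting at $path and return every hop.
# $reader is a hypothetical hook (defaults to readlink); it must return
# the link target, or undef if its argument is not a symlink.
sub link_chain {
    my ($path, $reader) = @_;
    $reader ||= sub { -l $_[0] ? readlink($_[0]) : undef };
    my @chain = ($path);
    my %seen  = ($path => 1);
    while (defined(my $target = $reader->($path))) {
        # a relative target is relative to the directory holding the link
        $target = catfile(dirname($path), $target)
            unless file_name_is_absolute($target);
        last if $seen{$target}++;    # stop on a symlink loop
        push @chain, $target;
        $path = $target;
    }
    return @chain;
}
```

The %seen hash replaces the open-ended while(1): the loop ends either at a non-link or when a target repeats.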


To sum up, those are the big questions:
1. What would be the best design strategy to implement here? Was my idea good?
2. If so, is my suggested structure good enough? How would you change it?


Now, for what you suggested:
1. The idea is to create a bash script that copies the environment into a container. rsync can help here, but only in step 4 (for copying files, instead of scp). I can't use rsync on the whole directory because it would then copy files that are not in the array of paths. Assume the input array of paths is (/a/b/1.file, /a/b/2.file) and there is also a 3.file inside /a/b. I don't want to copy it, only 1.file and 2.file, so rsync on the whole directory won't work here. It can be used to copy individual files (which is the same as scp).
2. I have now tried splitdir and you are right, it's better than splitting by "/". Is there a subroutine that can give me all the subpaths of a path?
3. Yes, it's a good subroutine, but I can't use it yet because I need to parse each path and find out if it's a link. abs_path will give me the final path, but I also want the intermediate links of a chain (as I mentioned before, for /a/b -> /c/d -> e/f, abs_path will just return /e/f and ignore /c/d).
I hope this post clarifies some open questions. If not, I will be more than glad to answer more. I'm sorry if I didn't explain the question well enough before. Thanks for the help so far!
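
As far as I know there is no single core subroutine for "all subpaths of a path", but it can be sketched in a few lines on top of splitdir and catdir from File::Spec::Functions. The name subpaths is made up for this example, and it assumes a Unix-style absolute path without a trailing slash.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(splitdir catdir);

# Return every leading subpath of an absolute path, shortest first.
sub subpaths {
    my ($path) = @_;
    my @parts = splitdir($path);    # ('', 'a', 'b', 'd') for /a/b/d
    # join the first 2, 3, ... components back together
    return map { catdir(@parts[0 .. $_]) } 1 .. $#parts;
}

print join(", ", subpaths('/a/b/d')), "\n";
```

Each returned subpath can then be tested with -l, -f, and -d in turn.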


Also, here is some more code that I wrote while trying to make it work (just for reference; sorry that it's messy and has bad variable names):

foreach my $f (@arr) {
    if (-l $f) {
        print($f, " is a link to ", readlink($f), "\n");
        my @a = split("/", $f);
        my $result;
        my $counter = 0;
        my $last_files_block = $st{"/"}{"files"};
        while (1) {
            unless ($counter < scalar(@a)) { last; }
            my $x = $a[$counter];
            if ($x eq '') { $counter += 1; next; }
            if ($counter + 1 == scalar(@a)) {
                # last path component: the link itself
                if (-f $f) {
                    my $found = 0;
                    foreach my $v (@{$last_files_block}) {
                        if (defined($v->{$x})) { $found = 1; last; }
                    }
                    if ($found == 0) {
                        my %vsaaa = ("type" => "link-file", "source" => readlink($f));
                        my %st1 = ($x => \%vsaaa);
                        push(@{$last_files_block}, \%st1);
                    }
                    # follow the chain of links
                    my $last = $f;
                    while (1) {
                        my $c = readlink($last);
                        if (-l $c) {
                            $last = $c;
                            if (index($c, "/") != -1) {
                                push(@arr, $c);
                            } else {
                                my $found1 = 0;
                                foreach my $v (@{$last_files_block}) {
                                    if (defined($v->{$x})) { $found1 = 1; last; }
                                }
                                if ($found1 == 0) {
                                    my %vsaaa = ("type" => "file");
                                    my %st1 = ($x => \%vsaaa);
                                    push(@{$last_files_block}, \%st1);
                                }
                            }
                        } else {
                            if (index($c, "/") != -1) {
                                push(@arr, $c);
                            } else {
                                my $found1 = 0;
                                foreach my $v (@{$last_files_block}) {
                                    if (defined($v->{$x})) { $found1 = 1; last; }
                                }
                                if ($found1 == 0) {
                                    my %vsaaa = ("type" => "file");
                                    my %st1 = ($x => \%vsaaa);
                                    push(@{$last_files_block}, \%st1);
                                }
                            }
                            last;
                        }
                    }
                }
                if (-d $f) {
                    my $found = 0;
                    foreach my $v (@{$last_files_block}) {
                        if (defined($v->{$x})) { $found = 1; last; }
                    }
                    if ($found == 0) {
                        my $n = readlink($f);
                        if (index($n, "/") == -1) {
                            my $dirname = dirname($f);
                            $n = "$dirname/$n"; #TODO: what if relative?
                        }
                        my %vsaaa = ("type" => "dir-link", "source" => $n);
                        my %st1 = ($x => \%vsaaa);
                        push(@{$last_files_block}, \%st1);
                    }
                }
                last;
            }
            # intermediate component: descend, creating the dir if needed
            my $found = 0;
            foreach my $v (@{$last_files_block}) {
                if (defined($v->{$x})) {
                    $last_files_block = $v->{$x}{"files"};
                    $counter += 1;
                    $found = 1;
                    last;
                }
            }
            if ($found == 0) {
                my %vsaaa = ("type" => "dir", "files" => []);
                my %st1 = ($x => \%vsaaa);
                push(@{$last_files_block}, \%st1);
                $last_files_block = $vsaaa{"files"};
                $counter += 1;
            }
        }
    } elsif (-f $f) {
        print($f, " is a file\n");
        my @a = split("/", $f);
        my $result;
        my $counter = 0;
        my $last_files_block = $st{"/"}{"files"};
        while (1) {
            unless ($counter < scalar(@a)) { last; }
            my $x = $a[$counter];
            if ($x eq '') { $counter += 1; next; }
            if ($counter + 1 == scalar(@a)) {
                my $found = 0;
                foreach my $v (@{$last_files_block}) {
                    if (defined($v->{$x})) { $found = 1; last; }
                }
                if ($found == 0) {
                    my %vsaaa = ("type" => "file");
                    my %st1 = ($x => \%vsaaa);
                    push(@{$last_files_block}, \%st1);
                }
                last;
            }
            my $found = 0;
            foreach my $v (@{$last_files_block}) {
                if (defined($v->{$x})) {
                    $last_files_block = $v->{$x}{"files"};
                    $counter += 1;
                    $found = 1;
                    last;
                }
            }
            if ($found == 0) {
                my %vsaaa = ("type" => "dir", "files" => []);
                my %st1 = ($x => \%vsaaa);
                push(@{$last_files_block}, \%st1);
                $last_files_block = $vsaaa{"files"};
                $counter += 1;
            }
        }
    } elsif (-d $f) {
        print($f, " is a dir\n");
        my @a = split("/", $f);
        my $result;
        my $counter = 0;
        my $last_files_block = $st{"/"}{"files"};
        while (1) {
            unless ($counter < scalar(@a)) { last; }
            my $x = $a[$counter];
            if ($x eq '') { $counter += 1; next; }
            my $found = 0;
            my $found_link = 0;
            foreach my $v (@{$last_files_block}) {
                if (defined($v->{$x})) {
                    if ($v->{$x}{"type"} eq "dir-link" || $v->{$x}{"type"} eq "link-file") {
                        $found_link = 1;
                        last;
                    }
                    $last_files_block = $v->{$x}{"files"};
                    $counter += 1;
                    $found = 1;
                    last;
                }
            }
            if ($found_link == 1) { last; }
            if ($found == 0) {
                my %vsaaa = ("type" => "dir", "files" => []);
                my %st1 = ($x => \%vsaaa);
                push(@{$last_files_block}, \%st1);
                $last_files_block = $vsaaa{"files"};
                $counter += 1;
            }
        }
    } else {
        #TODO: When can it happen other than path does not exist or permission denied?
        print($f, " is a special\n");
    }
}

Re^3: Getting all subpaths from a path
by haukex (Archbishop) on Apr 02, 2021 at 08:39 UTC

    I think the significant bit of information that was missing previously (the "X" in the XY Problem) is what you mentioned here: "I'm trying to create a Singularity recipes builder." By this I'm guessing you mean Singularity, and their "Recipes" to build containers, more specifically, something you can execute in their Singularity file %post section (which gets executed with /bin/sh) to build the container?

    By "recipes builder", do you mean you want to write a Perl script that will generate commands that can be executed by /bin/sh to reproduce a certain environment (directory structure, links, etc.)? In other words, you want to write a Perl script that will generate a sequence of mkdir -p commands, followed by cp commands, followed by ln -s commands, such that when Singularity builds the container and executes the script containing these commands, those dirs/links/files will be present in the generated squashfs image?

    (By the way, why not use the built-in %files section?)

    Note that the fact that I had to deduce all this means you need to describe your task better :-) Remember to explain the "X" you're trying to accomplish, plus sample input and the expected output for that input - something like a high-level SSCCE.

    You haven't shown your input, which I am guessing is the filesystem that you want to mirror into the container? One way you could provide an SSCCE for us is to give us a list of commands to recreate the directory structure.

    You also haven't shown your expected output, i.e. the /bin/sh script you want to produce.

    Interesting: Note that both input and output are basically the same thing!

    So if I'm correct with all my guesses so far, the problem can be more or less reduced to: a Perl script that will basically round-trip a /bin/sh script containing mkdir, cp, and ln commands.

    However, since that's a lot of guessing, I'm going to stop here for now - please let us know if the above is correct or not, and if not, what it is you're actually trying to do. (Also, looking over choroba's sample code, it looks like a good starting point.)

      Yes, I'm trying to create those recipes on the fly. The user gives me all the paths that he thinks are needed to run the tool inside the container (he gives a file that contains those paths and I read them into an array). With those paths I can build the recipe. In the %setup section I will create the directories, in the %files section I will copy the files, and in the %post section I will create the links. So I don't really want to create a shell script; I want to build the recipe with Perl. But I didn't want to talk about Singularity because I guessed most people here are not familiar with it. So I tried to simplify it to creating a shell script (aka the recipe) that creates those directories, copies the files, and creates the links.
      So now that we are talking about recipes: the purpose of the Perl script is to build the recipe, based on all the paths that the user thinks are needed for running his tool in the container. So the input is really the paths, as I explained, and the output is the recipe (aka the shell script).
      So my question still remains. Given the paths, I want to build some structure from which I could easily extract all the files/links/directories and use them for creating the recipe file. If you think there is a better way of creating it, I'm all ears.
      choroba's answer is a good start, but I had some questions that I commented under it.
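
To make the intended output concrete, the recipe generation described above could look roughly like this. It is a sketch under assumptions: build_recipe and sh_quote are made-up names, the Bootstrap/From header is a placeholder, and it assumes that %setup runs on the host with ${SINGULARITY_ROOTFS} pointing at the container root and that %files takes "source destination" pairs.

```perl
use strict;
use warnings;

# Minimal single-quote shell escaping (String::ShellQuote is the more
# thorough choice for real use).
sub sh_quote { my $s = shift; $s =~ s/'/'\\''/g; return "'$s'" }

# Build the text of a definition file from three lists.
# $links holds [link_path, target] pairs, already in creation order.
sub build_recipe {
    my ($dirs, $files, $links) = @_;
    my @out = ("Bootstrap: docker", "From: centos:7", "");  # placeholder header
    push @out, "%setup";
    push @out, '    mkdir -p "${SINGULARITY_ROOTFS}"' . sh_quote($_) for @$dirs;
    push @out, "", "%files";
    push @out, "    $_ $_" for @$files;    # same path inside the container
    push @out, "", "%post";
    push @out, "    ln -s " . sh_quote($_->[1]) . " " . sh_quote($_->[0])
        for @$links;
    return join("\n", @out) . "\n";
}

print build_recipe(['/a/b'], ['/a/b/1.file'], [['/a/l', '/a/b']]);
```

Paths in the %files lines are written unquoted here, so this sketch would not cope with whitespace in file names.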

        Thanks for the clarification. It's important to know because it tells us the restrictions you're working under, i.e. why the "just use rsync/tar/shar" suggestions weren't what you were looking for (though they could still be used...). I think that's led to some confusion in this thread so far. Anyway:

        User gives me all the paths that he thinks are needed to run the tool inside the container (he gives a file that contains those paths and I read them into an array).

        An important question here is: Would it be correct to assume you have access to the filesystem where these files are located? In other words, the Perl script, the input list of files, and the files themselves are all on the same machine? It's also still unclear to me if you want to mirror the files exactly as they are on the host machine, or if you want to manipulate the paths in any way?

        Again, showing us with code is best, like choroba did in his Makefile. You also still haven't shown what format this list provided by the user looks like. Note that these things will also significantly benefit you in your development since they are at the same time test cases. In other words, the IMHO better way to ask the questions you asked is if you add them as test cases to the SSCCE that choroba provided.

        I think your main concern seems to be this: if a user provides a path that includes a symlink, you want to make sure not to copy only that symlink, but also the target it points to, so that the symlink doesn't end up broken in the container. Another thing that is still unclear to me in this context is whether it is acceptable to you to rewrite any of the symlinks you encounter - for example, potential solutions could rewrite all relative symlinks to absolute ones, or symlink chains could be reduced by simply creating copies.
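
As an illustration of the "rewrite relative symlinks to absolute ones" option, here is a minimal, purely lexical sketch. The name absolute_target is hypothetical, and the approach assumes that no directory on the way to the link is itself a symlink; if one is, collapsing ".." lexically gives the wrong answer and the link has to be resolved on the filesystem instead.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec::Functions qw(file_name_is_absolute catdir splitdir);

# Turn a symlink target into an absolute path without touching the disk.
sub absolute_target {
    my ($link, $target) = @_;
    return $target if file_name_is_absolute($target);
    my @parts;
    # a relative target starts from the directory that holds the link
    for my $part (splitdir(catdir(dirname($link), $target))) {
        next if $part eq '' or $part eq '.';
        if ($part eq '..') { pop @parts } else { push @parts, $part }
    }
    return '/' . join('/', @parts);
}

print absolute_target('/tmp/bar/two', '../foo/one'), "\n";
```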

        Anyway, what I've done here is construct an example that I think demonstrates what you're asking about.

        mkdir -p /tmp/bar /tmp/foo
        touch /tmp/foo/one
        ln -fns /tmp/bar /tmp/foo/quz
        ln -fns ../foo/quz /tmp/bar/baz
        ln -fns ../foo/one /tmp/bar/two

        In this example, the issue is that if the user were to specify only the path /tmp/bar/baz/two, then you need to figure out that all of /tmp/foo/{one,quz} and /tmp/bar/{two,baz} need to be reconstructed in order for the link to be valid. Here's my attempt at solving this; the tricky bit turned out to be figuring out the dependency chain for the symlinks. sub resolvesymlink is extracted from my script that I linked you to earlier (that includes tests so I'm fairly confident it's decent code, keeping in mind what I said). Note how this essentially does what I said above: round-trip the commands needed to recreate a directory structure.

        Disclaimer: I've so far only tested it for the above test case plus a few variations. Use at your own risk. Though I do hope it's a starting point.

        #!/usr/bin/env perl
        use warnings;
        use strict;
        use File::Basename 'fileparse';
        use Cwd qw/getcwd abs_path/;
        use File::Spec::Functions qw/ splitdir catdir catfile
            file_name_is_absolute rel2abs rootdir /;
        use String::ShellQuote 'shell_quote';
        use Graph;

        my @queue = @ARGV;

        # gather all dirs, files, and links
        my (%dirs,%files,%links);
        while ( my $targ = shift @queue ) {
            $targ = rel2abs($targ);
            die "does not exist: $targ" unless -e $targ;
            my @path = splitdir($targ);
            for my $i (1..$#path) {
                my $cur = catdir(@path[0..$i]);
                if ( -l $cur ) {
                    defined( $links{$cur} = readlink($cur) )
                        or die "readlink $cur: $!";
                    # enqueue everything in the link chain
                    # (excluding already seen symlinks)
                    push @queue, grep { !$links{$_} } resolvesymlink($cur);
                }
                elsif ( -f $cur ) { $files{abs_path($cur)}++ }
                elsif ( -d $cur ) { $dirs{abs_path($cur)}++ }
                else { warn "skipping $cur, unknown type" }
            }
        }

        # simplify the dirs to shorten the mkdir command
        my $dg = Graph->new;
        for my $d (keys %dirs) {
            my @s = splitdir($d);
            $dg->add_edge(catdir(@s[0..$_]), catdir(@s[0..$_-1])) for 1..$#s;
        }
        # exterior vertices = leaves of the tree
        my @dirs = grep { $_ ne rootdir } sort $dg->exterior_vertices;
        print "mkdir -p ",shell_quote(@dirs),"\n" if @dirs;

        # output the files
        print "touch ",shell_quote(sort keys %files),"\n" if %files;

        # determine dependencies in symlinks via a topological sort
        my $lg = Graph->new;
        for my $l (keys %links) {
            my @res = resolvesymlink($l);
            die "unexpected resolvesymlink($l)" if @res<2;
            $lg->add_edge($l, $res[1]); # link depends on its target
            my @s = splitdir($l);
            for my $i (reverse 1..$#s-1) {
                my $d = catdir(@s[0..$i]);
                # if there's a link in the paths, this link depends on it too
                $lg->add_edge($l, $d) if defined $links{$d};
            }
        }
        my @links = reverse grep { defined $links{$_} } $lg->topological_sort;
        print "ln -snf ",shell_quote($links{$_}, $_),"\n" for @links;

        # from https://bitbucket.org/haukex/htools/src/master/relink (a500e09)
        sub resolvesymlink {
            my $file = shift;
            die "not absolute: $file" unless file_name_is_absolute($file);
            my @files;
            my $origwd = getcwd;
            my $rv = eval { # in eval so orig working dir is always restored
                my $f = $file;
                while (1) {
                    my $dir;
                    ($f,$dir) = fileparse($f);
                    last unless -d $dir;
                    chdir $dir or die "chdir $dir: $!";
                    push @files, catfile(getcwd,$f);
                    last unless -l $f;
                    defined( $f = readlink $f )
                        or die "readlink $f (cwd=".getcwd."): $!";
                }
            1 };
            my $err = $@||'unknown error';
            chdir $origwd or die "chdir $origwd: $!";
            die $err unless $rv;
            return @files ? @files : ($file);
        }
        __END__

        mkdir -p /tmp/bar /tmp/foo
        touch /tmp/foo/one
        ln -snf /tmp/bar /tmp/foo/quz
        ln -snf ../foo/one /tmp/bar/two
        ln -snf ../foo/quz /tmp/bar/baz
        ln -snf ../foo/one /tmp/bar/baz/two

        Update: Added the if @dirs and if %files to the two prints.

Re^3: Getting all subpaths from a path
by ovedpo15 (Pilgrim) on Mar 31, 2021 at 13:00 UTC
    Can someone suggest a strategy for solving it? I tried some other similar things, but it got too complicated and failed.
      You cannot create the links before both the source and target directories exist. So, postpone their creation to the end. The directories are processed from the shortest to the longest, so we always know the parent path exists. You should also add a check that a link doesn't point outside of the given directory tree.
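
The two rules above (parents before children, and no links pointing outside the tree) can be sketched as two small helpers. The names parents_first and inside_tree are made up for this illustration, and inside_tree assumes both paths are absolute and already free of ".." components.

```perl
use strict;
use warnings;

# Sort paths so that every parent comes before its children:
# a parent path is always strictly shorter than any path below it.
sub parents_first {
    return sort { length $a <=> length $b or $a cmp $b } @_;
}

# True if $target lies inside the directory tree rooted at $root.
sub inside_tree {
    my ($root, $target) = @_;
    $root =~ s{/+\z}{};    # ignore a trailing slash on the root
    return $target eq $root || index($target, "$root/") == 0;
}

print "$_\n" for parents_first('/a/b/c', '/a', '/a/b');
```

Comparing against "$root/" rather than $root alone keeps /a/bc from being mistaken for something inside /a/b.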

      I created the following Makefile to experiment with your data:

      And this was the script 1.pl:

      #!/usr/bin/perl
      use warnings;
      use strict;
      use feature qw{ say };

      use Path::Tiny qw{ path };

      my $path = path(shift);
      my $paths = $path->visit(
          sub {
              $_[1]->{$_} = (-f) ? 'file'
                          : (-l) ? [readlink]
                          : (-d) ? 'dir'
                          :        'unknown'
          },
          { recurse => 1 }
      );

      my @links;
      for my $found (sort { length $a <=> length $b } keys %$paths) {
          if ('file' eq $paths->{$found}) {
              say qq(cp '$found' "\$target/$found");
          } elsif ('dir' eq $paths->{$found}) {
              say qq(mkdir "\$target/$found");
          } elsif (ref [] eq ref $paths->{$found}) {
              my $to = path($paths->{$found}[0]);
              $to = $to->relative(path($found)->absolute->parent)
                  if $to->is_absolute;
              my $source = path('$target', $found);
              push @links, qq(ln -s "$to" "$source");
          }
      }
      say for @links;

      It's just a toy. Use some of the ShellQuote modules to fix the filenames; but for the example given, it works.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Hi choroba! Thank you for your input as always. I have a few questions:
        1. As I understand it, the first part does an opendir (somewhere) on the path my $path = path("/usr/vsa/pkgs/python3/3.6.3a/bin/python3.6"); and inserts all of the files inside the directory into $paths, but I'm interested only in the path itself. The input of the script is an array of paths, and only those paths interest me. So if, for example, /usr/vsa/pkgs/python3/3.6.3a/bin/python3.6a exists (note the "a" at the end) but is not in the array of paths, I don't want to copy it (only /usr/vsa/pkgs/python3/3.6.3a/bin/python3.6; see more info at (*)).
        2. It does an opendir for some reason, so it fails for files (the path /usr/vsa/pkgs/python3/3.6.3a/bin/python3.6 is a file and it fails with: Error opendir on '/usr/vsa/pkgs/python3/3.6.3a/bin/python3.6': Not a directory at line 113). Line 113 is "{ recurse => 1 }".
        3. It looks like path can't handle some symlinks. I have a link /a -> /c/d/e and path("/a") returns nothing, while path("/a/b") returns paths under "/c/d/e/b".
        4. Also, what if there is a chain of links /a/ -> /b/ -> /c? Should it work?

        The cp/mkdir/ln part I will do myself. I'm just struggling to build the arrays of paths that are involved (and only them), split into categories (file-links, dir-links, files, dirs).

        * A bit more explanation about the "involved paths": I have an array @array=("/usr/vsa/pkgs/python3/3.6.3a/bin/python3.6", ...). I want to iterate over this array and mkdir the directories that exist in the array (recursively), copy the files (not the directories but I want to have only the paths that are located in @array) and set the same links. Think of that as a container which you want to run your tool on and you know those paths are needed for running your tool (but any other path is not needed). So if your tool uses /a/b/c.file and under /a/b you also have /a/b/d.file, then it will create /a/b directory and copy /a/b/c to $target/a/b/c (and set links if needed). If I wanted to copy all of the files under directory, I would just copy the directory (instead of mkdir and cp files under directory).
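
The "only the involved paths" rule from (*) can be sketched without reading any directory: the directories to create are just the ancestors of the listed files, so a sibling such as /a/b/d.file never enters the picture. The name needed_dirs is hypothetical, and the sketch ignores symlinks entirely.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Given the listed file paths, return the set of directories to create.
sub needed_dirs {
    my %dirs;
    for my $file (@_) {
        my $dir = dirname($file);
        # climb towards the root, stopping at an already recorded directory
        while ($dir ne '/' and $dir ne '.' and !$dirs{$dir}++) {
            $dir = dirname($dir);
        }
    }
    return sort keys %dirs;
}

print "$_\n" for needed_dirs('/a/b/c.file', '/a/b/1.file');
```

The files to copy are then exactly the entries of the input array, and nothing else under /a/b is touched.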

      It seems like you are trying to re-invent 'tar'.

        Hi tybalt89, thanks for the comment. Are you talking about a tarball (tar)? Then no. I'm trying to create a Singularity recipes builder. So I want to have the same "environment" (files/dirs/links) as outside. In other words, the script should create a def file that creates the directories, copies the files, and creates the links, but only for the paths I gave it. If you are not familiar with Singularity, it's similar to Docker containers, but you can copy stuff from your work area.

      Sorry, I've been quite busy, but I hope to get a chance to read your post tomorrow.

        That's ok! My comment was for all the Monks :) (Sorry that it's under your post, my detailed explanation is in this thread).