ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have a utility which gets a list of different paths and creates an instructions file. It works great but I got into scaling problem when trying to add new features so I'm trying to rethink my strategy. That instructions file is bunch of commands that can be divided in four sections. In the first second it creates the directories using "mkdir -p", in the second section it copies areas into the created directories using "cp -r", in the third sections it copies files using "cp" and in the fourth section it creates links using "ln -s".
Consider the following example:
# Create dirs mkdir -p $IN/a/b/c mkdir -p $IN/a/b/d mkdir -p $IN/x/y/z # Copy areas cp -r $OUT/a/b/c/dep_dir1 $IN/a/b/c cp -r $OUT/a/b/d/dep_dir2 $IN/a/b/d # Copy files cp $OUT/x/y/z/dep_file $IN/x/y/z # Create links ln -s $IN/a/b/c $IN/p
Note that $OUT represents the "outside world" where $IN represents the "inside world". You could think of them as two different directories.

I'm trying to figure a proper data structure that will contain some sort of that filesystem representation (without $IN/$OUT). The idea is that you could take a look at this data structure, and see which dependency is under which area, which path points to another path, etc. My thoughts are pointing me to Graph data structure where each vertex is a directory/file/link/area. If it's a file, then it does not have any dependencies under it. If it's a link then it has one edge to another vertex. If it's a directory then it could point to zero or more vertices. Note that the edge that comes from a directory is different then the edge that comes from a link. Now, area is a special type of vertex so it should have some flag aside it. The definition of an "area" is a directory which user provided in a custom list of dirs. In that case, it will copy the whole area, and not just the files.
Basically I want to be able to do the following queries on the data structure (to create the above example):
1. Insert a link/file/directory into the DS.
2. Get a list of all the directories (for the first sections). For example, I could do something like print $fh "mkdir -p $_" foreach (get_dirs($DS)). 3. Get a list of all the areas (for the second section).
4. Get a list of all the files (for the third section).
5. Get a list of the links (for the fourth section). This part is a bit tricky because one link could be dependent on another one and therefore should be set first.

As I mentioned I already have some initial working script (which is very long and complicated). But the general idea is to split the paths into four hashes (the keys are the path and the values are just the number 1 just to make it easier to see if the path already inside the hash). But This way it is really not scalable. For example, I want to add a new feature where I can use virtual paths instead of logical paths. For example, in the example above, I could replace /p with /a/b/c in all locations.

Since it points me to Graph DS, I was looking at the Graph package. But the problem is that files/dirs are not unique. For example /a/b/c/file1 and /x/y/z/file1 are two completely different files but with the same name. Also, I want to support relative paths in the target of the links such as /a/b/c/d -> ../../x/y/z. How can I represent the filesystem in a proper DS?

EDIT: Just to make it more clear, please consider the following example:
/a/b/c/d/file1 /x/y/z/link1 -> file1 /x/y/w/area1 /u/v -> /a/b/c
In that case, you get the following graph: https://pasteboard.co/ua4n0o3nwVOY.png (Not sure how to paste image here since img tag is not supported). I was wondering if creating a class called PathsTree that contains one node which represents the root and another class called node. Each time I add a new path, I'll have to split the path by using splitdir and the right node. Does it sound like a valid way?

Replies are listed 'Best First'.
Re: How to represent the filesystem in DS?
by Fletch (Bishop) on Mar 26, 2022 at 18:21 UTC

    I'd try and track down a copy of The Design and Implementation of the 4.4BSD Operating System ( ISBN 9780768684940 ), specifically Chapter 7: Local Filesystems. While that's (probably) more than you're trying to do, understanding how UFS actually is implemented and the data structures they used can't hurt and should give relevant inspiration.

    For instance WRT to your last question of different files with the same name the way I'd approach that would be similar to how most *NIX FSen do it: separate directory entries and naming references from content stored in inodes. You'd have one inode with id XYZ that the name /a/b/c/file1 points to (however), and a different inode of contents DEF that's referenced from the name /x/y/z/file1. If you really wanted to get kind of fancy you could (maybe) use a hash (e.g. sha1 of the contents) as your "inode number" and then you could (for instance) deduplicate things by just creating links to the same underlying storage (instantiated by creating hard links in your real FS when you actually create whatever you're modeling).

    Edit: tweaked phrasing about hash. Also another reference for ext2 from Linux.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Thank you for the reply!
      I'm actually familiar with inodes and the way they work (from my OS course back at the uni). Not sure but feels like an overkill (since I just want to keep those paths in a simple structure). Also, is there some model that does something similar maybe? Tried to look in cpan and didn't find anything interesting, but I might missed it.

        Now when I'm saying "inode" that's probably a bad name for it because no you don't really need to do a full "inode" with indirect/double indirect pointers to storage blocks and what not (you can just cheat and use your actual underlying filesystem for storage and keep just a path to the contents instead which you read into the destination as required when you actually need to realize things).

        As for existing modules, there might be some helper stuff in something meant to be used with Fuse to implement a filesystem but that's a pure guess. As you've seen Graph isn't quite going to do it because you'd (maybe; warning ENOCAFFEINE and this is off the cuff) want to store attributes (the name a target node is to be referenced as) on the edges. Something like REST::Neo4p leveraging a graph database would get you closer to that model but I think that'd would be even worse overkill.

        I'd just start simple with a Directory instance which has many DirectoryEntries, which have a name and point to a target. A target would be maybe either a FileContents instance containing, erm, the contents (or a pointer to a real file with) or a SymbolicLink being the name of the target. Once you've got those work from there. Edit: Derp, just saw your edit to the original which is along similar lines; so yeah that's reasonable.

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

Re: How to represent the filesystem in DS?
by etj (Priest) on Apr 04, 2022 at 14:21 UTC
    Using the Graph module would save some code. The only problem you'd face is in making vertex (aka node) names which need to be unique, and you could probably solve that by making them be the full absolute path, which in filesystem semantics is unique. Then you'd use vertex and/or edge attributes for the other things, e.g. vertex "type" which could be "file", "dir", "link".

    You'd need to have logic to avoid "dir" having "contain" edges to "file"/"link"/"dir" vertices that didn't have the right names. An alternative is to just have inode numerical-ID type vertex-IDs, with a "name" attribute, and have the semantics even closer to a real filesystem.

    It's not so much that Graph can solve all your problems, but it would allow you to simply ignore/take for granted some of them.

      Some notes on names vs. IDs:

      Using the absolute path to identify a file is what happens e.g. with a FAT filesystem.

      Many filesystems that have a Unix history (Unix File System, MINIX file system, Extended file system family, ...) identify each file by a numeric ID. (See also inode.) Directories are just special files that associate names with file IDs.

      One quite obvious difference is that the file ID does not change when you move or rename a file.

      Another major difference of the numeric ID is that files can have zero or more names. Having no name for a file is useful for private temporary files. There is simply no way to access them except by having an open file handle. Having more than one name for a file is generally known as hardlink, and allows to refer the same file from different directories using different names. This is used e.g. by BusyBox.

      On FAT filesystems, having more than one name for the same file is generally considered a filesystem error.

      (Note that Windows Long filename support a.k.a. VFAT adds an optional alias for each part of the path. Both the short and long filenames still have to be unique. Both can be used interchangeably for each part of the path.)

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: How to represent the filesystem in DS?
by ovedpo15 (Pilgrim) on Apr 03, 2022 at 06:56 UTC
    So I started working on a tree-like DS for my data. I would really appreciate to get a small code review! (more about the approach then the code itself)
    I'll explain what I did first: Basically I have four classes. The FileNode represents a file , DirNode represents a directory, LinkNode represents a link and PathsTree represents the tree of nodes. Dir node has a hash of nodes that it points to. The Link node has one node that it points to. Now, I also added display_tree method for debugging purposes (you can't ignore it). The code of LinkNode:
    package LinkNode; sub new { my ($class,$node_name,$node_target,$link_type) = @_; my $self = { "nodes" => {}, "type" => $link_type, "name" => $node_name, "target" => $node_target }; bless $self, $class; return $self; } sub get_child_node_by_name { my ($self,$node_name) = @_; if (defined($self->{"nodes"}{$node_name})) { return $self->{"nodes"}{$node_name}; } return undef; } sub display_tree { my ($self,$prefix) = @_; print($prefix." ".$self->{"name"}." -> ".$self->{"target"}."\n"); } 1;
    The code of FileNode:
    package FileNode; sub new { my ($class,$node_name) = @_; my $self = { "type" => "file", "name" => $node_name }; bless $self, $class; return $self; } sub display_tree { my ($self,$prefix) = @_; print($prefix." ".$self->{"name"}."\n"); } 1;
    The code of DirNode:
    package DirNode; use file_node qw(:all); use link_node qw(:all); sub new { my ($class,$node_name,$node_type) = @_; my $self = { "nodes" => {}, "type" => $node_type, "name" => $node_name }; bless $self, $class; return $self; } sub get_child_node_by_name { my ($self,$node_name) = @_; if (defined($self->{"nodes"}{$node_name})) { return $self->{"nodes"}{$node_name}; } return undef; } sub add_child_node { my ($self,$node_name,$node_type) = @_; if ($node_type eq "dir") { my $new_node = new DirNode($node_name); $self->{"nodes"}{$node_name} = $new_node; return $new_node; } elsif ($node_type eq "file") { my $new_node = new FileNode($node_name); $self->{"nodes"}{$node_name} = $new_node; return $new_node; } return undef; } sub add_child_link_node { my ($self,$node_name,$node_target,$node_type) = @_; my $new_node = new LinkNode($node_name,$node_target,$node_type); $self->{"nodes"}{$node_name} = $new_node; return $new_node; } sub get_dirs { my ($self) = @_; my $path = $self->{"name"}; my %paths; foreach my $current_node (values(%{$self->{"nodes"}})) { if ($current_node->isa('DirNode')) { my %subpaths = $current_node->get_dirs(); foreach my $subpath (keys(%subpaths)) { my $updated_path = catdir($path,$subpath); $paths{$updated_path} += 1; } } } if (scalar(keys(%paths)) == 0) { $paths{$path} += 1; } return %paths; } sub get_files { my ($self) = @_; my $path = $self->{"name"}; my %paths; foreach my $current_node (values(%{$self->{"nodes"}})) { if ($current_node->isa('DirNode')) { my %subpaths = $current_node->get_files(); foreach my $subpath (keys(%subpaths)) { my $updated_path = catdir($path,$subpath); $paths{$updated_path} += 1; } } elsif ($current_node->isa('FileNode')) { my $updated_path = catdir($path,$current_node->{"name"}); $paths{$updated_path} += 1; } } return %paths; } sub get_links { my ($self) = @_; my $path = $self->{"name"}; my %paths; foreach my $current_node (values(%{$self->{"nodes"}})) { if ($current_node->isa('DirNode')) { my %subpaths = $current_node->get_links(); foreach my $subpath (keys(%subpaths)) { my $updated_path = catdir($path,$subpath); $paths{$updated_path} = $subpaths{$subpath}; } } elsif ($current_node->isa('LinkNode')) { my $target_path = $current_node->{"target"}; my $updated_path = catdir($path,$current_node->{"name"}); $paths{$updated_path} = $target_path; } } return %paths; } sub display_tree { my ($self,$prefix) = @_; print($prefix." ".$self->{"name"}."\n"); foreach my $current_node (values(%{$self->{"nodes"}})) { $current_node->display_tree($prefix."-"); } } 1;
    And the most important part - the code of PathsTree:
    package PathsTree; use strict; use dir_node qw(:all); use Cwd qw(cwd abs_path getcwd); use File::Basename qw(fileparse); use File::Spec::Functions qw(file_name_is_absolute splitdir catfile ca +tdir); sub new { my ($class) = @_; my $self = { _root_node => new DirNode('/',"dir") }; bless $self, $class; return $self; } sub display_tree { my ($self) = @_; return $self->{"_root_node"}->display_tree("-"); } sub get_links { my ($self) = @_; return $self->{"_root_node"}->get_links(); } sub get_files { my ($self) = @_; return $self->{"_root_node"}->get_files(); } sub get_dirs { my ($self) = @_; return $self->{"_root_node"}->get_dirs(); } sub get_node_by_path { my ($self,$path) = @_; my @subpaths = splitdir($path); my $prev_node = $self->{_root_node}; for my $index (0..$#subpaths) { next if (@subpaths[$index] eq ""); my $current_node_name = @subpaths[$index]; my $current_child = $prev_node->get_child_node_by_name(@subpat +hs[$index]); unless ($current_child) { return undef; } $prev_node = $current_child; } return $prev_node; } sub add_path { my ($self,$path) = @_; my @subpaths = splitdir($path); my $prev_node = $self->{_root_node}; for my $index (0..$#subpaths) { next if (@subpaths[$index] eq ""); # Ignore empty strings my $current_path = catdir(@subpaths[0..$index]); my $current_node_name = @subpaths[$index]; if (-l $current_path) { my @resloved_links = resolve_symlink($current_path); foreach my $link (@resloved_links) { if ($link eq $current_path) { next; } $self->add_path($link); } my $current_link_target = readlink($current_path); my $child_node = $self->get_node_by_path($current_path); if ($child_node) { $prev_node = $child_node; } else { $child_node = $prev_node->add_child_link_node($current +_node_name,$current_link_target,"regular_link"); $prev_node = $child_node; } $prev_node = $self->get_node_by_path($current_link_target) +; } elsif (-f $current_path) { my $child_node = $self->get_node_by_path($current_path); if ($child_node) { $prev_node = $child_node; } else { $child_node = $prev_node->add_child_node($current_node +_name,"file"); $prev_node = $child_node; } } elsif (-d $current_path) { my $child_node = $self->get_node_by_path($current_path); if ($child_node) { $prev_node = $child_node; } else { $child_node = $prev_node->add_child_node($current_node +_name,"dir"); $prev_node = $child_node; } } else { log("Ignoring path with unknown type: $current_path"); } } } # TODO: Move to a separated file sub resolve_symlink { my ($file) = @_; unless (file_name_is_absolute($file)) { log("Failed to resolve link $file"); return; } my @files; my $origwd = getcwd; my $rv = eval { my $f = $file; while (1) { my $dir; ($f,$dir) = fileparse($f); last unless (-d $dir); chdir $dir or die "chdir $dir: $!"; push @files, catfile(getcwd,$f); last unless (-l $f); defined( $f = readlink $f ) or die "readlink $f (cwd=".get +cwd."): $!"; } 1 }; my $err = $@ || 'unknown error'; chdir $origwd or (log("Failed to chdir $origwd: $!") && return); die $err unless ($rv); return @files ? @files : ($file); } 1;
    Test case:
    my $paths_tree = new PathsTree(); foreach my $path (keys(%paths)) { $paths_tree->add_path($path); } $paths_tree->display_tree(); my %dirs = $paths_tree->get_dirs(); my %files = $paths_tree->get_files(); my %links = $paths_tree->get_links();
    My thoughts about my code: I got very entangled with the links part. The whole idea was to make it scalable and it's got difficult to follow the code in the "link-handler" section of add_path method. Also, I probably should use here inheritance - some Node class with basic methods which DirNode, FileNode and LinkNode inherit from.

    My code probably contains bugs but I ask you this - is it the right approach? If you remember from the intro, the main feature that I want to add is to use virtual paths instead of logical. So once I done building the graph, I want to provide a list of virtual paths. For example, in the first example that I provided above, if user gave me /p then I should replace every occurrence of /a/b/c with /p in the graph. Do you think it will not be hard to do with my current approach?

    P.S. originally this post was part of the main post but as suggested, I moved it to a separated reply.
Re: How to represent the filesystem in DS?
by ovedpo15 (Pilgrim) on Mar 30, 2022 at 07:23 UTC
    If someone can comment about my code/suggestion, I will really appreciate it!

      For future reference a "solution" or follow up to your original node will get more notice if you reply to your original node rather than updating that node. It also make it easier for us to see the original problem and the proposed solution. Note that you can go back and cut the solution section and post that as a new reply.

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
        Thanks for your suggested! Moved to a separated reply.