superfrink has asked for the wisdom of the Perl Monks concerning the following question:

I have searched for an answer with out luck. I have a tried a few things without 100% success. The problem is I have a list of directory paths. I want to get a list of paths that are under other paths.

For this list of paths:
my @paths = ( 'C:\\foo\\bar\\baz', 'C:\\foo\\bar', 'C:\\car', 'C:\\c' );
I want to get back a list like:
my @nested = ( 'C:\\foo\\bar\\baz' );

I tried building a hash of hashes to represent the tree.
$VAR1 = { 'C:' => { 'c' => {}, 'car' => {}, 'foo' => { 'bar' => { 'baz' => {} } } } };

Now I can get the leaf nodes but that ends up being:
my @nested = ( 'C:\\c', 'C:\\car', 'C:\\foo\\bar\\baz' );
Maybe I can take the list of paths and remove the list of leaf paths, this gives me the parent list. Then find the children of parents.

Any other suggestions?

Replies are listed 'Best First'.
Re: Finding nested directories from a string list.
by Corion (Patriarch) on Feb 24, 2006 at 20:47 UTC

    I don't understand why you don't want C:\\c and c:\\car in your result list. From your explanation, I think your nested hash approach is an elegant solution. I have another solution that relies on the fact that a subdirectory will always sort at a later place than its parent:

    use strict; my @paths = ( 'C:\\foo\\bar\\baz', 'C:\\foo\\bar', 'C:\\car', 'C:\\c' ); my $last = ""; my @nested; for (sort @paths) { if (not @nested) { push @nested, $_; } elsif (/^\Q$nested[-1]\E\\/i) { $nested[-1] = $_; } else { push @nested, $_; }; }; print "$_\n" for @nested;

    On Windows, we have a case-insensitive file system, so we want to compare case-insensitive. This will possibly fail if you have umlauts in your directory names that should collate to the same character.

      I don't understand why you don't want C:\\c and c:\\car in your result list.

      Actually I have a list of directories with property X. I need to filter out all instances of nested X-ness. ie any directory with property X must not appear under another directory with property X. In such a case I have to remove the property from the inner directory. It's an application thing. :)
Re: Finding nested directories from a string list.
by Roy Johnson (Monsignor) on Feb 24, 2006 at 20:30 UTC
    Grep for paths that match another up to a backslash.
    my @paths = ( 'C:\\foo\\bar\\baz', 'C:\\foo\\bar', 'C:\\car', 'C:\\c' ); my @nested = grep { my $candidate = $_; grep $candidate =~ /^\Q$_\E\\/, @paths } @paths; print "$_\n" for @nested;

    Caution: Contents may have been coded under pressure.
      Grep for paths that match another up to a backslash.
      That's quick and easy, but it's O(n^2). Watch out for big lists.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        #!/your/perl/here use strict; use warnings; use Benchmark qw(:all); use List::Util qw/shuffle/; my @words = 'a'..'g'; # create random pathnames my @paths; for (1..(shift or 1000)) { @words = shuffle(@words); push @paths, 'C:\\' . join '\\', @words[0..rand(@words)]; } # original, for test purposes #my @paths = ( # 'C:\\foo\\bar\\baz', # 'C:\\foo\\bar', # 'C:\\car', # 'C:\\c' # ); my $t0 = new Benchmark; my @nested = grep { my $candidate = $_; grep $candidate =~ /^\Q$_\E\\/, @paths } @paths; my $t1 = new Benchmark; print "$_\n" for @nested; print "Processing Time:", timestr(timediff($t1,$t0)), "\n"; exit;
        For 1000, this takes about 30 seconds on my laptop (1.7GHz), about 100X more than the case for 100. Lots of handwaving would give you 30 minutes for 10K, and 2 days for 100K.

        Changing my init to match:

        #!/your/perl/here use strict; use warnings; use Benchmark qw(:all); use List::Util qw/shuffle/; my @words = 'a'..'g'; ## create random pathnames my @paths; for (1..(shift or 1000)) { @words = shuffle(@words); push @paths, 'C:\\' . join '\\', @words[0..rand(@words)]; } #my @paths = ( # 'C:\\foo\\bar\\baz', # 'C:\\foo\\bar', # 'C:\\car', # 'C:\\c' # ); my $t0 = new Benchmark; # remove duplicates my %paths; @paths{@paths} = (1) x @paths; my @nested; # create the tree of paths seen my $tree = {}; foreach my $path ( keys %paths ) { my @dirs = split /\\+/, $path; # insert in hash my $node = $tree; foreach my $dir ( @dirs ) { if ( exists( $node->{LEAF} ) ) { push @nested, $path; } # create the next nested hash if needed $node->{$dir} = {} unless exists $node->{$dir}; # move pointer to that hash $node = $node->{$dir}; } # if there are keys at this level (other than LEAF) my $keys = keys %{$node}; if ( ( $keys > 1 ) or ( ( $keys == 1 ) and ( exists $node->{LEAF} ) ) ) { push @nested, $path; } # mark this as a leaf $node->{LEAF}=1; } my $t1 = new Benchmark; print "$_\n" for @nested; print "Processing Time:", timestr(timediff($t1,$t0)), "\n";
        1000 takes 50ms (not counting the print, just as above). It seems to scale well, with 100K taking 1.1s, 1M taking 2.3s. (Maybe the input needs scaling better -- the redundancy is probably high based on the word count.)

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

        I mentioned that when I asked about this in the chatterbox this morning. I could have say 1000 directories in my list. I think that's 1 000 000 regex matches. :\ I don't if that is too slow or not.
      Thanks! :)

      That is just what I did to begin with but I forgot the tailing backslash. Of course I thought "Hmm I need to detect the component boundary". I didn't follow up that train of thought. For some reason my mind jumped to a hash table to represent the filesystem tree.

      Today everything has been done "under pressure". :\
Re: Finding nested directories from a string list.
by QM (Parson) on Feb 24, 2006 at 20:51 UTC
    Something like this?
    #!/your/perl/here use strict; use warnings; my @paths = ( 'C:\\foo\\bar\\baz', 'C:\\foo\\bar', 'C:\\car', 'C:\\c' ); # remove duplicates my %paths; @paths{@paths} = (1) x @paths; # create the tree of paths seen my $tree = {}; foreach my $path ( keys %paths ) { my @dirs = split /\\+/, $path; # insert in hash my $node = $tree; foreach my $dir ( @dirs ) { if ( exists( $node->{LEAF} ) ) { print "$path\n"; } # create the next nested hash if needed $node->{$dir} = {} unless exists $node->{$dir}; # move pointer down a level $node = $node->{$dir}; } # if there are keys at this level (other than LEAF) my $keys = keys %{$node}; if ( ( $keys > 1 ) or ( ( $keys == 1 ) and ( exists $node->{LEAF} ) ) ) { print "$path\n"; } # mark this as a leaf $node->{LEAF}=1; } __OUTPUT__ C:\foo\bar\baz
    Instead of printing those out, you could just as easily add them to a hash for further processing:
    $nested{$path} = 1;

    Update: Fixed typo in exists that the compiler didn't catch (??).

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      This is great QM ! I had to make one change due to the order that keys are pulled out of a hash. I changed the code from:
      foreach my $path ( keys %paths ) {
      to:
      # : N.B. We explicitly iterate over the paths in order of the # number of directory components. This is so we don't create # directories for a child path before creating directories for a # parent path. # : sort the keys to %path by the number of directory components and # then by a string compare. (both comparisons are ascending.) my @sorted_paths = sort { $a =~ tr/\\// <=> $b =~ tr/\\// || $a cmp $b } keys %paths; # : foreach $path in the @sorted_paths array foreach my $path (@sorted_paths) {
        As was stated elsewhere in this thread, you can sort on length instead. The parent will always sort before the child:
        my @sorted_paths = sort { length($a) <=> length($b) } keys %paths;

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of