Finding nested directories from a string list.

superfrink has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Finding nested directories from a string list. by Corion (Patriarch) on Feb 24, 2006 at 20:47 UTC
I don't understand why you don't want `C:\\c` and `c:\\car` in your result list. From your explanation, I think your nested hash approach is an elegant solution. I have another solution that relies on the fact that a subdirectory will always sort at a later place than its parent: `use strict; my @paths = ( 'C:\\foo\\bar\\baz', 'C:\\foo\\bar', 'C:\\car', 'C:\\c' ); my $last = ""; my @nested; for (sort @paths) { if (not @nested) { push @nested, $_; } elsif (/^\Q$nested[-1]\E\\/i) { $nested[-1] = $_; } else { push @nested, $_; }; }; print "$_\n" for @nested;` [download] On Windows, we have a case-insensitive file system, so we want to compare case-insensitive. This will possibly fail if you have umlauts in your directory names that should collate to the same character.	[reply] [d/l] [select]
Re^2: Finding nested directories from a string list. by superfrink (Curate) on Feb 24, 2006 at 20:56 UTC
I don't understand why you don't want C:\\c and c:\\car in your result list. Actually I have a list of directories with property X. I need to filter out all instances of nested X-ness. ie any directory with property X must not appear under another directory with property X. In such a case I have to remove the property from the inner directory. It's an application thing. :)	[reply]
Re: Finding nested directories from a string list. by Roy Johnson (Monsignor) on Feb 24, 2006 at 20:30 UTC
Grep for paths that match another up to a backslash. `my @paths = ( 'C:\\foo\\bar\\baz', 'C:\\foo\\bar', 'C:\\car', 'C:\\c' ); my @nested = grep { my $candidate = $_; grep $candidate =~ /^\Q$_\E\\/, @paths } @paths; print "$_\n" for @nested;` [download] Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re^2: Finding nested directories from a string list. by QM (Parson) on Feb 24, 2006 at 20:53 UTC
Grep for paths that match another up to a backslash. That's quick and easy, but it's O(n^2). Watch out for big lists. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re^3: Finding nested directories from a string list. by QM (Parson) on Feb 24, 2006 at 21:50 UTC
#!/your/perl/here use strict; use warnings; use Benchmark qw(:all); use List::Util qw/shuffle/; my @words = 'a'..'g'; # create random pathnames my @paths; for (1..(shift or 1000)) { @words = shuffle(@words); push @paths, 'C:\\' . join '\\', @words[0..rand(@words)]; } # original, for test purposes #my @paths = ( # 'C:\\foo\\bar\\baz', # 'C:\\foo\\bar', # 'C:\\car', # 'C:\\c' # ); my $t0 = new Benchmark; my @nested = grep { my $candidate = $_; grep $candidate =~ /^\Q$_\E\\/, @paths } @paths; my $t1 = new Benchmark; print "$_\n" for @nested; print "Processing Time:", timestr(timediff($t1,$t0)), "\n"; exit; [download] For 1000, this takes about 30 seconds on my laptop (1.7GHz), about 100X more than the case for 100. Lots of handwaving would give you 30 minutes for 10K, and 2 days for 100K. Changing my init to match: #!/your/perl/here use strict; use warnings; use Benchmark qw(:all); use List::Util qw/shuffle/; my @words = 'a'..'g'; ## create random pathnames my @paths; for (1..(shift or 1000)) { @words = shuffle(@words); push @paths, 'C:\\' . join '\\', @words[0..rand(@words)]; } #my @paths = ( # 'C:\\foo\\bar\\baz', # 'C:\\foo\\bar', # 'C:\\car', # 'C:\\c' # ); my $t0 = new Benchmark; # remove duplicates my %paths; @paths{@paths} = (1) x @paths; my @nested; # create the tree of paths seen my $tree = {}; foreach my $path ( keys %paths ) { my @dirs = split /\\+/, $path; # insert in hash my $node = $tree; foreach my $dir ( @dirs ) { if ( exists( $node->{LEAF} ) ) { push @nested, $path; } # create the next nested hash if needed $node->{$dir} = {} unless exists $node->{$dir}; # move pointer to that hash $node = $node->{$dir}; } # if there are keys at this level (other than LEAF) my $keys = keys %{$node}; if ( ( $keys > 1 ) or ( ( $keys == 1 ) and ( exists $node->{LEAF} ) ) ) { push @nested, $path; } # mark this as a leaf $node->{LEAF}=1; } my $t1 = new Benchmark; print "$_\n" for @nested; print "Processing Time:", timestr(timediff($t1,$t0)), "\n"; [download] 1000 takes 50ms (not counting the print, just as above). It seems to scale well, with 100K taking 1.1s, 1M taking 2.3s. (Maybe the input needs scaling better -- the redundancy is probably high based on the word count.) -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l] [select]
Re^3: Finding nested directories from a string list. by superfrink (Curate) on Feb 24, 2006 at 21:01 UTC
I mentioned that when I asked about this in the chatterbox this morning. I could have say 1000 directories in my list. I think that's 1 000 000 regex matches. :\ I don't if that is too slow or not.	[reply]
Re^2: Finding nested directories from a string list. by superfrink (Curate) on Feb 24, 2006 at 20:48 UTC
Thanks! :) That is just what I did to begin with but I forgot the tailing backslash. Of course I thought "Hmm I need to detect the component boundary". I didn't follow up that train of thought. For some reason my mind jumped to a hash table to represent the filesystem tree. Today everything has been done "under pressure". :\	[reply]
Re: Finding nested directories from a string list. by QM (Parson) on Feb 24, 2006 at 20:51 UTC
Something like this? #!/your/perl/here use strict; use warnings; my @paths = ( 'C:\\foo\\bar\\baz', 'C:\\foo\\bar', 'C:\\car', 'C:\\c' ); # remove duplicates my %paths; @paths{@paths} = (1) x @paths; # create the tree of paths seen my $tree = {}; foreach my $path ( keys %paths ) { my @dirs = split /\\+/, $path; # insert in hash my $node = $tree; foreach my $dir ( @dirs ) { if ( exists( $node->{LEAF} ) ) { print "$path\n"; } # create the next nested hash if needed $node->{$dir} = {} unless exists $node->{$dir}; # move pointer down a level $node = $node->{$dir}; } # if there are keys at this level (other than LEAF) my $keys = keys %{$node}; if ( ( $keys > 1 ) or ( ( $keys == 1 ) and ( exists $node->{LEAF} ) ) ) { print "$path\n"; } # mark this as a leaf $node->{LEAF}=1; } __OUTPUT__ C:\foo\bar\baz [download] Instead of printing those out, you could just as easily add them to a hash for further processing: `$nested{$path} = 1;` [download] Update: Fixed typo in `exists` that the compiler didn't catch (??). -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l] [select]
Re^2: Finding nested directories from a string list. by superfrink (Curate) on Feb 27, 2006 at 16:24 UTC
This is great QM ! I had to make one change due to the order that keys are pulled out of a hash. I changed the code from: `foreach my $path ( keys %paths ) {` [download] to: `# : N.B. We explicitly iterate over the paths in order of the # number of directory components. This is so we don't create # directories for a child path before creating directories for a # parent path. # : sort the keys to %path by the number of directory components and # then by a string compare. (both comparisons are ascending.) my @sorted_paths = sort { $a =~ tr/\\// <=> $b =~ tr/\\// \|\| $a cmp $b } keys %paths; # : foreach $path in the @sorted_paths array foreach my $path (@sorted_paths) {` [download]	[reply] [d/l] [select]
Re^3: Finding nested directories from a string list. by QM (Parson) on Feb 27, 2006 at 16:58 UTC
As was stated elsewhere in this thread, you can sort on length instead. The parent will always sort before the child: `my @sorted_paths = sort { length($a) <=> length($b) } keys %paths;` [download] -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l]