james28909 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to find any non english files or folders and copy them to a different directory, which everythign is working great except the pattern match.

I have a directory of files and each file in this directory contain a pattern to match with, like this for example:
*\PL\* *\pl-*.fbrb *_pl-*.fbrb *\*_pl-00.fbrb *\pl-00.fbrb *\polish\* *\psarc\polish*.psarc *_POL.* *_POLISH.SUB *_POL_* *_po.xvag *_polish.* *.pl.* *_pl.psarc *_pl2.psarc *_pol.* *_por.*
I slurp this data into an array called @patterns. and then i recursively search the directory tree to find any files or folders that contain matches from the patterns in the @patterns array, which is not working. i am not sure about how to setup the regex match using wildcard chars from an array.
use File::Slurp; use File::Copy::Recursive qw(rmove); use File::Basename; my @patterns; my @dirs; my $searchDIR = $ARGV[0]; search_dirs($searchDIR); get_patterns(); move_files(); sub search_dirs{ my ($dir) = @_; my ($dh); if ( !opendir( $dh, $dir ) ) { return; } while ( my $file = readdir($dh) ) { next if ( -d $file ); my $path = "$dir/$file"; if ( -d $path ) { search_dirs("$path"); } else { push( @dirs, "$dir/$file" ); } } } sub get_patterns{ my @files = read_dir('patterns'); foreach my $element(@files){ open my $file, '<', "patterns/$element"; while(<$file>){ chomp $_; push @patterns, $_; } close($file); } } sub move_files{ for my $element(@dirs){ my($filename, $path, $ext) = fileparse($element, qr/\.[^.]*/); if ( $filename =~ $patterns ) { print "Found $element\n"; #rmove($element, "non-english/$path$filename$ext"); } if ( $path =~ $patterns ) { print "Found $path\n"; #rmove($path, "non-english/$path); } } }
if you need me to give more info please let me know. any insight would be appreciated, and thanks

Replies are listed 'Best First'.
Re: Using an array that contains wildcard characters for pattern matching.
by GrandFather (Saint) on Oct 28, 2014 at 01:42 UTC

    You need to turn file wildcards * (and probably ?) into the equivalent regular expression matches. The following code does that and combines the resulting match strings into a single regular expression. Note that two versions of the pattern are returned to facilitate sorting the match stings in order of most explicit to least explicit. That may not be important, but you should think about the implications.

    use strict; use warnings; my $patFile = <<PATS; *.pl.* *\\PL\\* *\\pl-*.fbrb *_pl-*.fbrb *\\*_pl-00.fbrb *\\pl-00.fbrb *\\polish\\* *\\psarc\\polish*.psarc *_POL.* *_POLISH.SUB *_POL_* *_po.xvag *_polish.* *_pl.psarc *_pl2.psarc *_pol.* *_por.* PATS open my $patIn, '<', \$patFile; my @patterns = map {parsePattern($_)} <$patIn>; my $matchStr = join '|', map {$_->[1]} sort {length($b->[0]) <=> length($a->[0])} @patterns; my $regex = qr/($matchStr)/; print "Match regex is '$matchStr'\n"; while (<DATA>) { chomp; print "Matched '$_' on $1\n" if $_ =~ $regex; } sub parsePattern { my ($path) = @_; chomp $path; (my $explicit = $path) =~ tr/*//d; $path =~ s![\\/]![\\\\/]!g; $path =~ s/\./\\./g; $path =~ s/^\*//; $path .= '$' if $path !~ s/\*$//; $path =~ s/\*/.*/g; $path =~ s/\?/./g; return [$explicit, $path]; } __DATA__ c:\Build\PL\Data\test1.dat c:\Build\Data\test1.dat.wibble_POLISH.SUB c:\Build\Data\test1.dat.wibble_POLISH_SUB

    Prints:

    Match regex is '[\\/]psarc[\\/]polish.*\.psarc$|[\\/].*_pl-00\.fbrb$|[ +\\/]pl-00\.fbrb$|_POLISH\.SUB$|_pl2\.psarc$|[\\/]pl-.*\.fbrb$|_pl-.*\ +.fbrb$|_pl\.psarc$|[\\/]polish[\\/]|_po\.xvag$|_polish\.|_POL\.|_POL_ +|_pol\.|_por\.|\.pl\.|[\\/]PL[\\/]' Matched 'c:\Build\PL\Data\test1.dat' on \PL\ Matched 'c:\Build\Data\test1.dat.wibble_POLISH.SUB' on _POLISH.SUB

    As a further generalization both / and \ path separators are accepted.

    Perl is the programming world's equivalent of English
      this works great until i add anything like \turkish or \localization and i think it has to do with \t ect, but it can be easily rectified i believe. thanks for the great example as well :)
        for further reference i will attach these pattern files and a slightly modified version of my code using your regex.
        use strict; use warnings; use diagnostics; use File::Slurp; use File::Copy::Recursive qw(rcopy fcopy); use File::Basename; my @dirs; my @patterns; my $dir1 = $ARGV[0]; get_filelist($dir1); get_patterns(); move_files(); mkdir('./ripped'); sub get_filelist { my ($dir) = @_; my ($dh); if ( !opendir( $dh, $dir ) ) { return; } while ( my $file = readdir($dh) ) { next if ( -d $file ); my $path = "$dir/$file"; if ( -d $path ) { get_filelist("$path"); } else { push( @dirs, "$dir/$file" ); } } } sub get_patterns{ my @files = read_dir('patterns'); foreach my $element(@files){ open my $file, '<', "patterns/$element"; while(<$file>){ #$_ =~s/[*]/.*/g; #$_ =~s/\\/\//g; $_ =~ s[\\/][\\\\/]g; $_ =~ s/\\/\\\//g; $_ =~ s/\./\\./g; $_ =~ s/^\*//; $_ .= '$' if $_ !~ s/\*$//; $_ =~ s/\*/.*/g; $_ =~ s/\?/./g; chomp $_; $_ = $_.'|'; push @patterns, $_; } close($file); } } sub move_files{ my $regex = join('', @patterns); print $regex; for my $element(@dirs){ if ($element =~ $regex){ print "Found $element\n"; # rcopy($element, "ripped/$element"); } } }
        here is the compelet pattern set. pattern files I thought i had it working, but it was still missing alot of files, and still copying alot of files it wasnt suppose to and i am unsure on which way to go with it now. i have tried replacing '*' with '.*' and '?' with '.', but it is still failing because it is replacing all instances of * and ?

        i have also found Regexp::Wildcards i will see how that works as well
Re: Using an array that contains wildcard characters for pattern matching.
by crashtest (Curate) on Oct 28, 2014 at 01:25 UTC

    It would be helpful if you described how your program isn't working. I gave this a run on my machine and it seemed every file and directory was printed as a match. Is that what you're experiencing?

    As far as insight, I think you'll find putting use strict; at the top of your script (and use warnings;) will help a lot. It reports:

    Global symbol "$patterns" requires explicit package name ... line 57 Global symbol "$patterns" requires explicit package name ... line 62

    The array @patterns is a separate variable from the scalar $patterns, which you're using in your "if" statement. You'll probably want something along the lines of:

    if (grep { $path =~ qr/$_/ } @patterns) ...

    Having said that, I see a second problem in the format of your search patterns. They aren't regular expressions, they're shell-like globs. For example, *\PL\* is not a valid regex; you'd want .*\PL\.*.

      yes it is copying every file, which is NOT what i want it to do ofcourse lol. sorry for leaving that out. and looks like i will need to redo my pattern files. but i want the patterns to help filter out the files i do not want, and just move them to a different directory.

      i now see what you mean about the shell patterns. here is a webpage i just found that describes the problem and a solution. http://bioinfo2.ugr.es/documentation/Perl_Cookbook/ch06_10.htm
Re: Using an array that contains wildcard characters for pattern matching.
by Anonymous Monk on Oct 28, 2014 at 02:32 UTC

    Which part is broken?

      it is the matching part that is nto working, but it was because of the shell patterns. i am checking out grandfathers code, particularly the parsepattern sub.