Re: Parsing a list of files to see if any contain any one of a list of comma delimited strings

On the one hand, you may be right about this not even being a perl problem, but OTOH, encapsulating this problem into a perl script might be a simple, coherent way to keep track of what the process is really supposed to be (assuming you put some coherent commentary in the script).

If I understand the problem, you have a list of patterns to search for in a set of *.inp files. If that list is stored in a file, with commas and whitespace as you have indicated, then the script starts by turning that list into a suitable regex pattern. Then it finds the files in which this pattern needs to be sought, and searches each one. Then you'll do something with the list of files where the regex matched:

#!/usr/bin/perl

use strict;
use warnings;

# get the regex pattern
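my $regex;
open( my $list, '<', "that.list" ) or die "that.list: $!";
{
    local $/;                  # slurp mode: read the whole file at once
    my $text = <$list>;
    defined $text or die "that.list is empty\n";
    $text =~ s/^\s+//;         # remove initial whitespace, if any
    $text =~ s/[,\s]+$//;      # remove final comma and whitespace
    $text =~ s/\s*,\s*/\|/g;   # convert internal separators to "|"
    $text =~ s/\./\\./g;       # escape period characters
    length $text or die "that.list: no search strings found\n";
    $regex = qr/$text/;        # compile once, before the file loop
}
close $list;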

# now look for candidate files
# (I prefer using unix "find"...)

my @found;

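open( my $find, '-|', "find . -name '*.inp'" ) or die "find: $!";
while ( my $datafile = <$find> ) {
    chomp $datafile;
    if ( open( my $in, '<', $datafile )) {
        local $/;    # slurp mode: read the whole file at once
        my $content = <$in>;
        close $in;
        next unless defined $content;    # skip empty files
        push @found, $datafile
            if ( $content =~ /\$open yields/ and $content =~ /$regex/ );
    } else {
        warn "open failed on $datafile: $!\n";
    }
}
close $find;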

# @found now contains the set of file names that are needed
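
The script above escapes only period characters. If the search strings can contain other regex metacharacters (say "+", "(" or "$"), it may be safer to split the list first and escape each string with quotemeta. This is only a minimal sketch, assuming the same comma-and-whitespace layout for the list; the sub name build_regex and the sample strings are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_regex() is a hypothetical helper, not part of the script above.
# Given the raw text of the list file, it returns a compiled regex that
# matches any one of the listed strings literally.
sub build_regex {
    my ($text) = @_;
    $text =~ s/^\s+//;         # remove initial whitespace, if any
    $text =~ s/[,\s]+$//;      # remove final comma(s) and whitespace
    my @strings = grep { length } split /\s*,\s*/, $text;
    die "no search strings found\n" unless @strings;
    # quotemeta escapes every regex metacharacter, not just "."
    my $alternation = join '|', map { quotemeta } @strings;
    return qr/$alternation/;
}

# Made-up sample list, in the comma-and-whitespace layout assumed above.
my $regex = build_regex("abc.1, x+y ,\n  foo(bar) ,\n");
for my $line ( 'value x+y here', 'value xxy here' ) {
    printf "%-16s %s\n", $line, ( $line =~ $regex ? 'matches' : 'no match' );
}
```

The returned value can be used in place of $regex in the file loop, since a qr// object interpolates into /$regex/ like any other pattern.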

Of course, the original list file (containing the patterns to search for) could be a command-line arg or piped in on this script's STDIN, in which case forget the open (and close) of "that.list", and read the list data with the diamond operator, <>, instead.
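
A rough sketch of that idea follows. The helper name slurp_list and the sample text are assumptions for illustration only; the demonstration reads from an in-memory filehandle so that it runs without any input file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# slurp_list() is a hypothetical helper: it reads everything from an
# already-open filehandle in one go and returns it as a single string.
sub slurp_list {
    my ($fh) = @_;
    local $/;              # undefined record separator: read to end of file
    my $text = <$fh>;
    return defined $text ? $text : '';
}

# In the real script the list would come from STDIN or the command line:
#     my $list = slurp_list( \*STDIN );      # list piped in
#     my $list = do { local $/; <> };        # file named in @ARGV, or STDIN
# For a self-contained demonstration, read from an in-memory filehandle.
my $sample = "alpha.1, beta.2 ,\n gamma.3,\n";
open( my $in, '<', \$sample ) or die "in-memory open: $!";
my $list = slurp_list($in);
close $in;

print "read ", length($list), " characters\n";
```

The string that comes back still needs the same trimming and separator conversion as before to become a regex.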

(It might be good to state the actual path where the target files are supposed to be found, and put that path into the "find" command in the script, rather than assume that the CWD will be the correct one when the script runs.)
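
One way to do that, sketched under some assumptions: the starting directory is taken from the first command-line argument (falling back to "." only as a placeholder), and the sub name find_inp_files is made up. The list form of the pipe open runs "find" without a shell, so the path needs no quoting:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# find_inp_files() is a hypothetical helper. It runs unix "find" under the
# given directory and returns the names of everything matching *.inp there.
sub find_inp_files {
    my ($root) = @_;
    # List-form pipe open: no shell is involved, so $root needs no quoting.
    open( my $find, '-|', 'find', $root, '-name', '*.inp' )
        or die "find: $!";
    chomp( my @files = <$find> );
    close $find or warn "find exited with status $?\n";
    return @files;
}

# The root directory here is only a placeholder; the script should name
# the actual location of the target files.
my $root = @ARGV ? $ARGV[0] : '.';
print "$_\n" for find_inp_files($root);
```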


Re^2: Parsing a list of files to see if any contain any one of a list of comma delimited strings by OfficeLinebacker (Chaplain) on Apr 21, 2006 at 13:39 UTC
graff: I guess depending on your areas of expertise and skill, it could be any kind of problem. It seems like the type of problem particularly well suited to Perl, and I think it would be a nice exercise. Eventually I would like to get to the point where I would automatically write up a Perl script in response to this kind of problem (and then, eventually... have it work on the first go!). Since I found this site I am definitely thinking in a more Perl-like mode. However, I WILL admit that I did the preliminary filtering of the pattern file with sed, one step at a time (both because I wanted to check my progress after each step and because I am more familiar with one-off command-line usage of sed, though I would like to learn how to use Perl like that). Finally, wouldn't one want to use File::Find for the finding part? Thanks for the responses! Edit: I just saw your comment that you "prefer using Unix find" -- my apologies.
Re^3: Parsing a list of files to see if any contain any one of a list of comma delimited strings by graff (Chancellor) on Apr 22, 2006 at 06:19 UTC
Moving from sed to perl for one-liner operations on the command line (esp. in pipes) will be a lot easier once you get acquainted with the relevant option flags for perl -- browse through perlrun for a wealth of opportunities. Anything you would do with sed -- and a lot more that is hard to conceive of with sed -- is possible using "-e script" along with "-p" or "-n"; awk-ish stuff is done using "-a"; and "-l" can be very handy, as is "-M". For lots of simple things, sed is still likely to involve fewer characters to type on the command line (and of course it's likely to run a bit faster). But a lot of things that are really not feasible in sed or awk (using executable code as part of a regex replacement, handling non-ascii character data, etc.) end up being pretty short work in perl. (BTW, I prefer unix "find" because, on any file tree of appreciable size -- thousands of files -- File::Find took about 5 times longer the last time I checked.)
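
For anyone who wants to try the File::Find route anyway, here is a minimal sketch of the file search done that way. The sub name inp_files_under is made up, and the starting directory defaults to "." only as a placeholder; no claim is made here about how its speed compares on a given tree:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# inp_files_under() is a hypothetical helper: it walks the tree below
# $root with File::Find and returns the path of every plain *.inp file.
sub inp_files_under {
    my ($root) = @_;
    my @files;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and
            # $File::Find::name is the path including $root.
            push @files, $File::Find::name if /\.inp\z/ and -f $_;
        },
        $root
    );
    return @files;
}

my $root = @ARGV ? $ARGV[0] : '.';    # placeholder starting directory
print "$_\n" for inp_files_under($root);
```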