Parsing a list of files to see if any contain any one of a list of comma delimited strings

OfficeLinebacker has asked for the wisdom of the Perl Monks concerning the following question:

Ok, the title is kind of a tongue twister, so let me start from the beginning:

The people who maintain the databases here have decided that a group of about 30 series will be moved to another database. So I need to find all the programs that refer to those series and fix them. Right now I'm only worried about finding them. So the first step was to do a grep like this:

fst/prod1% find . -name "*.inp" |& grep inp | xargs grep -li '\$open yields'
./CDS/default/uncertainty/i_fin.inp
...snip snip....
./stressind/UpdateReport.inp
./stressind/finfrag1.inp
./stressind/setupdata2.inp

So now I have the list of files I want to search, and a comma delimited list of the series (strings) I want to search for....there may be some linefeeds in there that I want to strip out:

    AAA10YR.B, AAA20YR.B, AAA2YR.B, AAA30YR.B, AAA5YR.B, AA7YR.B, HYTELECOMB2.B, HYTELECOMTAU1.B

etc.

I'm thinking that there has got to be a better way than brute force (i.e., X*Y loops, where X is the number of files and Y is the number of elements in the comma-delimited list).

This may not even be a Perl problem, but something that can be solved by grep, but I humbly submit my problem to my fellow Monks.

Terrence

UPDATE:

I suppose I could take the comma-delimited list, place each item on its own line, escape the periods, and surround each element with single quotes, writing the result to a file. Then I could just pipe the output of the grep command through one more grep, using the -f (get regexps from a file) option?
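
A minimal sketch of that idea, under stated assumptions: the file names (series.txt, patterns.txt, the demo directory) are made up for illustration, and the script uses grep's -F option (fixed strings) in place of escaping the periods by hand. Quotes are not needed inside a pattern file, since the shell never sees its contents; they would be matched literally. Empty lines are removed because an empty pattern matches every line.

```shell
#!/bin/sh
# Sketch only: series.txt, patterns.txt and the demo files are hypothetical names.
work=$(mktemp -d)

# A comma-delimited list with a stray linefeed, as described in the question.
printf 'AAA10YR.B, AAA20YR.B,\nAA7YR.B, HYTELECOMB2.B\n' > "$work/series.txt"

# One series per line: commas become newlines, blanks and empty lines are dropped.
tr ',' '\n' < "$work/series.txt" | tr -d ' \t' | sed '/^$/d' > "$work/patterns.txt"

# Two stand-in .inp files; only the first mentions a listed series.
mkdir "$work/demo"
printf '$open yields\nfetch AA7YR.B\n' > "$work/demo/hit.inp"
printf '$open yields\nfetch AA7YRXB\n' > "$work/demo/miss.inp"

# First grep: keep the files that contain the literal text "$open yields".
# Second grep: -F treats each line of patterns.txt as a fixed string, so "."
# needs no escaping; -l prints only the names of the matching files.
matches=$(find "$work/demo" -name '*.inp' -exec grep -l '\$open yields' {} + |
  xargs grep -lF -f "$work/patterns.txt")
echo "$matches"
```

Only hit.inp is listed. With an unescaped regex pattern, AA7YR.B would also match AA7YRXB in miss.inp, which is why the periods must either be escaped or neutralised with -F.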

UPDATE2: I have to admit I did it in the manner proposed by the first update; I felt that running commands at the prompt one-by-one was easier (remember laziness is a good trait in programmers!) and as stated below, I could track my changes as I applied them rather than writing a whole program and hoping it worked.

The next thing I would like to try is to learn how to do anything sed can do with Perl command-line statements. Any pointers on where I can get to that? (I am assuming it's possible, since Perl was (is?) advertised as a replacement for sed and awk.)

I'll go ahead and super search for sed, so don't trouble to respond if you think I'll find it OK.

Thanks again, T.

Replies are listed 'Best First'.
Re: Parsing a list of files to see if any contain any one of a list of comma delimited strings by graff (Chancellor) on Apr 21, 2006 at 01:56 UTC
On the one hand, you may be right about this not even being a perl problem, but OTOH, encapsulating this problem into a perl script might be a simple, coherent way to keep track of what the process is really supposed to be (assuming you put some coherent commentary in the script).

If I understand the problem, you have a list of patterns to search for in a set of .inp files. If you have that list stored in a file, with commas and whitespace as you have indicated, then the script starts by turning that list into a suitable regex pattern. Then it finds the files in which this pattern needs to be sought. Then you'll do something with the list of files where the regex matched:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # get the regex pattern
    my $regex;
    open( my $list, '<', "that.list" ) or die "that.list: $!";
    {
        local $/;          # slurp mode: read the whole list in one go
        $_ = <$list>;
        s/^\s+//;          # remove initial whitespace, if any
        s/[,\s]+$//;       # remove final comma and whitespace
        s/\s*,\s*/|/g;     # convert internal separators to "|"
        s/\./\\./g;        # escape period characters
        $regex = $_;
    }
    close $list;

    # now look for candidate files
    # (I prefer using unix "find"...)
    my @found;
    open( my $find, '-|', "find . -name '*.inp'" ) or die "find: $!";
    while ( my $datafile = <$find> ) {
        chomp $datafile;
        if ( open( my $in, '<', $datafile ) ) {
            local $/;      # slurp each candidate file
            my $text = <$in>;
            push @found, $datafile
                if ( $text =~ /\$open yields/ and $text =~ /$regex/ );
            close $in;
        }
        else {
            warn "open failed on $datafile: $!\n";
        }
    }
    close $find;
    # @found now contains the set of file names that are needed

Of course, the original list file (containing the patterns to search for) could be a command-line arg or piped in on this script's STDIN, in which case forget the "open( my $list, ..." statement, and read the regex data with `$_ = <>;`

(It might be good to state the actual path where the target files are supposed to be found, and put that path into the "find" command in the script, rather than assume that the CWD will be the correct one when the script runs.)
Re^2: Parsing a list of files to see if any contain any one of a list of comma delimited strings by OfficeLinebacker (Chaplain) on Apr 21, 2006 at 13:39 UTC
graff: I guess depending on your areas of expertise and skill, it could be any kind of problem. It seems like the type of problem particularly well suited to Perl, and I think it would be a nice exercise. Eventually I would like to get to the point where I would automatically write up a Perl script in response to this problem (and then, eventually....have it work on the first go!). Since I found this site I am definitely thinking in a more Perl-like mode.

However, I WILL admit that I did the preliminary filtering of the pattern file with sed, one step at a time (both because I wanted to check my progress after each step and because I am more familiar with one-off command-line usage of sed, though I would like to learn how to use Perl like that).

Finally, wouldn't one want to use File::Find for the finding part? Thanks for the responses!

Edit: I just saw your comment that you "prefer using Unix find"--my apologies.
Re^3: Parsing a list of files to see if any contain any one of a list of comma delimited strings by graff (Chancellor) on Apr 22, 2006 at 06:19 UTC
Moving from sed to perl for one-liner operations on the command line (esp. in pipes) will be a lot easier once you get acquainted with the relevant option flags for perl -- browse through perlrun for a wealth of opportunities. Anything you would do with sed -- and a lot more that is hard to conceive of with sed -- is possible using "-e script" along with "-p" or "-n"; awk-ish stuff is done using "-a"; and "-l" can be very handy, as is "-M".

For lots of simple things, sed is still likely to involve fewer characters to type on the command line (and of course it's likely to run a bit faster). But a lot of things are really not feasible in sed or awk (using executable code as part of a regex replacement, handling non-ascii character data, etc.), yet end up being pretty short work in perl.

(BTW, I prefer unix "find" because, on any file tree of appreciable size -- thousands of files -- File::Find took about 5 times longer the last time I checked.)
Re: Parsing a list of files to see if any contain any one of a list of comma delimited strings by GrandFather (Saint) on Apr 20, 2006 at 22:28 UTC
Super Searching 'files search' finds a plethora of nodes that ask similar questions. The general answer is to build a regex from the strings you want to search for, then work your way through the files a line at a time looking for matches against the regex. Modules like Regexp::Any and Regexp::Assemble can help a lot with generating the regex for you from a bunch of strings.

DWIM is Perl's answer to Gödel