http://qs1969.pair.com?node_id=1146145

GotToBTru has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for either suggestions for improvement or ways to use existing modules like File::Find::*.

I wrote a utility to search through file archives organized according to a direction/date/topic structure. I usually know which direction and topic to search, but the transaction may have been archived on a range of days. I wrote my own very limited File::Find (code below) in order to implement this search.

For instance, say I am looking for a transaction we sent containing the string "12345678". I know it's for CustomerD, and I'm pretty sure we sent it this week, so it could be in any of:

outbound/20151027/CustomerD
outbound/20151026/CustomerD
...
outbound/20151021/CustomerD

outbound/2015nnnn/ will have many subdirectories, and some of them will have hundreds of files. As a result, if I can't supply the topic, I run the search in the background and work on something else. But if I can, the response is quick enough.
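
To make the date arithmetic concrete, here is a minimal sketch of how those per-day patterns can be produced with Date::Calc; the seven-day window and the CustomerD topic are just the example values from above, and the full script below does the real work:

use strict;
use warnings;
use Date::Calc qw(Today Add_Delta_Days);

# Print the candidate directory pattern for each of the last seven days
my ($year, $month, $day) = Today();
for my $back (0 .. 6) {
    my ($y, $m, $d) = Add_Delta_Days($year, $month, $day, -$back);
    printf "outbound/%d%02d%02d/CustomerD\n", $y, $m, $d;
}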

So why explore modules if I have a working solution? Learning what's in CPAN, and how to better use it, is to my benefit.

Source code:

#!/home/edi/perl/perl
use strict;
use warnings;
use Getopt::Std;
use Date::Calc qw/Today Add_Delta_Days/;

getopts('ior:s:d:b:');
our ($opt_i, $opt_o, $opt_r, $opt_s, $opt_d, $opt_b);
my ($mode, $search_regex, $days, $business_process);

die "Usage: search_si_archive.pl [-[io]] (-s searchstring | -r regex) [-d daysback] [-b bpname]\n"
    unless ($opt_s || $opt_r);

# default to inbound unless -o was given
$mode = 'inbound';
$mode = 'outbound' if ($opt_o);
$search_regex = qr/$opt_r/     if ($opt_r);
$search_regex = qr/\Q$opt_s\E/ if ($opt_s);
$days = defined($opt_d) ? $opt_d : 7;

if ($opt_b) {
    $business_process = '*' . $opt_b . '*';
}
else {
    $business_process = '*';
}

my ($year, $month, $day) = Today();

# for each day from today back $days days
while ($days >= 0) {
    my ($y, $m, $d) = Add_Delta_Days($year, $month, $day, -$days--);
    my $datestring  = sprintf("%d%02d%02d", $y, $m, $d);
    my $directory   = sprintf("/edi_store/archive/%s/%s/%s", $mode, $datestring, $business_process);
    my @dirlist     = grep { -d } glob($directory);
    foreach my $dir (@dirlist) {
        opendir DIR, $dir;
        search_file($dir, $_) for (grep { -f $dir . '/' . $_ } readdir DIR);
        closedir DIR;
    }
}

# scan one file, print its name on the first match, and stop reading it
sub search_file {
    my $fname = sprintf("%s/%s", @_);
    open my $fh, '<', $fname;
    while (<$fh>) {
        if (m/$search_regex/) {
            print "$fname\n";
            last;
        }
    }
    close($fh);
}

__END__

=pod

=head1 Search SI Archive

Search through SI archive directories for a string or regex, restricted by age and/or BP.

=head1 USAGE

search_si_archive.pl -[io] -[sr STRING] [-d DAYS|7] [-b BPNAME]

=over

=item -i

INBOUND - search will start in the /edi_store/archive/inbound/ directory tree.
If neither -i nor -o is indicated, this will be the default.

=item -o

OUTBOUND - search will start in the /edi_store/archive/outbound/ directory tree.

=item -s STRING

SEARCH - files will be searched for this literal string.
Either this or -r must be specified.

=item -r STRING

REGEX - files will be searched for this regular expression.
Either this or -s must be specified.

=item -d DAYS

DAYS BACK - search will start in today's tree. If this value is specified, the search
will be repeated this number of times, moving backward in time one day with each
iteration. If today is Monday, 3 would search today, Sunday, Saturday, and Friday.
If no value is specified, it will search 7 days back.

=item -b NAME

BUSINESS PROCESS - only directories whose name contains this string will be searched.
If no value is specified, all directories will be searched.

=back

=head1 Examples

=over

=item 1. search_si_archive.pl -i -s DEPOT -d 0 -b AS2

Files in subdirectories of /edi_store/archive/inbound/YYYYMMDD whose name contains
the string AS2 will be searched for the string DEPOT.

=back

=head1 Author

Howard Parks
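
Since the question mentions File::Find::*, here is a rough sketch of how the directory walk (not the content scan) might look with File::Find::Rule. The paths and option values are carried over from the script above as examples; the module usage itself is only an illustration, not a tested drop-in replacement:

use strict;
use warnings;
use File::Find::Rule;
use Date::Calc qw(Today Add_Delta_Days);

# Example values standing in for the -o/-b/-d/-s options above
my $mode = 'outbound';
my $bp   = 'CustomerD';
my $days = 7;
my $search_regex = qr/\Q12345678\E/;

# Build the list of existing per-day topic directories, as the script does
my ($year, $month, $day) = Today();
my @dirs;
for my $back (0 .. $days) {
    my ($y, $m, $d) = Add_Delta_Days($year, $month, $day, -$back);
    my $datestring  = sprintf "%d%02d%02d", $y, $m, $d;
    push @dirs, grep { -d } glob "/edi_store/archive/$mode/$datestring/*$bp*";
}

# Let File::Find::Rule enumerate the plain files under those directories
my @files = @dirs ? File::Find::Rule->file->in(@dirs) : ();

# The content scan itself stays the same as in search_file() above
for my $fname (@files) {
    open my $fh, '<', $fname or next;
    while (<$fh>) {
        if (/$search_regex/) { print "$fname\n"; last }
    }
    close $fh;
}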
Dum Spiro Spero

Re: Searching over multiple directories using unusual logic
by shmem (Chancellor) on Oct 27, 2015 at 20:25 UTC

    I don't know your environment, so it is difficult to advocate any particular improvement. It all depends on where your program is spending too much time, if that is even a concern: on how often the searched directories are updated, on the timespan searched through, on the number of files in each, and on the size of the files searched. And it all depends on Laziness, Impatience and Hubris.

    Off the top of my head, some things to look at:

    • you could delegate the file search to the Findutils suite: run updatedb at reasonable intervals, build a query for locate, and filter the list the shelled-out locate returns (impatience; see the sketch after this list)
    • you could, depending on the file size, use either grep or perl to search through the files (impatience; also covered in the sketch below)
    • have a look at ack for inspiration, if you are going to satisfy your hubris
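
    For the first two points, a rough sketch of what shelling out might look like, assuming an updatedb database that already indexes /edi_store/archive; the date, topic, and search string are just the example values from the original post:

        use strict;
        use warnings;

        # Ask locate for candidate files for one day/topic, then let
        # grep -l report which of them contain the search string.
        my @candidates = grep { -f }
                         map  { chomp; $_ }
                         qx(locate '/edi_store/archive/outbound/20151027/*CustomerD*');

        if (@candidates) {
            system 'grep', '-l', '12345678', @candidates;
        }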

    For laziness, "if it ain't broke, don't fix it": if there are no complaints, just leave the working solution in place.

    Hopefully others can provide more ideas...

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'