in reply to How to efficiently search for list of strings in multiple files?
G'day stray_tachyon,
There are a number of things you can do to improve this code. Here's a list of some of the things I found.
I put together a short script to show how all of those points might be implemented. I also dummied up some highly contrived data just to give the script something to work on.
Here's union.txt (note the second line has an extra space):
$ cat union.txt
ABC Union
XYZ  Union
I then created a number of very short files in two directories. These have one, two or no matches; one has a name spread over two lines with a slew of extra whitespace; one file is completely empty.
$ for i in agreements other; do for j in `ls $i`; do echo "*** $i/$j ***"; cat $i/$j; done; done
*** agreements/abc.txt ***
... ABC Union ...
*** agreements/abc_xyz.pdf ***
... XYZ Union and ABC Union ...
*** agreements/def.txt ***
... DEF Union ...
*** agreements/pqrpdf ***
... temp data ...
*** agreements/xyz.pdf ***
.................. XYZ
        Union .............
*** other/dummy_empty ***
Here's the script to process that data:
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use File::Spec;
use List::Util 'first';

{
    my $union_file = 'union.txt';
    my @dirs = qw{agreements other};

    my $unions = get_unions($union_file);

    print "Unions to check:\n";
    print "\t$_\n" for @$unions;

    process_files($_, $unions) for @dirs;
}

sub get_unions {
    my ($union_file) = @_;

    open my $fh, '<', $union_file;

    my @unions;

    while (<$fh>) {
        chomp;
        y/ / /s;
        push @unions, $_;
    }

    return [ sort { length $a <=> length $b } @unions ];
}

sub process_files {
    my ($dir, $unions) = @_;

    print "Prcessing directory: $dir\n";

    opendir(my $dh, $dir);

    for (grep /\.(?:txt|pdf)\z/, readdir $dh) {
        my $path = File::Spec::->catfile($dir, $_);
        print "\tProcessing path: $path\n";

        my $text = do { open my $fh, '<:crlf', $path; local $/; <$fh> };
        $text =~ y/ \n/ /s;

        my $found = first { -1 < index $text, $_ } @$unions;

        if (defined $found) {
            print "\t\tMATCH: $found\n";
        }
        else {
            print "\t\tNo matches found.\n";
        }
    }

    return;
}
Here's the output:
Unions to check:
        ABC Union
        XYZ Union
Prcessing directory: agreements
        Processing path: agreements/abc.txt
                MATCH: ABC Union
        Processing path: agreements/abc_xyz.pdf
                MATCH: ABC Union
        Processing path: agreements/def.txt
                No matches found.
        Processing path: agreements/xyz.pdf
                MATCH: XYZ Union
Prcessing directory: other
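The match for agreements/xyz.pdf only works because the script squeezes runs of spaces and newlines down to a single space before searching. Here's a minimal sketch of just that step. The sub name normalise_whitespace and the sample string are made up for illustration; the transliteration is the same one used in the script.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of spaces and newlines
# into a single space, using the same y/// as the script.
sub normalise_whitespace {
    my ($string) = @_;
    $string =~ y/ \n/ /s;
    return $string;
}

# Hypothetical sample: a name split over two lines with extra spaces.
my $raw  = "...... XYZ   \n     Union ......";
my $flat = normalise_whitespace($raw);

print "before: ", (index($raw,  'XYZ Union') > -1 ? 'match' : 'no match'), "\n";
print "after:  ", (index($flat, 'XYZ Union') > -1 ? 'match' : 'no match'), "\n";
```

The raw string doesn't contain the literal 'XYZ Union'; the normalised one does.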
Take whatever ideas, or actual code, you want from that. I'd recommend you run benchmarks (the core Benchmark module will do) to see what improvements you're making; the results would probably also be useful for the person who told you "... my codes aren't efficient ...".
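As a starting point for that, here's a rough sketch of how a comparison with the core Benchmark module might look. It pits the index-based search from my script against a single precompiled regex alternation. The sample text, the sub names (find_with_index, find_with_regex) and the choice of a regex as the second candidate are assumptions for illustration only; substitute your real data and your original code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Benchmark 'cmpthese';
use List::Util 'first';

# Hypothetical sample data: not taken from the real files.
my @unions = ('ABC Union', 'XYZ Union');
my $text   = ('... filler text ... ' x 500) . 'XYZ Union ...';

# Candidate 1: plain substring search, as in the script above.
sub find_with_index {
    my ($text, $unions) = @_;
    return first { -1 < index $text, $_ } @$unions;
}

# Candidate 2: one precompiled alternation; quotemeta makes
# each name match literally.
my $alternation = join '|', map { quotemeta($_) } @unions;
my $union_re    = qr/($alternation)/;

sub find_with_regex {
    my ($text, $re) = @_;
    return $text =~ $re ? $1 : undef;
}

# A negative count means "run each candidate for at least
# that many CPU seconds".
cmpthese(-1, {
    index => sub { find_with_index($text, \@unions) },
    regex => sub { find_with_regex($text, $union_re) },
});
```

Note that the two candidates aren't strictly equivalent: the index version returns the first name in list order that occurs anywhere in the text, while the regex version returns whichever name occurs leftmost in the text. Which one comes out faster will depend on your data, so measure with realistic files.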
— Ken