G'day stray_tachyon,

There are a number of things you can do to improve this code. Here's a list of some of the things I found.

I put together a short script to show how all of those points might be implemented. I also dummied up some highly contrived data just to give the script something to work on.

Here's union.txt (note the second line has an extra space):

    $ cat union.txt
    ABC Union
    XYZ  Union

I then created a number of very short files in two directories. These have one, two or no matches; one has a name spread over two lines with a slew of extra whitespace; one file is completely empty.

    $ for i in agreements other; do for j in `ls $i`; do echo "*** $i/$j ***"; cat $i/$j; done; done
    *** agreements/abc.txt ***
    ... ABC Union ...
    *** agreements/abc_xyz.pdf ***
    ... XYZ Union and ABC Union ...
    *** agreements/def.txt ***
    ... DEF Union ...
    *** agreements/pqrpdf ***
    ... temp data ...
    *** agreements/xyz.pdf ***
    .................. XYZ
            Union .............
    *** other/dummy_empty ***

Here's the script to process that data:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;    # open/opendir die on failure; no manual checks needed

    use File::Spec;
    use List::Util 'first';

    {
        my $union_file = 'union.txt';
        my @dirs = qw{agreements other};

        my $unions = get_unions($union_file);

        print "Unions to check:\n";
        print "\t$_\n" for @$unions;

        process_files($_, $unions) for @dirs;
    }

    sub get_unions {
        my ($union_file) = @_;

        open my $fh, '<', $union_file;

        my @unions;

        while (<$fh>) {
            chomp;
            y/ / /s;    # squeeze runs of spaces to a single space
            push @unions, $_;
        }

        # Shortest names first
        return [ sort { length $a <=> length $b } @unions ];
    }

    sub process_files {
        my ($dir, $unions) = @_;

        print "Prcessing directory: $dir\n";

        opendir(my $dh, $dir);

        # Only look at files with a .txt or .pdf extension
        for (grep /\.(?:txt|pdf)\z/, readdir $dh) {
            my $path = File::Spec::->catfile($dir, $_);
            print "\tProcessing path: $path\n";

            # Slurp the whole file in one read
            my $text = do { open my $fh, '<:crlf', $path; local $/; <$fh> };

            # Turn newlines into spaces and squeeze runs of whitespace,
            # so a name split across lines still matches
            $text =~ y/ \n/ /s;

            # Stop at the first union name found in the text
            my $found = first { -1 < index $text, $_ } @$unions;

            if (defined $found) {
                print "\t\tMATCH: $found\n";
            }
            else {
                print "\t\tNo matches found.\n";
            }
        }

        return;
    }

Here's the output:

    Unions to check:
        ABC Union
        XYZ Union
    Prcessing directory: agreements
        Processing path: agreements/abc.txt
            MATCH: ABC Union
        Processing path: agreements/abc_xyz.pdf
            MATCH: ABC Union
        Processing path: agreements/def.txt
            No matches found.
        Processing path: agreements/xyz.pdf
            MATCH: XYZ Union
    Prcessing directory: other

Take whatever ideas, or actual code, you want from that. I'd recommend you benchmark your changes (the core Benchmark module is handy for this) to see what improvements you're actually making; the results would probably also be useful for the person who told you "... my codes aren't efficient ...".
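If it helps, here's a rough sketch of what such a benchmark might look like. It uses `cmpthese` from the core Benchmark module to compare the `index`-based search from my script against a single precompiled regex alternation. The sample text, the sub names and the choice of a regex as the second candidate are all just my assumptions for illustration; swap in your own code and realistic data.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Benchmark 'cmpthese';
use List::Util 'first';

# Made-up test data (an assumption for this sketch, not your real files).
my @unions = ('ABC Union', 'XYZ Union');
my $text   = ('.' x 10_000) . ' XYZ Union ' . ('.' x 10_000);

# Candidate 1: literal substring search with index(), as in the script above.
sub find_with_index {
    return first { -1 < index $text, $_ } @unions;
}

# Candidate 2: one precompiled alternation of the quoted names.
my $re = do {
    my $alt = join '|', map { quotemeta $_ } @unions;
    qr/($alt)/;
};

sub find_with_regex {
    return $text =~ $re ? $1 : undef;
}

# Sanity check: make sure both candidates agree before timing them.
# (They can legitimately differ when a text holds several names: first()
# follows list order, the regex returns the leftmost match in the text.)
my ($by_index, $by_regex) = (find_with_index(), find_with_regex());
die "Candidates disagree\n" unless $by_index eq $by_regex;
print "Both candidates found: $by_index\n";

# A negative count means: run each candidate for at least that many CPU seconds.
cmpthese(-1, {
    index => \&find_with_index,
    regex => \&find_with_regex,
});
```

The timings will depend on your data and your Perl, so I wouldn't read anything into numbers from contrived input like this; the point is only to have a repeatable way of checking whether a change is really an improvement.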

— Ken


In reply to Re: How to efficiently search for list of strings in multiple files? by kcott
in thread How to efficiently search for list of strings in multiple files? by stray_tachyon
