stray_tachyon has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl masters, I am trying to lookup the existence of a list of unions in a bunch of collective agreements. I was told my codes aren't efficient because they are going through an array one by one. I try to use these two lines

my ( $first_match ) = grep { $line =~ m/\Q$_\E/ } keys %InputData_Empl +oyersHash; print $search and last SEARCH if $first_match ne "";
(modified from https://www.perlmonks.org/?node_id=1183136) but they don't work. Thank you very much

Here's my working code:

read_in_data_union(); process_files ($FileFolder); sub read_in_data_union { open(my $fh, '<:crlf', $DataFile_Unions) or die "Could not open file '$DataFile_Unions' $!"; chomp(@InputData_Unions = <$fh>); } sub process_files { my $path = shift; opendir (DIR, $path) or die "Unable to open $path: $!"; my @files = grep { /.pdf.txt/ } readdir (DIR); closedir (DIR); @files = map { $path . '/' . $_ } @files; foreach my $file (@files) { open (my $txtfile, $file) or die "error opening $file\n"; print $FileHandle "$file"; LookForUnion: { print $FileHandle "\t"; while (my $line=<$txtfile>) { foreach (@InputData_Unions) { if ($line =~ /\Q$_/i) { print $FileHandle "$_"; last LookForUnion; } } } } } }
My union.txt file sample:
Sarnia Employees' Bargaining Association Sarnia Municipal Administrative Employees' Assn. Sarnia Police Association Sarnia Professional Fire Fighters'
and the collective agreements are just converted text files from PDFs Sample:
McMaster University (The "Employer") and Service Employees International Union, Local 2 BGPWU ("SEIU") Representing Hospitality Services Staff ________________________________________ COLLECTIVE AGREEMENT

Replies are listed 'Best First'.
Re: How to efficiently search for list of strings in multiple files?
by kcott (Archbishop) on Aug 18, 2018 at 09:00 UTC

    G'day stray_tachyon,

    There's a number of things you can do to improve this code. Here's a list of some of the things I found.

    • Always use strict and warnings. See "perlintro -- Perl introduction for beginners" for further discussion.
    • Use lexical variables, not package variables. Use them in the smallest possible scope. They're faster and you won't suffer from any of the classic problems associated with global variables.
    • Order your list of unions. If you have some domain-specific knowledge regarding which are more likely to be found, put them first; if not, put the shortest strings first because it's quicker to match a short string than a longer one.
    • Canonicalise your data. It's quite possible there's extraneous whitespace in the union names; long names may be spread over two lines in the agreements; someone may have failed to capitalise part of a name. Whatever you do to the agreement text, do the same to the list of unions.
    • Avoid regexes wherever possible. Perl's string handling functions, such as index, are typically, measurably faster than a regex.
    • If you're only looking for the first match, don't use grep; List::Util's first function is likely a much better choice.
    • Given you're dealing with .txt and .pdf files, these are probably relatively small and would easily fit into memory. Slurp the entire text of each file into a single string and perform your matches on that. (See the special variable $/.) If any names are spread over multiple lines, attempting to match on a line-by-line basis will be pointless.
    • Check your work as you go. Use print statements to check that variables contain reasonable values. Your regex /.pdf.txt/ is quite clearly not doing what you want: a print statement showing that the array @files was empty would have told you that.

    I put together a short script to show how all of those points might be implemented. I also dummied up some highly contrived data just to give the script something to work on.

    Here's union.txt (note the second line has an extra space):

    $ cat union.txt ABC Union XYZ Union

    I then created a number of very short files in two directories. These have one, two or no matches; one has a name spread over two lines with a slew of extra whitespace; one file is completely empty.

    $ for i in agreements other; do for j in `ls $i`; do echo "*** $i/$j * +**"; cat $i/$j; done; done *** agreements/abc.txt *** ... ABC Union ... *** agreements/abc_xyz.pdf *** ... XYZ Union and ABC Union ... *** agreements/def.txt *** ... DEF Union ... *** agreements/pqrpdf *** ... temp data ... *** agreements/xyz.pdf *** .................. XYZ Union ............. *** other/dummy_empty ***

    Here's the script to process that data:

    #!/usr/bin/env perl use strict; use warnings; use autodie; use File::Spec; use List::Util 'first'; { my $union_file = 'union.txt'; my @dirs = qw{agreements other}; my $unions = get_unions($union_file); print "Unions to check:\n"; print "\t$_\n" for @$unions; process_files($_, $unions) for @dirs; } sub get_unions { my ($union_file) = @_; open my $fh, '<', $union_file; my @unions; while (<$fh>) { chomp; y/ / /s; push @unions, $_; } return [ sort { length $a <=> length $b } @unions ]; } sub process_files { my ($dir, $unions) = @_; print "Prcessing directory: $dir\n"; opendir(my $dh, $dir); for (grep /\.(?:txt|pdf)\z/, readdir $dh) { my $path = File::Spec::->catfile($dir, $_); print "\tProcessing path: $path\n"; my $text = do { open my $fh, '<:crlf', $path; local $/; <$fh> +}; $text =~ y/ \n/ /s; my $found = first { -1 < index $text, $_ } @$unions; if (defined $found) { print "\t\tMATCH: $found\n"; } else { print "\t\tNo matches found.\n"; } } return; }

    Here's the output:

    Unions to check: ABC Union XYZ Union Prcessing directory: agreements Processing path: agreements/abc.txt MATCH: ABC Union Processing path: agreements/abc_xyz.pdf MATCH: ABC Union Processing path: agreements/def.txt No matches found. Processing path: agreements/xyz.pdf MATCH: XYZ Union Prcessing directory: other

    Take whatever ideas, or actual code, you want from that. I'd recommend you run Benchmarks to see what improvements you're making: probably also useful for the person who told you "... my codes aren't efficient ...".

    — Ken

Re: How to efficiently search for list of strings in multiple files?
by Laurent_R (Canon) on Aug 17, 2018 at 21:18 UTC
    Your code is inefficient because of the nested loops here:
    while (my $line=<$txtfile>) { foreach (@InputData_Unions)
    But it is not clear whether you can easily get rid of these nested loops.

    One typical way to get rid of nested loops is to store keywords in a hash and have a hash lookup instead of a sequential list search, but that doesn't quite work here (at least not easily) with your Unions list containing several words.

    Depending on how large your list of unions is, you could possibly build a regex with alternates for each union, something like:

    if ($line =~/Sarnia Police Association|Sarnia Professional Fire Fighte +rs|.../) { # ...
    With the data you've shown, I would start by excluding any line that doesn't contain the word "Sarnia":
    while (my $line = <$txtfile>) { next unless $line =~/Sarnia/; foreach (@InputData_Unions) { # ...
    This would probably make your program very significantly faster, but it may be that looking for "Sarnia" is not really applicable to your real case.

    My point, though, is that the more you know about your data, the more you're likely to find some improvements or shortcuts.

Re: How to efficiently search for list of strings in multiple files?
by clueless newbie (Curate) on Aug 17, 2018 at 22:28 UTC

    Change the number of regexes that you need to one via Regexp::Assemble reducing the inner loop from many to 1?

    use Regexp::Assemble; my $ra = Regexp::Assemble->new; $ra->add( q{Sarnia Employees' Bargaining Association} ); $ra->add( q{Sarnia Municipal Administrative Employees' Assn} ); $ra->add( q{Sarnia Police Association} ); $ra->add( q{Sarnia Professional Fire Fighters} ); print $ra->re;

    which yields a single regex:

    (?^:Sarnia (?:(?:Municipal Administrative Employees' Ass|Employees' Bargaining Associatio)n|P(?:rofessional Fire Fighters|olice Association)))

    which you could always feed to ack.

    Changed to q{} per AnomalousMonk's post. Thanks AnomalousMonk!

      $ra->add( 'Sarnia Employees' Bargaining Association' );

      It's hard to see how single-quoted strings containing unescaped single-quotes could compile. Could you please fix your example code?


      Give a man a fish:  <%-{-{-{-<

Re: How to efficiently search for list of strings in multiple files?
by thanos1983 (Parson) on Aug 17, 2018 at 16:57 UTC
      I don't think the example in "How to Recursively search down thru directories to find a string and display its location." is more efficient in my case. The example take a single word and go through each file line by line to look for that word, move to the next file when the word is found. It is the same as what I'm doing, except I go through a long list of words

        Hello again stray_tachyon,

        As I said use File::Find::Rule and convert it to your way of fitting. I also prefer to use the module IO::All that will help you to load the file in an array so you can grep all the words in one go. Why I think this approach is better? You check only specific directories for specific files. You do not need to know the name of the file just the extension. So in conclusion for me filtering out the files that you need to open is more efficient.

        From my point of view this is more efficient, maybe another Monk has a better idea. But at the end why to relay on people opinions when you simply can use Benchmark::Forking if you are on LinuxOS or WindosOS Benchmark.

        Sample of code:

        #!/usr/bin/perl use strict; use IO::All; use warnings; use Data::Dumper; use File::Find::Rule; my @LogDirs = ('/home/user/PerlMonks/', '/tmp/Monks'); # add more here my $level = shift // 3; # level to dig into my @files = File::Find::Rule->file() ->name( '*.txt', '*.log' ) #can insert regex too ->maxdepth($level) ->in(@LogDirs); my %hash; foreach my $file (@files) { my @lines = io($file)->chomp->slurp; my $matches = grep {/Start/ or /middle/ or /end/} @lines; $hash{$file} = $matches; } print Dumper \%hash; __END__ $ perl test.pl $VAR1 = { '/home/user/PerlMonks/Foo/Bar/monks.log' => 1, '/home/user/PerlMonks/sample.txt' => 1, '/tmp/Monks/SampleDir/monks.log' => 2 };

        Sample of data in files:

        $ cat /tmp/Monks/SampleDir/monks.log Start line key line. Ignore this line. Another line key end. $ cat /home/user/PerlMonks/Foo/Bar/monks.log This is line 1. This key line end. $ cat /home/user/PerlMonks/sample.txt This is another key line middle point. I do not care about this.

        Hope this helps, BR.

        Seeking for Perl wisdom...on the process of learning...not there...yet!