NorthShore44 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks - I am trying to write a program that will read in a directory of txt files, search through each file sequentially for text matches, and subsequently output the text matches to other txt files. If the script does find matches, I would like it to copy the text to a filename that would match the file from which it came. So far, I've gotten the text search to work, and print matches to one file, not individual files. But I can't seem to get the loop to work. To do this, I've done a $File::Find::name of the directory and outputted the list of txt files into a text file, which I've then read as an array. Here's my code - if you can give me some help, it would be much appreciated. Thanks!
#! /opt/local/bin/perl -w

# Part 1: Write all of the filenames to Index.txt to be able to input names as an array #
open directory1, ">./index.txt" or die "Could not create file: $!\n";
use File::Find;
my @all_file_names;
find sub {
    return if -d;
    push @all_file_names, $File::Find::name;
}, '*my directory*';
for my $path ( @all_file_names ) {
    select directory1;
    print "$path\n";
}
print STDERR "contents of directory written \n";
close directory1;

##################################################
# Part 2: Open & Read Contract List              #
##################################################
my $filename = 'index.txt';
open my $fh, $filename or die "Couldn't read '$filename': $!";
#chomp
@ARGV = <$fh>;
print "Processing $_" for @ARGV;

##################################################
# Part 3: Read in the contracts and search for keywords #
##################################################
open Contract1a, ">./test.txt" or die "Could not create file: $!\n";
@contracts = @ARGV;
$linescount = 20; #set how many lines after the match you'd like to pull;
foreach (@contracts) {
    open Contract1, "<./$_ \n";
    @lines = <Contract1>;
    close Contract1;
    my $contents = join "", @lines;
    @all = undef;
    shift @all;
    while ($contents =~ m/Management Fees/i) {
        if ($contents =~ m/(\n.*Management Fees(.*\n){$linescount})/i) {
            push (@all, "$1\n");
            print STDERR "$1\n";
            $contents =~ s/Management Fees//i;
        }
    }
    select Contract1a;
    print join("\n",(@all));
    print "\n";
}
print STDERR "...Contract1 match complete.\n\n";
close Contract1a;

Replies are listed 'Best First'.
Re: Reading a Directory of Files, Searching for Text, Outputting Matches
by cdarke (Prior) on Oct 13, 2009 at 16:07 UTC
    I'm not sure I can provide a complete solution, since I have no idea what is coming in on the command line, and what the data is.

    You get the filenames into @all_file_names but all you do is to write those to index.txt, you do not appear to do anything else with the files - or maybe I misunderstand.
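One way to avoid the round trip through index.txt might be to filter inside the find callback and use the resulting array directly. This is only a minimal sketch, assuming that just regular files ending in .txt are wanted; collect_txt_files is a hypothetical helper name, not something from the original post.

```perl
use strict;
use warnings;
use File::Find;

# Collect the paths of all regular .txt files below $dir.
# (Hypothetical helper; the name and the .txt filter are assumptions.)
sub collect_txt_files {
    my ($dir) = @_;
    my @files;
    find(
        sub {
            # Inside the callback $_ is the basename and
            # $File::Find::name is the path including $dir.
            push @files, $File::Find::name if -f $_ && /\.txt\z/i;
        },
        $dir,
    );
    my @sorted = sort @files;
    return @sorted;
}
```

The returned list could then feed the search loop directly, for example `my @contracts = collect_txt_files($dir);`, so there are no trailing newlines to strip from the names.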

    However, you have some strange constructs, and tidying them up might make your code clearer.
    use strict;
    There are several variables that are used inside the main loop. Are these supposed to be globals? Declaring them with my will make their scope clearer (that might solve part of your problem).

    In a couple of places you have code like this:
    select directory1; print "$path\n";
    which is rather unnecessary. It would be simpler to:
    print directory1 "$path\n";
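The same idea can be taken one step further with a lexical filehandle, which also keeps STDOUT as the default output handle. A small sketch, assuming the list of paths is already in an array; write_index is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;

# Write one path per line to $index_path and return how many were written.
# (Hypothetical helper, not part of the original script.)
sub write_index {
    my ($index_path, @paths) = @_;
    open my $index, '>', $index_path
        or die "Could not create $index_path: $!";
    # Printing to an explicit handle means no select() is needed.
    print {$index} "$_\n" for @paths;
    close $index or die "Could not close $index_path: $!";
    return scalar @paths;
}
```

Because select is never called, later plain print statements still go to STDOUT rather than silently into the index file.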
    Be careful of your opens:
    open Contract1, "<./$_ \n"; # What's with the space and new-line?
    my @lines = <Contract1>;
    close Contract1;
    my $contents = join "", @lines;
    Perl will look in the current directory anyway (no need for ./), and you should always test the open. Could be written as:
    open Contract1, '<', $_ or die "Unable to open $_: $!";
    local $/ = undef; # slurp mode
    my $contents = <Contract1>; # Hope the files are small!
    close Contract1;
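Since local $/ stays in effect until the enclosing block ends, it may be worth keeping slurp mode inside a small sub so that line-by-line reads elsewhere are not affected. A minimal sketch under that assumption; slurp_file is a hypothetical helper name.

```perl
use strict;
use warnings;

# Read a whole file into one string.
# (Hypothetical helper; assumes the file fits comfortably in memory.)
sub slurp_file {
    my ($path) = @_;
    open my $in, '<', $path or die "Unable to open $path: $!";
    # Localized to this sub only; $/ is restored when the sub returns.
    local $/;
    my $contents = <$in>;
    close $in;
    return $contents;
}
```

A call such as `my $contents = slurp_file($name);` would replace the open, read and close lines in the loop.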
    You place an undef element onto the array, then shift it off:
    @all = undef; shift @all;
    Would be simpler if you did this:
    my @all;
    Which you would have done if you use strict;

    I'm not sure that you actually need to do a multi-line match and store everything, but then again I don't know what the data looks like.
Re: Reading a Directory of Files, Searching for Text, Outputting Matches
by gmargo (Hermit) on Oct 13, 2009 at 17:31 UTC

    I interpret your code as searching through these contract files for the "Management Fees" string, and then printing that line and the next 19 lines. If this is true then I think you're making it too difficult by slurping the entire file into a string, and then iteratively searching through that string, each time from the beginning.

    If my interpretation is correct, here is a bit of code that does the same thing but processes the files line-by-line. Also it has code, which I think you requested, creating different output files for each input file.

    foreach my $contract (@contracts) {
        open (Contract1, "<", $contract) || die("Cannot open input file $contract: $!");
        my @all;
        while (<Contract1>) {
            if (/Management Fees/i) {
                # Print this line
                push @all, $_;
                # And the next 19 lines too
                for (my $i=0; $i<($linescount-1); $i++) {
                    my $line = <Contract1>;
                    last if !defined $line;
                    push @all, $line;
                }
            }
        }
        close Contract1;

        # Base output filename on original filename.
        my $outfile = "$contract".".output";
        open (Contract1a, ">", $outfile) || die("Cannot open output file $outfile: $!");
        print Contract1a join("\n",(@all));
        close Contract1a;
    }
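A possible variation on the loop above, sketched with lexical filehandles and wrapped in a sub. It assumes two things that go beyond the code as posted: an output file should only be created when something matched (as the question seems to ask), and the collected lines are printed as-is because they still carry their own newlines. extract_matches and its parameters are hypothetical names.

```perl
use strict;
use warnings;

# Copy each matching line plus the lines that follow it to $out_path.
# Returns the number of lines written; when nothing matches, no output
# file is created. (Hypothetical helper, not the original author's code.)
sub extract_matches {
    my ($in_path, $out_path, $pattern, $linescount) = @_;
    open my $in, '<', $in_path
        or die "Cannot open input file $in_path: $!";
    my @all;
    while (my $line = <$in>) {
        next unless $line =~ $pattern;
        push @all, $line;
        # Pull up to $linescount - 1 further lines after the match.
        for (2 .. $linescount) {
            my $next = <$in>;
            last unless defined $next;
            push @all, $next;
        }
    }
    close $in;
    return 0 unless @all;

    open my $out, '>', $out_path
        or die "Cannot open output file $out_path: $!";
    print {$out} @all;    # lines keep their own newlines, so no join
    close $out or die "Cannot close $out_path: $!";
    return scalar @all;
}
```

It might be called as `extract_matches($_, "$_.output", qr/Management Fees/i, $linescount) for @contracts;`. If the names were read from index.txt they may still end in a newline, so a `chomp @contracts;` beforehand would probably be needed.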
Re: Reading a Directory of Files, Searching for Text, Outputting Matches
by planetscape (Chancellor) on Oct 13, 2009 at 22:35 UTC
Re: Reading a Directory of Files, Searching for Text, Outputting Matches
by stonecolddevin (Parson) on Oct 13, 2009 at 20:58 UTC

    (planetscape is going to kill me for improperly formatting this, but I'm too lazy to turn off TinyMCE)

    Why don't you check out KinoSearch? You'll have a full index of words to search, and it'll allow you to focus on the other business logic, like writing the results you want out to wherever you choose.

    mtfnpy

Re: Reading a Directory of Files, Searching for Text, Outputting Matches
by afoken (Chancellor) on Oct 13, 2009 at 20:55 UTC

    It seems you are searching for a combination of find, xargs, grep and simple output redirection. No perl needed:

    find /some/where/in/the/wild -name '*.txt' -print0 | xargs -0 grep -A3 'needle_in_the_haystack' > result.txt

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)