JohnMcPherson has asked for the wisdom of the Perl Monks concerning the following question:

What I need is a Perl script that will, first of all, open each text file in a particular folder and search it for a particular string. However, the text files contain some HTML markup, so before searching, the script needs to strip out as much of that markup as possible. Finally, I need the script to collect the names of all text files that contain the string into an array, with the names stored without any directory path or extension.

Replies are listed 'Best First'.
Re: Searching text files with strings
by IlyaM (Parson) on Nov 29, 2001 at 09:50 UTC
    To find all files in a particular folder, use File::Find.

    To strip the HTML markup and check whether a file contains the substring, you can use HTML::TokeParser. Untested code ($SUBSTR is the required string):

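    use HTML::TokeParser;

    foreach my $file (@files) {
        my $p = HTML::TokeParser->new($file) or die "Can't open $file: $!";
        my $match = 0;
        while (my $token = $p->get_token) {
            # only text tokens ('T') carry content worth searching
            next unless $token->[0] eq 'T';
            # index() returns -1 (which is true) when the substring is absent
            if (index($token->[1], $SUBSTR) >= 0) {
                $match = 1;
                last;
            }
        }
        print $file, "\n" if $match;
    }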

    To get filenames without subdirectories or extensions use File::Basename.
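
    A minimal sketch of how File::Find and File::Basename might fit together for this. The helper names list_files and bare_name are made up for illustration and are not part of either module; the suffix pattern strips only the last extension, which is an assumption about what "without extensions" should mean here.

```perl
use strict;
use warnings;
use File::Find;
use File::Basename;

# Hypothetical helper: collect every plain file under $dir, recursively.
sub list_files {
    my ($dir) = @_;
    my @files;
    # Inside the callback, $_ is the bare file name and
    # $File::Find::name is the full path.
    find(sub { push @files, $File::Find::name if -f $_ }, $dir);
    my @sorted = sort @files;
    return @sorted;
}

# Hypothetical helper: drop the directory part and the last extension.
sub bare_name {
    my ($path) = @_;
    # fileparse returns (basename, directory, suffix); keep the basename.
    my ($base) = fileparse($path, qr/\.[^.]*/);
    return $base;
}
```

    With these, something like map { bare_name($_) } list_files($dirname) would give the bare names of every file under $dirname (a hypothetical variable holding the folder path).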

Re: Searching text files with strings
by Fastolfe (Vicar) on Nov 30, 2001 at 04:15 UTC

    What exactly is your question? You've dumped a bunch of requirements on us, but you haven't asked us anything.

    No offense, but we aren't here to crank out Perl code matching your requirements. We're here to share knowledge and answer questions. If you want someone to code for you, please provide a credit card number.

Re: Searching text files with strings
by r3b3lxd (Initiate) on Nov 30, 2001 at 01:41 UTC
    Try the following:
    use File::Basename;

    {
        local $/;    # slurp mode: read each file in one go
        opendir(DIR, $dirname) or die "Can't open directory $dirname: $!";
        while (defined($file = readdir(DIR))) {
            # skips ".", ".." and any subdirectories
            next unless -f "$dirname/$file";
            ($base, $dir, $ext) = fileparse("$dirname/$file", '\..*');
            open(FILE, "$dirname/$file") or die "Couldn't open $file: $!";
            $plain_text = <FILE>;
            close(FILE);
            $plain_text =~ s/<[^>]*>//gs;
            # \Q...\E makes the search string match literally
            if ($plain_text =~ /\Q$search_string\E/) {
                push(@matched_files, $base);
            }
        }
        closedir(DIR);
    }
    # @matched_files contains all files that had $search_string in them
    Set $dirname to the path of the directory you want to search and $search_string to the string you want to search for. Note: this script will not strip HTML properly when a tag contains a ">" inside a quoted attribute value. For example, the tag:
    <img src="img.gif" alt=" Look at this >>>> ">
    would only be stripped up to the first ">", leaving the rest of the tag in the text. If that isn't going to be a problem, the solution above should work fine. Good luck! Rob
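
    One possible way around the ">" problem without extra modules is a pattern that treats quoted attribute values as single units. This is only a sketch: strip_tags is a made-up name, and the pattern still does not understand comments, script blocks, or unbalanced quotes, so a real parser such as HTML::TokeParser remains the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: remove tags, allowing ">" inside quoted attributes.
sub strip_tags {
    my ($text) = @_;
    # Inside a tag, accept either a character that is not ">" or a quote,
    # or a complete double-quoted or single-quoted string.
    $text =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//gs;
    return $text;
}
```

    Applied to the img tag above, this removes the whole tag rather than stopping at the first ">".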