wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks, I'm trying to write a loop that reads all text (.txt) files in a directory, excluding any sub-directories. Simple enough; I think I have the pattern match, but I don't know enough to incorporate it into the IF or NEXT UNLESS condition. Once the files are read, I'm hoping to strip HTML tags with HTML::Strip. I am grateful for any insight you may have. Here is the code I am working with:

#! /usr/bin/perl -w
use strict;
use warnings;
use lib "c:/strawberry/perl/site/lib";
use HTML::Strip;

my $hs = HTML::Strip->new();
my $write_dir = 'G:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';
my $files_dir = 'C:\Dwimperl\Perl\1993';

opendir (my $dir_handle, $files_dir) || die "failed to open '$files_dir' <$!>";
while (my $file = readdir($dir_handle) ) {
    next if $file eq '.' or $file eq '..';# or $file =~ /[0-9|-]+\.txt$/;
    open my $file_handle, "/dwimperl/perl/1993/$file" or die "failed to open '$file' <$!>";
    foreach my $line (<$file>) {
        my $clean_text = $hs->parse( ' ' );
        print $write_dir "$file\n";
        $hs->eof;
    }
}
close();
closedir $dir_handle;

Replies are listed 'Best First'.
Re: Read files not subdirectories
by BrowserUk (Patriarch) on Jan 29, 2015 at 21:36 UTC

    To get a list of all the .txt files in the current directory, use:

    my @files = <*.txt>;

    To process every line of those files one at a time, use:

    {
        local @ARGV = <*.txt>;
        while( <> ) {
            # $_ contains the lines of the files one at a time; for each file in turn
        }
    }

    Update: To display file(line no): line for every .txt file in the current directory, use:

    {
        local @ARGV = <*.txt>;
        while( <> ) {
            chomp;
            print $ARGV, '(', $., '):', $_, "\n";
            close ARGV if eof;    # reset $. so line numbers restart for each file
        }
    }

      Wow, I never realized all of this at a glance! This is DynamitePerl (maybe it deserves a new section.. ;=) ).
      Anyway, with the excuse of explaining your code to the OP, I'll try to describe it further, in the hope of retaining it in memory..
      my @files = <*.txt>;
      This is almost easy, but not completely. The clue is in a big section of perlop concerning I/O Operators,
      where it discusses the diamond operator <>:

      If what's within the angle brackets is neither a filehandle nor a simple scalar variable containing a filehandle name, typeglob, or typeglob reference, it is interpreted as a filename pattern to be globbed, and either a list of filenames or the next filename in the list is returned, depending on context.

      So, just as not all that glitters is gold, not every diamond contains a filehandle..
      The keyword in the sentence above turns out to be glob, and its own documentation states:
      This is the internal function implementing the <*.c> operator, but you can use it directly. If EXPR is omitted, $_ is used.

      This leads to the second snippet:
      # To process every line of those files one at a time, use:
      {
          local @ARGV = <*.txt>;
          while( <> ) {
              # $_ contains the lines of the files one at a time; for each file in turn
              # To display file(line no): line for every .txt file in the current directory use:
              chomp;
              print $ARGV, '(', $., '):', $_, "\n";
          }
      }
      The docs state:
      The null filehandle <> is special: it can be used to emulate the behavior of sed and awk, and any other Unix filter program that takes a list of filenames, doing the same to each line of input from all of them.

      ..

      Here's how it works: the first time <> is evaluated, the @ARGV array is checked,..

      ..read also the missing part..

      You can modify @ARGV before the first <> as long as the array ends up containing the list of filenames you really want.


      So the first diamond is used on a (localized!) @ARGV to fill it with a list of files created via glob, and then the second diamond is that special one.
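For what it's worth, here is a minimal sketch of the same idiom pointed at a directory other than the current one. The scratch directory and file names are made up for illustration. The File::Glob ':bsd_glob' import is my own addition, so that a path containing spaces (like the OP's) is not split into several patterns, and grep -f drops anything that is not a plain file.

```perl
use strict;
use warnings;
use File::Glob ':bsd_glob';    # assumption: Perl 5.16+, keeps spaces in the pattern intact
use File::Temp qw(tempdir);

# Hypothetical scratch directory: two .txt files plus a sub-directory
# whose name also ends in .txt, to show why the -f test matters.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.txt b.txt)) {
    open my $out, '>', "$dir/$name" or die "cannot create '$name' <$!>";
    print {$out} "first line of $name\nsecond line of $name\n";
    close $out or die "cannot close '$name' <$!>";
}
mkdir "$dir/sub.txt" or die "cannot create sub-directory <$!>";

my @seen;
{
    # First diamond replaced by an explicit glob(); keep plain files only.
    local @ARGV = grep { -f $_ } glob("$dir/*.txt");

    # Second diamond: the null filehandle walks through every file in @ARGV.
    while ( my $line = <> ) {
        chomp $line;
        push @seen, "$ARGV($.): $line";
        close ARGV if eof;    # restart $. for the next file
    }
}
# Outside the block, @ARGV is back to whatever it was before.
print "$_\n" for @seen;
```

Because @ARGV is localized, any real command-line arguments are untouched once the block ends.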

      HtH
      L*
      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thanks BrowserUk! Stupid question: your suggestions replace my WHILE NEXT statements, right? If so, then how do your suggestions affect my OPEN and FOREACH statements? Sorry for the low level questions.
Re: Read files not subdirectories
by Laurent_R (Canon) on Jan 29, 2015 at 22:54 UTC
    To discard directories from your file list and keep only plain files:
    next unless -f $file;
    Please also note that for or foreach is not the best way to read the lines of a file, because it reads the whole file into memory before processing starts. If the file is very big, the program might crash for lack of memory. It is also likely to slow things down slightly, although I am not sure the difference will be big or even noticeable. The better way to read a text file is a while loop, for example:
    while (my $line = <$file_handle>) {
        # ...
    }
    With the while statement, you are reading the file line by line, so that you only have one line in memory at any given time.

    Je suis Charlie.
      That's perfect!! Also, thanks for your advice concerning the use of WHILE in lieu of FOREACH! :-))

      Could I please ask another question? After using "next unless -f $file" the program runs, but fails to execute anything thereafter. As a test, I inserted a simple PRINT statement immediately after the "next unless" statement, and received nothing. If I uncomment the "next if" statement, and omit the "next unless", then the simple PRINT statement works, but the program crashes trying to execute the write statement. In sum, it seems that the "next unless" filters out every entry. Does that make any sense?

      #! /usr/bin/perl -w
      use strict;
      use warnings;
      use lib "c:/strawberry/perl/site/lib";
      use HTML::Strip;

      my $hs = HTML::Strip->new();
      my $write_dir = 'G:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';
      my $files_dir = 'C:\Dwimperl\Perl\1993';

      opendir (my $dir_handle, $files_dir) || die "failed to open '$files_dir' <$!>";
      while (my $file = readdir($dir_handle) ) {
          next unless -f $file;
          #next if $file eq '.' or $file eq '..';
          open my $file_handle, "/dwimperl/perl/1993/$file" or die "failed to open '$file' <$!>";
          while (my $line = <$file>) {
              my $clean_text = $hs->parse( ' ' );
              print $write_dir "$file\n";
              $hs->eof;
          }
      }
      close();
      closedir $dir_handle;

        Consult a beginner-level Perl book ("Beginner Perl", for example) to understand the difference between a file and a file handle, and how print, in its various forms, uses the currently selected file handle.

        ...
        my $write_dir = 'G:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';
        ...
        opendir (my $dir_handle, $files_dir) || die "failed to open '$files_dir' <$!>";
        while (my $file = readdir($dir_handle) ) {
            ...
            open my $file_handle, "/dwimperl/perl/1993/$file" or die "failed to open '$file' <$!>";
            while (my $line = <$file>) {

        Actually use the file handle, not a file path, to read a line.

        ... print $write_dir "$file\n"; ...

        The directory path is a string, not a file handle. With no open file handle of that name, print will fail. To write to a file, open it in write mode and use the print FILEHANDLE LIST syntax; see print.
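A minimal sketch of what that looks like, assuming a scratch directory in place of the real $write_dir and a made-up file name. The output path is built from the directory plus the file name, opened with '>' for writing, and the resulting handle (not the directory string) is what print receives.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-ins for the OP's $write_dir and current $file.
my $write_dir = tempdir( CLEANUP => 1 );
my $file      = 'example.txt';

# A directory string alone is not a handle; build a full output path first.
my $out_path = File::Spec->catfile( $write_dir, $file );

# '>' opens for writing, creating the file or truncating an existing one.
open my $out_handle, '>', $out_path
    or die "failed to open '$out_path' for writing <$!>";

# print FILEHANDLE LIST: the braces make it obvious which part is the handle.
print {$out_handle} "cleaned text would go here\n";

close $out_handle or die "failed to close '$out_path' <$!>";
```

The braces around the handle are optional here, but they prevent it from being misread as the first item of the list.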

        To copy or move files, see File::Copy.

        Thank you! I apologize for the inconvenience.

        On many systems, doing something to a file ... even, just opening it ... can interfere with a directory-scan, causing it to end prematurely, to list the same file more than once, and so on.   (And this would be true no matter what high-level language e.g. Perl was being used to do it.)

        Therefore, I suggest that you first retrieve the entire list of files into an in-memory list ... which you can very easily do in Perl just by using the list context.   Then, iterate through the in-memory list that you have just retrieved, checking to see if they are or aren’t directories and so-on.   Start and finish the task of retrieving the list, for any given directory that you are now “in” ... then process the list.
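One way this could look, as a sketch only: readdir is called in list context so the whole listing is in memory before any file is touched, and the directory handle is closed straight away. The scratch directory and its contents are invented for the example.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-in for $files_dir, seeded with a few entries.
my $files_dir = tempdir( CLEANUP => 1 );
for my $name (qw(one.txt two.txt notes.log)) {
    my $path = File::Spec->catfile( $files_dir, $name );
    open my $seed, '>', $path or die "cannot create '$path' <$!>";
    print {$seed} "placeholder\n";
    close $seed or die "cannot close '$path' <$!>";
}
mkdir File::Spec->catdir( $files_dir, 'subdir' ) or die "mkdir failed <$!>";

# Step 1: retrieve the entire listing in one go, then finish with the handle.
opendir my $dir_handle, $files_dir or die "failed to open '$files_dir' <$!>";
my @entries = readdir $dir_handle;    # list context: every entry, including . and ..
closedir $dir_handle;

# Step 2: work on the in-memory list. The -f test needs the full path,
# which also discards '.', '..' and any sub-directory.
my @txt_files = grep {
    /\.txt\z/i && -f File::Spec->catfile( $files_dir, $_ )
} sort @entries;

print "$_\n" for @txt_files;
```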

        Of course, “file finding” is such a common requirement that there are many CPAN modules like File::Find.   If you need to “take a walk through a directory tree,” there are plenty of tour-guides . . .
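Since the original question wants to skip sub-directories, here is a hedged sketch of File::Find held to the top level only. Setting $File::Find::prune stops the walk from descending into a directory. The directory layout is made up for the example.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical tree: top/a.txt and top/sub/b.txt.
my $top = tempdir( CLEANUP => 1 );
my $sub = File::Spec->catdir( $top, 'sub' );
mkdir $sub or die "mkdir failed <$!>";
for my $path ( File::Spec->catfile( $top, 'a.txt' ),
               File::Spec->catfile( $sub, 'b.txt' ) ) {
    open my $seed, '>', $path or die "cannot create '$path' <$!>";
    print {$seed} "placeholder\n";
    close $seed or die "cannot close '$path' <$!>";
}

my @found;
find(
    sub {
        # find() has chdir'ed into the current directory, so $_ is a bare name.
        # The starting directory itself shows up as '.'.
        if ( -d $_ && $_ ne '.' ) {
            $File::Find::prune = 1;    # do not descend into sub-directories
            return;
        }
        push @found, $File::Find::name if -f $_ && /\.txt\z/i;
    },
    $top
);

print "$_\n" for @found;
```

For a single flat directory, plain readdir or glob is simpler; File::Find earns its keep once sub-directories are wanted as well.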

      Could I ask another question, please? The code below runs, but fails to write/save the HTML-stripped text files. With a simple print statement, I've determined that the "second" WHILE statement must return FALSE, as the program never makes it this far. I am grateful for any insight!

      #! /usr/bin/perl -w
      use strict;
      use warnings;
      use lib "c:/strawberry/perl/site/lib";
      use HTML::Strip;

      my $hs = HTML::Strip->new();

      #Where I will store the end results;
      my $write_dir = 'G:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';

      #Where the files with the HTML tags are located;
      my $files_dir = 'C:\Dwimperl\Perl\1993';

      #Open the directory where the target files with HTML tags are located;
      #Why am I doing this? Stores file names in a directory handle?
      opendir (my $dir_handle, $files_dir) || die "failed to open '$files_dir' <$!>";

      #Loop through each entry/file in the directory;
      #What is readdir doing here? It's not really reading anything;
      #Is it simply advancing us to the next entry?;
      #Seems like the real READ occurs via the OPEN statement below;
      while (my $file = readdir($dir_handle) ) {
          next unless -f $file;
          #next if $file eq '.' or $file eq '..';

          #Open the current file so I can strip the HTML tags ??? ;
          open my $file_handle, '<', $file or die "failed to open '$file' <$!>";

          #Read the current file one line at a time??;
          while (my $line = <$file_handle>) {
              ########The WHILE statement above must return FALSE cuz the program never makes it here;
              #Strip the HTML tags??;
              my $clean_text = $hs->parse( ' ' );
              #Save the clean (no HTML tags) text file in a new file/location??;
              print $write_dir "$file\n";
              $hs->eof;
          }
      }
      close();
      closedir $dir_handle;

        Is your script located in the same folder as the HTML files? If not, prepend the directory to get the full path, like this:

        #!perl
        use strict;
        use warnings;

        my $files_dir = 'C:\Dwimperl\Perl\1993';
        opendir (my $dir_handle, $files_dir);
        while (my $filename = readdir($dir_handle)){
            next unless -f $files_dir.'/'.$filename;
            print "$filename\n";
        }
        poj
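Pulling the thread's advice together, a rough end-to-end sketch might look like the following. Everything here is an assumption made for illustration: the scratch directories stand in for $files_dir and $write_dir, and strip_tags is a crude regex stand-in used only so the sketch runs with core modules. It is not a real HTML parser; in the real script that one call would be replaced by HTML::Strip's parse and eof, as in the original post.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical, crude stand-in for $hs->parse(...); NOT a real HTML parser.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Hypothetical scratch directories standing in for $files_dir and $write_dir.
my $files_dir = tempdir( CLEANUP => 1 );
my $write_dir = tempdir( CLEANUP => 1 );
mkdir File::Spec->catdir( $files_dir, 'subdir' ) or die "mkdir failed <$!>";
open my $seed, '>', File::Spec->catfile( $files_dir, 'sample.txt' )
    or die "cannot create sample file <$!>";
print {$seed} "<p>Hello <b>world</b></p>\n";
close $seed or die "cannot close sample file <$!>";

# Read the whole listing first, then close the directory handle.
opendir my $dir_handle, $files_dir or die "failed to open '$files_dir' <$!>";
my @names = readdir $dir_handle;
closedir $dir_handle;

for my $name (@names) {
    # Test the full path, not the bare name, and keep plain .txt files only.
    my $in_path = File::Spec->catfile( $files_dir, $name );
    next unless -f $in_path && $name =~ /\.txt\z/i;

    my $out_path = File::Spec->catfile( $write_dir, $name );
    open my $in,  '<', $in_path  or die "failed to open '$in_path' <$!>";
    open my $out, '>', $out_path or die "failed to open '$out_path' <$!>";

    # Read through the input handle, write through the output handle.
    while ( my $line = <$in> ) {
        print {$out} strip_tags($line);
    }

    close $in;
    close $out or die "failed to close '$out_path' <$!>";
}
```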