ybiC has asked for the wisdom of the Perl Monks concerning the following question:

I threw together the following scriptlet to clean up some text files, but now I'd like to add 2 minor refinements.
  1. list the wah, blah, yada, and blank-line regexes separately, like a config section.   There aren't necessarily three; it varies from file to file.
  2. read infile and outfile names from STDIN

No doubt this is elementary to the many "real programmers" among the PM brethren.   The PM Tutorials suggest that a hash may do #1, but I'm at a loss as to quite how to implement it.

Thanks in advance for any direction or code examples.
    cheers,
    ybiC

#!/usr/bin/perl -w
use strict;

my $infile  = "/dir/infilename";
my $outfile = "/dir/outfilename";

open (IN, "$infile")    or die "Error opening $infile: $!\n";
open (OUT, ">$outfile") or die "Error opening $outfile: $!\n";

while (<IN>) {
    s/wahwah//g;    # unwanted text
    s/blahblah//g;  # more unwanted text
    s/yadayada//g;  # still more unwanted text
    s/^\s+//g;      # blank lines
    print OUT $_ or die "Error writing to $outfile: $!\n";
}

close (IN)  or die "Error closing $infile: $!\n";
close (OUT) or die "Error closing $outfile: $!\n";
# END

Replies are listed 'Best First'.
Re: elementary hash
by ncw (Friar) on Sep 12, 2000 at 01:13 UTC
    You are treading dangerously close to 1 liner territory here! I shall resist the temptation, though I predict others won't ;-)

    Here is how I would do it :-

    #!/usr/bin/perl -w -p -i.bak
    use strict;

    my $kill = join "|", qw{ wahwah blahblah yadayada };

    while (<>) {
        s/$kill//go;   # remove unwanted strings
        s/^\s+//g;     # blank lines
    }
    Pass the list of files you want changed on the command line - this will edit them in place (the -i flag), saving the originals with a .bak extension.
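
    For example, invoking it on a couple of logs would look something like this (the script and file names here are only illustrative):

        perl striplog.pl router.log switch.log
        # both files are edited in place; the originals are kept as
        # router.log.bak and switch.log.bak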

      The final working version I came up with was:
      #!/usr/local/bin/perl -w -p -i.bak
      use strict;

      my $kill = join "|", qw{ wahwah blahblah yadayada };
      s/$kill//go;   # remove unwanted strings
      s/^\s+//g;     # blank lines
      Without the -p option the working version was:
      #!/usr/local/bin/perl -w -i.bak
      use strict;

      my $kill = join "|", qw{ wahwah blahblah yadayada };
      while (<>) {
          s/$kill//go;   # remove unwanted strings
          s/^\s+//g;     # blank lines
          print;
      }
      I've not used the -p option before. It wraps the script in an implicit read loop and prints $_ at the end of each pass, while -n wraps it in the same loop without the print; -p overrides -n if both are present. 'perldoc perlrun' covers them.
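
      Roughly, the two switches expand like this (a sketch based on perlrun, with a placeholder for the script body):

          # perl -n script.pl files...   is roughly:
          while (<>) {
              # ... script body here ...
          }

          # perl -p script.pl files...   is roughly:
          while (<>) {
              # ... script body here ...
          } continue {
              print;
          }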

      --Chris

      e-mail jcwren
      Here's a fun one-liner:

      perl -pi.bak -e 's/(^\s+$)?|blahblah|wahwah|yadayada//g' <filename>

      If you run it twice, it'll squash blank lines. The regex could be more robust, but I hadn't seen any one-liner yet.

      Update: I still don't understand tilly's followup, but let me explain what I mean a little further. My one-liner strips out the words mentioned above, but retains the line-ending newlines. tilly's dual-regex below fixes that. If you run mine twice, it'll get rid of the newlines (as the first part of the regex squashes whitespace). It has the (unlikely) side effect of getting rid of a string that, in the original file, would resemble 'blahyadayadablah'.

      Not that you should be doing this with a one-liner.... :)

        Actually that doesn't match the original code. If the entire text of the original was one of the matches, then it should be squashed. Here is a one-liner that works and is shorter, but only because of what the strings are:
        perl -pi.bak -e 's/(blah|wah|yada)\1//g;s/^\s+$//' $file
      Hmmm... it responds with a "Use of uninitialized value" warning at the while (<>) { line.   The .bak file is created, but the original is left as a zero-byte file.

      Perl 5.005.03
          cheers,
          ybiC

        I dumped the -p switch, put an explicit print statement in, and that worked... but it seems like the -p _ought_ to work. I _think_ it doesn't, though, because the Camel book says -p puts an assumed loop around the script with a print in the equivalent of a continue block, so with the script's own while loop reading all the input, that print only runs once. Now, if you remove the while loop and just do this:
        #!/usr/bin/perl -p -i.bak
        my $kill = "what|ever";
        s/$kill//go;
        s/^\s+//g;
        I believe it'll work... it's weird, but it works.
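
        Here is roughly what the -p plus explicit while (<>) combination expands to (my own sketch of the perlrun expansion, not actual generated code):

            while (<>) {               # implicit loop added by -p
                my $kill = join "|", qw{ wahwah blahblah yadayada };
                while (<>) {           # the script's own loop drains all
                    s/$kill//go;       # remaining input, editing each line
                    s/^\s+//g;         # but never printing it
                }
            } continue {
                print;                 # runs once, with $_ undefined after
            }                          # the inner loop -- hence the warning
                                       # and the zero-byte output file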
Re: elementary hash
by adamsj (Hermit) on Sep 12, 2000 at 01:08 UTC
    For #2, do this:

    On the command line, put your input and output files. Then do:

    my $infile  = shift;
    my $outfile = shift;
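
    For example, using the paths from your script (the script name is only illustrative):

        perl cleanup.pl /dir/infilename /dir/outfilename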

    I'm a little at a loss as to what you want to do with the regexes--can you amplify a little bit?

      The regexes are stripping unwanted text from each line of the input file.   Basically, I'm cleaning up logfiles to make them easier to read at a glance.
          cheers,
          ybiC
        The answer ncw gave on how to join the patterns into a regex is just right--you have nice, simple patterns.

        Are you interested in something that applies one regex to one sort of log file and another regex to another sort? If so, you're nearly there. If you can parse the input filenames (or the output filenames, come to think of it) to get a unique string corresponding to each type of file you want to edit, then you could put the pattern into a hash keyed on the string you parsed out of the file names. If you don't have many types of files, that's a good solution.
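
        A minimal sketch of that idea (the file types, patterns, and filename convention are all invented for illustration):

            #!/usr/bin/perl -w
            use strict;

            my %kill_for = (
                router => 'wahwah|blahblah',
                switch => 'yadayada|blahblah',
            );

            my $infile  = shift or die "Usage: $0 infile outfile\n";
            my $outfile = shift or die "Usage: $0 infile outfile\n";

            # assume the type is the leading word of the basename,
            # e.g. "router-0912.log"
            my ($type) = $infile =~ m{([^/]+?)-[^/]*$};
            die "Don't know how to clean '$infile'\n"
                unless $type and exists $kill_for{$type};

            my $kill = $kill_for{$type};
            open (IN, "$infile")    or die "Error opening $infile: $!\n";
            open (OUT, ">$outfile") or die "Error opening $outfile: $!\n";
            while (<IN>) {
                s/$kill//go;   # strip this file type's unwanted strings
                s/^\s+//g;     # blank lines
                print OUT $_;
            }
            close (IN);
            close (OUT);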

        If you have a _lot_ of types of files, or if the regex is going to change a lot, I'd consider biting the bullet and writing a config file. Then, once you've parsed the file name, get the regex for that type of file from the config.
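
        A minimal sketch of that approach (the "$type.patterns" file name and its one-pattern-per-line format are just assumptions):

            # e.g. a file named router.patterns containing:
            #   wahwah
            #   blahblah
            #   yadayada
            open (CONF, "$type.patterns")
                or die "Error opening $type.patterns: $!\n";
            chomp(my @patterns = <CONF>);
            close (CONF);
            my $kill = join "|", @patterns;
            # ...then filter the log exactly as before, using $kill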