Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have about 190,000 HTML files. Each one contains some HTML code at the end that I want to get rid of. How can I do that? I tried to solve this problem myself, but none of the solutions I came up with worked. No warnings, no errors, just no effect. For example:
use warnings;
use strict;
use File::Slurp;
local $/ = undef;
for ($i=1; $i < 119000; $i++) {&s1;}
sub s1 {
    open (fl, "+<", "C\:\\convs\\conv$i\.txt");
    binmode fl;
    $string = <fl>;
    $string =~s/^.*Mina olen.{1561}//s;
    close(fl);
}
What have I done wrong?

Replies are listed 'Best First'.
Re: Removing a chunk of HTML?
by almut (Canon) on Mar 21, 2010 at 21:40 UTC

    You're not writing out the data.  Doing the substitution on $string that you read from the file won't modify the file (even when opened with "+<"), just the string.

    Unless you have a backup of the files elsewhere, it's probably safest to write the modified content ($string) to another file, and only delete the original one when there weren't any errors.

    (P.S.: you're loading File::Slurp but don't seem to be using it...   Also, if you really had use strict in your code, it would complain 'Global symbol "..." requires explicit package name...' for $i and $string)
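
    A minimal sketch of that advice, assuming the whole file fits in memory. The helper name strip_chunk and the ".new" output name are hypothetical, not from the original post. It reads the file into a string, applies the substitution, and prints the result to a second file, leaving the original untouched:

```perl
use strict;
use warnings;

# Hypothetical helper: read $in whole, remove the first match of $pattern,
# and write the result to $out. The original file is never modified.
# Returns 1 if the pattern matched, 0 otherwise.
sub strip_chunk {
    my ($in, $out, $pattern) = @_;

    open(my $ifh, '<', $in) or die "unable to open $in: $!";
    binmode $ifh;
    my $string = do { local $/; <$ifh> };    # slurp the whole file
    close($ifh) or die "unable to close $in: $!";
    $string = '' unless defined $string;     # empty file

    # This only changes $string, not the file.
    my $matched = ($string =~ s/$pattern//) ? 1 : 0;

    # The change reaches the disk only because we print it out here.
    open(my $ofh, '>', $out) or die "unable to open $out: $!";
    binmode $ofh;
    print {$ofh} $string or die "unable to write $out: $!";
    close($ofh) or die "unable to close $out: $!";

    return $matched;
}

# Usage (paths are assumptions based on the question):
# strip_chunk("C:/convs/conv1.txt", "C:/convs/conv1.txt.new",
#             qr/^.*Mina olen.{1561}/s);
```

    Pass the pattern as a qr// object with its own /s flag, since flags on the s/// inside the helper do not apply to an interpolated qr//. Only delete the original once the new file has been checked.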

Re: Removing a chunk of HTML?
by kiruthika.bkite (Scribe) on Mar 22, 2010 at 05:58 UTC
    Your code only modifies the string, so the changes never reach the file.
    If you want to modify the file contents, use the File::Inplace module.

    Example
    use strict;
    use warnings;
    use File::Inplace;

    my $editor = new File::Inplace(file=>"filename",suffix=>".bak");
    while ($_=$editor->next_line) {
        if(s/(.*)/"welcome"/) {
            $editor->replace_line($_);
        }
    }
    $editor->commit;

    If a suffix is given, the module creates a backup file with that extension.
Re: Removing a chunk of HTML?
by Anonymous Monk on Mar 22, 2010 at 09:44 UTC
    First of all, do a test run on a couple of files.

    Don't run it on all of them without testing your script.
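
    One way to do such a test run is a dry run that only reports which files the pattern would change and writes nothing. This is a sketch under assumptions: the helper name dry_run is hypothetical, and the path and pattern in the usage comment are taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: check each file against $pattern without writing
# anything. Returns a hash ref mapping file name => 1 (match) or 0 (no match).
sub dry_run {
    my ($pattern, @files) = @_;
    my %would_change;
    for my $file (@files) {
        open(my $fh, '<', $file) or die "unable to open $file: $!";
        binmode $fh;
        my $content = do { local $/; <$fh> };    # slurp the whole file
        close($fh);
        $content = '' unless defined $content;   # empty file
        $would_change{$file} = ($content =~ $pattern) ? 1 : 0;
    }
    return \%would_change;
}

# Usage: look at the first three files only before touching all of them.
# my @sample = grep { defined } (glob("C:/convs/conv*.txt"))[0 .. 2];
# my $report = dry_run(qr/^.*Mina olen.{1561}/s, @sample);
# print "$_: ", ($report->{$_} ? "match" : "NO match"), "\n"
#     for sort keys %$report;
```

    If a sample file reports no match, the pattern needs fixing before the real run.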
Re: Removing a chunk of HTML?
by Marshall (Canon) on Mar 23, 2010 at 03:05 UTC
    I wrote some code for you below. I didn't completely test it but it does illustrate some basic ideas.

    1. A loop like this, which is common in C, is seldom needed in Perl: for ($i=1; $i < 119000; $i++). We have a foreach(@xyz){} iterator that "visits" all elements of @xyz without having to know their number in advance.

    2. There are a number of ways to get the files within a directory that match a pattern. Below I show, in comments, the way needed with ActiveState Perl 5.6, but if you have, say, Perl 5.10, the glob() method below will work fine (there are at least 3 variants of glob that I know of).

    3. The safest way to modify a file is to make a temp file, do your thing, and then, if all works OK, delete the original file and replace it with the new file. There are even safer ways than I've shown here, but this is good for 99% of cases.

    use warnings;
    use strict;

    # This is one way to get the file names (needed with Perl 5.6):
    # my $source_dir = "C:/convs";
    # opendir (DIR, $source_dir) || die "unable to open $source_dir $!";
    # my @files = grep{m/conv\d+\.txt/}readdir DIR;
    # (readdir returns bare names, so prepend "$source_dir/" before opening)

    # I think here we can just use glob() if you are at Perl 5.10.
    # glob() returns the names with the directory already included.
    my $source_dir = "C:/convs";
    my @files = glob("$source_dir/conv*.txt");

    foreach my $file (@files)
    {
        open (IN, "<", $file) || die "unable to open $file $!";
        open (TEMP, ">", "$file.tmp") || die "unable to open $file.tmp $!";
        local $/ = undef;   # read the whole file at once, so the
                            # pattern can span several lines
        while (<IN>)
        {
            s/^.*Mina olen.{1561}//s;  #/s allows "." to match newline
            print TEMP $_;
        }
        close TEMP or die "$!";   #unlikely to fail (file is "open")
        close IN or die "$!";
        unlink ($file) || die "$!";
        rename ("$file.tmp", $file) || die "$!";
    }