hmbscully has asked for the wisdom of the Perl Monks concerning the following question:

I needed to put a line of code into a bunch of html pages. I wanted to see exactly what files got changed, so I wrote this script:
#!/usr/bin/perl
use warnings;
use strict;
use File::Find::Rule;

#find all html files in specified directory
my $dir = "/var/www/site/htdocs/";
my $rule = File::Find::Rule->file->name("*.html")->start( $dir );

#keep track of the changed files in a file
open(OUTFILE,">>fixed_files.txt") || die "cant open fixed_files.txt, $!\n";

while ( my $html_file = $rule->match ) {
    rename($html_file, "$html_file.bak") or die;
    open(my $fh_in, '<', "$html_file.bak") or die;
    open(my $fh_out, '>', $html_file) or die;
    while (<$fh_in>) {
        #add the urchin code
        if (s|</head>|<script src="http://mysite.org/__utm.js" type="text/javascript"></script>\n</head>|i) {
            print OUTFILE "$html_file: fixed Urchin code\n";
        }
        print $fh_out $_;
    }
    close($fh_in);
    close($fh_out);
}
close OUTFILE;

Which worked like a charm. Unfortunately I came to find that I had a version control issue and some files had been updated with the code already and I didn't know it. So I ended up with a lot of pages with this:

<script src="http://mysite.org/__utm.js" type="text/javascript"></script>
<script src="http://mysite.org/__utm.js" type="text/javascript"></script>
</head>

I am trying to figure out how to remove the duplicate line from the files. I've tried many regexes with no success. My last idea was:

#fix the urchin code
if (s|<script src="http://mysite.org/__utm.js" type="text/javascript"></script>\n<script src="http://mysite.org/__utm.js" type="text/javascript"></script>|<script src="http://mysite.org/__utm.js" type="text/javascript"></script>\n|i) {
    print OUTFILE "$html_file: fixed Urchin code\n";
}

Can someone point me in a more productive direction?
Thanks!


I learn more and more about less and less until eventually I know everything about nothing.

Re: Regex to undo a regex?
by johngg (Canon) on Feb 02, 2008 at 00:36 UTC
    A couple of thoughts.

    • The text you want to correct in the erroneous files spans more than one line. If you want to use a regex to correct the text, you can't process the file line by line, because no single line will ever match. Instead, you would have to slurp the whole file into a single string so that you can do line-spanning matches.

    • If you want to process by reading a line at a time from the uncorrected file and printing to a corrected file, perhaps because the files are large, then you could use a sort of state engine. Remember the last line printed. Don't print the current line if it is eq (string equality) to both the last line and the text you want to remove; otherwise print it. The text to remove must include the newline, so you will probably need to use a double-quoting construct (qq{ ... }) to initialise it.
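
    As a rough illustration of both ideas, here is a minimal sketch. The sub names dedupe_slurped and dedupe_by_line are made up for this example, and the tag is copied from the original post; it assumes the duplicated lines are byte-for-byte identical, including case and the trailing newline.

```perl
use strict;
use warnings;

# The tag from the original post, with its trailing newline (hence qq{}).
my $tag = qq{<script src="http://mysite.org/__utm.js" type="text/javascript"></script>\n};

# Idea 1: work on the whole file as one string. Any run of two or more
# consecutive copies of $tag is collapsed into a single copy.
sub dedupe_slurped {
    my ($html) = @_;
    (my $fixed = $html) =~ s/(?:\Q$tag\E){2,}/$tag/g;
    return $fixed;
}

# Idea 2: line-at-a-time state engine. Remember the last line printed and
# skip the current line only if it equals both that line and $tag.
# Returns the number of lines that were dropped.
sub dedupe_by_line {
    my ($in, $out) = @_;
    my $last    = '';
    my $removed = 0;
    while ( my $line = <$in> ) {
        if ( $line eq $last and $line eq $tag ) {
            ++$removed;
            next;
        }
        print {$out} $line;
        $last = $line;
    }
    return $removed;
}
```

    For the first sub, the file contents can be read into a single string by localising $/ (for example: my $html = do { local $/; <$fh> };). The second sub takes two already-open filehandles, so it fits the rename-then-rewrite loop in the original script.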

    I hope these ideas help you.

    Cheers,

    JohnGG

    Update: Fixed typo, s/to/so/.

Re: Regex to undo a regex?
by hipowls (Curate) on Feb 02, 2008 at 00:24 UTC

    If I understand the problem correctly you are trying to eliminate duplicate lines that match a pattern where the lines immediately follow each other.

    This code will open a file called file.html and if there are two or more lines in succession that are identical and contain the text sample then the duplicates are removed and file.html is rewritten.

    use IO::String;

    my $regex = qr/sample/;
    my $file  = 'file.html';

    open my $in, '<', $file or die "Can't open $file: $!\n";

    # Collect the de-duplicated output in memory first
    my $new_html;
    my $new  = IO::String->new($new_html);
    my $last = '';
    my $modified;

    while ( my $line = <$in> ) {
        if ( $line =~ /$regex/ ) {
            if ( $line ne $last ) {
                print {$new} $line;
            }
            else {
                # identical to the previous line: drop it
                ++$modified;
            }
        }
        else {
            print {$new} $line;
        }
        $last = $line;
    }
    close $in;
    close $new;

    # Only rewrite the file (keeping a .bak copy) if something was removed
    if ($modified) {
        rename $file, "$file.bak" or die "Can't rename $file: $!\n";
        open my $out, '>', $file or die "Can't open $file: $!\n";
        print {$out} $new_html or die "Can't print to $file: $!\n";
        close $out or die "Can't close $file: $!\n";
    }

    Update: Just read shmem's rather neat answer and updated my code to make a backup if the file is modified.

Re: Regex to undo a regex?
by shmem (Chancellor) on Feb 02, 2008 at 00:27 UTC
    No need for a regex to remove duplicate lines:
    perl -ni.bak -e 'print unless $_ eq $old; $old = $_' files
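
    For anyone unfamiliar with the switches: -n wraps the code in a loop over each input line, and -i.bak edits each file in place while keeping a copy with a .bak extension. Note that the test drops any line identical to the one before it, not only the script tag, so consecutive blank lines or repeated markup lines would be collapsed too. The sketch below spells out the same logic as a sub, plus a narrower variant; both sub names are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Same test as the one-liner, written as a sub over a list of lines:
# keep a line unless it is identical to the line just before it.
sub drop_repeats {
    my @lines = @_;
    my $old   = '';
    my @kept;
    for my $line (@lines) {
        push @kept, $line unless $line eq $old;
        $old = $line;
    }
    return @kept;
}

# Hypothetical narrower variant: drop a repeated line only when it also
# matches $pattern, so other repeated lines are left alone.
sub drop_repeats_matching {
    my ( $pattern, @lines ) = @_;
    my $old = '';
    my @kept;
    for my $line (@lines) {
        push @kept, $line unless ( $line eq $old && $line =~ $pattern );
        $old = $line;
    }
    return @kept;
}
```

    With the narrower variant, a pattern such as qr/__utm\.js/ would limit the clean-up to the duplicated script line.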

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}