cromiumlake has asked for the wisdom of the Perl Monks concerning the following question:

Hello Masters of the Universe, this is the minimal example of the exact bit where I'm getting the conflict:
#-----------------------------------
#!/usr/bin/perl
use strict;
use warnings;
# use re 'debug';

my $workingfolder = '/home/crom/Documents/perl_scripts/html_cleaner/test';
my $file_name = '555 timer.html';
my $link = '<a.*(?=<\/a>)<\/a>';
my $re_link = qr/$link/;

open(FILEVAR, "<$workingfolder/$file_name") or die "cannot open $file_name: $!";
my @html = '';
while(<FILEVAR>){
    chomp $_;
    push(@html, $_);
}
close FILEVAR;

&filters(@html);
foreach $_ (@html) {print "$_"}
exit(0);

sub filters {
    (@html) = @_;
    foreach $_ (@html) {
        if ($_ =~ /$re_link/gis) {
            $_ =~ (s/$re_link//gis);
        }
    }
    return(@html);
}
#-----------------------------------
The text loaded from "555 timer.html":
<tr>
<td><img src="./555 timer_files/g_red_an.gif" border="0" width="12" height="12"> <a href="https://homepages.westminster.org.uk/electronics/555.htm#reset"><em>RESET input</em></a></td>
<td><a href="https://homepages.westminster.org.uk/electronics/555.htm#links"><img src="./555 timer_files/g_red_an.gif" border="0" width="12" height="12"></a>
<em><a href="https://homepages.westminster.org.uk/electronics/555.htm#links">LINKS . . .</a></em></td>
</tr>
#-----------------------------------
Well, if you try it you will see that the "RESET" line always remains. Grrrrrrrrrr. Any ideas, apart from nuking it? :)

Re: Anchor parsing
by moritz (Cardinal) on Aug 01, 2011 at 13:28 UTC
Re: Anchor parsing
by jethro (Monsignor) on Aug 01, 2011 at 14:22 UTC

    You are trying to match an HTML link that is split over two lines, but you apply your regex line by line! When the regex sees the first line it can't match because the closing tag is missing; on the next line the opening tag is missing.
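
A minimal sketch of that effect, using a made-up two-line fragment (the `@lines` data and the `example.org` URL are illustrative assumptions, not the file from the question). Applied line by line the pattern finds nothing; applied to the joined string it matches once.

```perl
use strict;
use warnings;

# Hypothetical input: one link whose opening and closing tags
# sit on different lines.
my @lines = (
    '<td><a href="http://example.org/page.htm#reset">',
    '<em>RESET input</em></a></td>',
);

# Same pattern as in the question, with the /i and /s flags baked in.
my $re_link = qr/<a.*(?=<\/a>)<\/a>/is;

# Line by line: no single line contains both "<a" and "</a>".
my $per_line_hits = grep { $_ =~ $re_link } @lines;

# As one string: /s lets "." cross the newline, so the link matches.
my $whole      = join "\n", @lines;
my $whole_hits = () = $whole =~ /$re_link/g;

print "per line: $per_line_hits, whole string: $whole_hits\n";
```

Run as written, this should print `per line: 0, whole string: 1`.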

    This is why PerlMonks usually advise using a module to edit HTML instead of a simple regex. To remedy your situation you could put the whole HTML into one string, but then you will notice that the regex eats everything between the first opening link tag and the last closing link tag. To remedy that, change ".*" to ".*?" so that the shortest match is found instead of the longest. Even then your pattern will occasionally fail, for example when an HTML comment contains a closing link tag (which would be correct HTML, but not a real closing tag).
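
A rough sketch of the difference between ".*" and ".*?" on a whole-file string, followed by the comment case that still goes wrong. The sample strings are invented for illustration. The pattern is a simplified variant of the one in the question: the lookahead is dropped because the literal `<\/a>` right after it already enforces the same thing, and the `\b` after `<a` is a small addition so that tags such as `<abbr>` are not picked up.

```perl
use strict;
use warnings;

# Hypothetical input: two separate links held in one string.
my $html = join "\n",
    '<td><a href="http://example.org/a.htm">first</a> keep me</td>',
    '<td><a href="http://example.org/b.htm">second</a></td>';

# Greedy: ".*" runs from the first "<a" to the LAST "</a>",
# so the text between the two links is deleted as well.
(my $greedy = $html) =~ s/<a\b.*<\/a>//gis;

# Non-greedy: ".*?" stops at the NEAREST "</a>",
# so each link is removed on its own.
(my $lazy = $html) =~ s/<a\b.*?<\/a>//gis;

# Still fragile: a "</a>" inside a comment ends the match too early.
my $tricky = '<a href="x.htm">text <!-- </a> --> more</a>';
(my $out = $tricky) =~ s/<a\b.*?<\/a>//gis;

print "greedy:     $greedy\n";
print "non-greedy: $lazy\n";
print "comment:    $out\n";
```

With this input the greedy version leaves only `<td></td>`, the non-greedy version keeps ` keep me`, and the comment case leaves the stray tail ` --> more</a>` behind. That last case is the kind of failure a real HTML parsing module avoids.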