Anchor parsing

cromiumlake has asked for the wisdom of the Perl Monks concerning the following question:

Hello Masters of the Universe, this is a minimal example of the exact bit where I'm getting the conflict:

 
#-----------------------------------
#!/usr/bin/perl


use strict;
use warnings;
# use re 'debug';


my $workingfolder = '/home/crom/Documents/perl_scripts/html_cleaner/test';
my $file_name = '555 timer.html';

my $link = '<a.*(?=<\/a>)<\/a>';
my $re_link = qr/$link/;

open(FILEVAR, "<$workingfolder/$file_name") or die "cannot open $file_name: $!";
my @html = '';



while(<FILEVAR>){
chomp $_;
push(@html, $_);

}



close FILEVAR;

&filters(@html);

foreach $_ (@html) {print "$_"}

exit(0);



sub filters {
(@html) = @_;

foreach $_ (@html) {


if ($_ =~ /$re_link/gis) {
$_ =~ (s/$re_link//gis);
}


}

return(@html);
}

#-----------------------------------

the text loaded from "555 timer.html":

<tr>
<td><img src="./555 timer_files/g_red_an.gif" border="0" width="12" height="12"> <a href="https://homepages.westminster.org.uk/electronics/555.htm#reset"><em>RESET
input</em></a></td>
<td><a href="https://homepages.westminster.org.uk/electronics/555.htm#links"><img src="./555 timer_files/g_red_an.gif" border="0" width="12" height="12"></a>
<em><a href="https://homepages.westminster.org.uk/electronics/555.htm#links">LINKS . . .</a></em></td>
</tr>


#-----------------------------------

Well, if you try it you will see that the "RESET" line always remains, grrrrrrrrrr. Any ideas apart from nuking it? :)


Re: Anchor parsing by moritz (Cardinal) on Aug 01, 2011 at 13:28 UTC
`my $link = '<a.*(?=<\/a>)<\/a>';`

You likely want `my $re = qr{<a(?:(?!</a>).)*</a>}s` (negative look-ahead) or something similar. Doing a positive look-ahead for the thing you're then matching doesn't make any sense: the look-ahead can only succeed at the position where the following `<\/a>` would match anyway, so it adds nothing.

Or just use the right tool for the job.
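
For illustration, here is a minimal sketch of how the negative look-ahead pattern behaves on a link that spans a newline. The helper name `strip_links` and the shortened sample string are made up for this example, and the `\b` after `<a` is an extra assumption (so that tags such as `<abbr>` are not treated as anchors); it is not part of the pattern suggested above.

```perl
use strict;
use warnings;

# Tempered pattern: consume one character at a time, but only while the
# text at the current position does not start a closing </a> tag.
# /s lets "." match newlines, /i makes the tag names case-insensitive.
my $re_link = qr{<a\b(?:(?!</a>).)*</a>}is;

# Hypothetical helper: remove every anchor element from a string.
sub strip_links {
    my ($html) = @_;
    $html =~ s/$re_link//g;
    return $html;
}

# Shortened stand-in for the RESET cell, including the line break
# inside the link text.
my $sample = qq{<td><a href="555.htm#reset"><em>RESET\ninput</em></a></td>};
print strip_links($sample), "\n";
```

Note that this only removes the link when the whole element is inside the string being matched, so it would still need the HTML in a single string rather than line by line.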
Re: Anchor parsing by jethro (Monsignor) on Aug 01, 2011 at 14:22 UTC
You are trying to match an HTML link that is split over two lines, but you apply your regex line by line! When the regex sees the first line it can't match because the closing tag is missing, and on the next line the opening tag is missing. This is why PerlMonks usually give the advice to use a module to edit HTML instead of a simple regex.

To remedy your situation you could put the whole HTML into one string, but then you will notice that the regex eats everything between the first opening link tag and the last closing link tag. To remedy that you would change ".*" to ".*?" so that the minimal match is found instead of the longest match.

But now and then your pattern will still fail, because, for example, there could be an HTML comment containing a closing link tag (which would be correct HTML but not a real closing tag).
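
A rough sketch of the two steps described here, reading the whole input into one string and then comparing the greedy and the minimal match. To keep it self-contained it reads from an in-memory filehandle over a shortened sample instead of the real file; the helper name `slurp` and the sample text are assumptions for the example only.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle into one string.
sub slurp {
    my ($fh) = @_;
    local $/;               # no record separator: one read returns it all
    my $content = <$fh>;
    return $content;
}

# Stand-in for the HTML file, with the first link split over two lines.
my $sample = <<'HTML';
<td><a href="555.htm#reset"><em>RESET
input</em></a></td>
<td><em><a href="555.htm#links">LINKS . . .</a></em></td>
HTML

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $html = slurp($fh);
close $fh;

# Greedy: runs from the first <a to the LAST </a> in the string.
(my $greedy = $html) =~ s/<a\b.*<\/a>//gis;

# Minimal: each <a is matched up to its nearest </a>.
(my $lazy = $html) =~ s/<a\b.*?<\/a>//gis;

print "greedy:\n$greedy\nlazy:\n$lazy";
```

The greedy version also swallows the markup between the two links, while the minimal version removes each link on its own. Neither handles the HTML comment case mentioned above.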