gem555 has asked for the wisdom of the Perl Monks concerning the following question:

I have the code below.
#!/usr/bin/perl
while(<DATA>){
    s/<a\s\w*>\w*</<\/a>/;
    print $_;
}
__DATA__
<a href="www.google.txt>Click here<span>
How can I close the tag if the ending tag does not exist before the start of another '<'?
<a href="www.google.txt>Click here<span>

Replies are listed 'Best First'.
Re: close end tag
by jbt (Chaplain) on Aug 21, 2009 at 10:47 UTC
    I am not sure what you are ultimately trying to accomplish but you may want to consider a module like HTML::Scrubber to clean up your HTML.
Re: close end tag
by Utilitarian (Vicar) on Aug 21, 2009 at 11:01 UTC
    Hi gem,

    If you want to use the pattern above you need to correct your regular expression.

    You are matching:

    <a    # begin anchor tag
    \s    # followed by a space
    \w*   # followed by any number of alphanumeric characters
          # this fails to match ["'=/:%?&] and probably many other potential
          # permissible chars in this context, and so your regex fails here
    >     # end anchor tag
    \w*   # any text as long as it's all one word; see the point above, you
          # should be matching any char which is not the beginning of a tag
    <     # beginning of a new tag, including an end of anchor

    You are replacing your match with just an end-of-anchor tag. You need to capture what you match (possibly two distinct captures, or a lookahead to check for the end of anchor) and insert the end of anchor there.

    You have the basis of a tolerable regex but it needs finishing.

    Try to implement the above and let us know how you get on.
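    For readers following along, here is one way the points above might be combined. It is only a sketch, not the poster's own solution: the helper name close_anchor and the sample lines are made up for illustration, and it assumes the anchor text is followed by another tag on the same line. It captures the opening tag and its text, then uses a lookahead to check that the next tag is not </a>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): insert </a> before the next
# tag when an anchor's text is not followed by its own closing tag.
sub close_anchor {
    my ($line) = @_;
    $line =~ s{
        (              # capture what we match so it can be put back
          <a\s         # begin anchor tag, followed by whitespace
          [^>]*        # attributes: anything up to the end of the tag
          >            # end of the opening anchor tag
          [^<]*        # link text: any chars that do not begin a tag
        )
        (?= < (?! /a> ) )   # a new tag starts here and it is not an end of anchor
    }{$1</a>}gix;
    return $line;
}

# Sample input: the line from the question plus an already closed anchor.
my @samples = (
    '<a href="www.google.txt>Click here<span>',
    '<a href="www.example.com">Already closed</a>',
);

print close_anchor($_), "\n" for @samples;
```

    With these two samples, the first line should gain a </a> between "Click here" and <span>, and the second should pass through unchanged. An anchor whose text runs to the end of the line with no following tag is not handled here.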

Re: close end tag
by Anonymous Monk on Aug 21, 2009 at 11:23 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use YAPE::HTML 1.11;
    while(<DATA>){
        # adds missing closing tags
        print YAPE::HTML->new($_)->display,"\n";
    }
    __DATA__
    <a href="www.google.txt>Click here<span>
    Outputs
    <a href="www.google.tx">Click here<span> </span></a>
Re: close end tag
by Anonymous Monk on Aug 21, 2009 at 10:49 UTC
    if( closing tag is missing ){
      print closingtag
    }
    So I think:
    print '</a>' if -1 == index lc $_, '</a>';
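    To show how that one-liner might sit in a complete script, here is a minimal sketch. The helper name append_missing_close and the sample lines are assumptions made for illustration. Note that this approach puts the closing tag at the end of the line rather than before the next '<', and it appends </a> to any line that lacks one, whether or not the line contains an anchor.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the line with </a>
# appended when no closing anchor tag appears anywhere in it.
sub append_missing_close {
    my ($line) = @_;
    chomp $line;
    # index returns -1 when the substring is absent; lc makes the check
    # case-insensitive, so </A> also counts as a closing tag.
    $line .= '</a>' if index(lc $line, '</a>') == -1;
    return $line;
}

print append_missing_close($_), "\n"
    for '<a href="www.google.txt>Click here<span>',
        '<a href="www.example.com">Already closed</a>';
```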