Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

#!/usr/bin/perl
while ($line = <DATA>) {
    @valid_entities = ('<a>','<abbr>','<acronym>','<br>');
    my %htmlenties = map { $_ => 1 } @valid_entities;
    #$line =~ s/(<\w*?>(?![^<\w*?>]*<\/\w*?>))/$1/g;
    #$line =~ s/<\w*>(?![^<\/\w*>]*>)/&lt;/g;
    $line =~ s/(<(\w*?)>)/exists $htmlenties{$1} ? $1 : defined ($2) ? "&lt;$2&gt;" : "&lt;"/eg;
    print $line;
}
__DATA__
<helloe>How r u <a> www.google.com</a> <hi>How r u </hi><et,-2><><br/>

How can I convert '<' to '&lt;' and '>' to '&gt;' for every tag that is not present in the array,
and only if it does not have a closing tag? In the above example, the output should look like:
&lt;helloe&gt;How r u <a> www.google.com</a> <hi>How r u </hi>&lt;et,-2&gt;&lt;&gt;<br/>

Re: convert characters
by Utilitarian (Vicar) on Aug 27, 2009 at 12:24 UTC
    Substitute the following:
    @valid_entities = ('a','abbr','acronym','br'); # remove the angle brackets
    my %htmlenties = map { $_ => 1 } @valid_entities;
    $line =~ s/
        <            # a tag open
        (            # begin capture group 1
          \/?        # an optional slash
          (          # begin capture group 2
            [^>]*?   # any number of characters that aren't closing tags
          )          # end capture group 2
          \/?        # an optional slash (xhtml and all that)
        )            # end capture group 1
        >            # a closing tag
    /exists $htmlenties{$2} ? "<$1>" : defined ($1) ? "&lt;$1&gt;" : "&lt;"/xeg; # different captures => different process
      #!/usr/bin/perl
      while ($line = <DATA>) {
          @valid_entities = ('<a>','<abbr>','<acronym>');
          my %htmlenties = map { $_ => 1 } @valid_entities;
          @valid_entities = ('a','abbr','acronym','br'); # remove tags
          my %htmlenties = map { $_ => 1 } @valid_entities;
          $line =~ s/<(\/?([^>]*?)\/?)>/exists $htmlenties{$2} ? "<$1>" : defined ($1) ? "&lt;$1&gt;" : "&lt;"/xeg;
          print $line;
      }
      __DATA__
      <helloe>How r u <a> www.google.com</a> <hi>How r u </hi><et,-2><>
      With the above code, the output I got is:
      &lt;helloe&gt;How r u <a> www.google.com</a> &lt;hi&gt;How r u &lt;/hi&gt;&lt;et,-2&gt;&lt;&gt;
      But the expected output is:
      &lt;helloe&gt;How r u <a> www.google.com</a> <hi>How r u </hi>&lt;et,-2&gt;&lt;&gt;
      Since '<hi>' has a matching '</hi>', it should not be replaced.
        Then add 'hi' to your valid entities array.
        Did you write any of this code yourself?
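
If the closing-tag rule has to be automatic rather than handled by extending the array, one possible approach is a two-pass substitution: first collect every tag name that has a closing tag on the line, then escape only the tags that are neither whitelisted nor closed. The following is a minimal sketch under those assumptions, not a full HTML parser. The sub name `escape_unpaired` and the hash names `%allowed` and `%closed` are made up for illustration, and it only handles bare tags without attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Whitelisted tag names, stored without angle brackets.
my %allowed = map { $_ => 1 } qw(a abbr acronym br);

# Hypothetical helper: escape tags that are neither whitelisted
# nor closed somewhere on the same line.
sub escape_unpaired {
    my ($line) = @_;

    # Pass 1: remember every tag name that has a closing tag, e.g. </hi>.
    my %closed = map { $_ => 1 } $line =~ m{</(\w+)\s*>}g;

    # Pass 2: decide for each <...> chunk whether to keep or escape it.
    $line =~ s{<(/?)([^<>]*?)(/?)>}{
        my ($open_slash, $name, $self_close) = ($1, $2, $3);
        ($allowed{$name} || $closed{$name})
            ? "<$open_slash$name$self_close>"
            : "&lt;$open_slash$name$self_close&gt;";
    }ge;

    return $line;
}

print escape_unpaired(
    '<helloe>How r u <a> www.google.com</a> <hi>How r u </hi><et,-2><><br/>'
), "\n";
```

On the sample line from the question, this leaves `<hi>` and `</hi>` alone because a closing `</hi>` was seen in the first pass, while `<helloe>`, `<et,-2>` and `<>` are escaped.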

        Update: My addition is not a complete solution by any means. For example, it will fail to correctly interpret the following (reorganising it so that it does could be a worthwhile exercise for you):

        <hi > <a href="http://permonks.org">Link</a> <input value="Next>" type="submit">
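
One way to approach cases like these is to let the pattern capture the tag name separately from the rest of the tag, and to treat quoted attribute values as units so that a '>' inside quotes does not end the tag. The sketch below is only an illustration under stated assumptions: the names `escape_unknown_tags`, `escape_angles`, `%allowed` and `%closed` are hypothetical, escaping the '>' inside an escaped tag's attribute value is a design choice of this sketch rather than something the thread specifies, and a '<' that never forms a complete tag is left untouched. For real-world HTML a proper parser is the safer route.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Whitelisted tag names, stored without angle brackets.
my %allowed = map { $_ => 1 } qw(a abbr acronym br);

# One tag: optional slash, a name, then attributes and an optional slash.
my $tag_re = qr{
    <
    /?                    # optional leading slash of a closing tag
    (?<name> \w* )        # tag name, possibly empty as in "<>"
    (?: "[^"]*"           # a double-quoted attribute value
      | '[^']*'           # a single-quoted attribute value
      | [^<>"']           # any other character that cannot end the tag
    )*
    >
}x;

# Hypothetical helper: turn every angle bracket in a chunk into an entity.
sub escape_angles {
    my ($text) = @_;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: keep whitelisted or closed tags, escape the rest.
sub escape_unknown_tags {
    my ($line) = @_;

    # Tag names that have a closing tag somewhere on the line.
    my %closed = map { $_ => 1 } $line =~ m{</(\w+)\s*>}g;

    $line =~ s{($tag_re)}{
        # Copy the captures before calling a sub that runs its own regexes.
        my ($whole, $name) = ($1, $+{name});
        ($allowed{$name} || $closed{$name}) ? $whole : escape_angles($whole);
    }ge;

    return $line;
}

print escape_unknown_tags(
    '<hi > <a href="http://permonks.org">Link</a> <input value="Next>" type="submit">'
), "\n";
```

With this pattern the `<a href="...">` tag is recognised as `a` despite its attribute, and `<input value="Next>" type="submit">` is matched as a single tag instead of being cut short at the quoted '>'.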