MeinName has asked for the wisdom of the Perl Monks concerning the following question:

Namaste, dear Perl Monks and Nuns!

I have a little site to administer that depends on Perl scripts printing HTML and JavaScript. The site stores data in and loads data from a database using SOAP::Lite, and up until last week everything ran very smoothly. Unfortunately our servers got an update that raised Perl from version 5.10.1 to 5.16.3, and that is where the problems started. First, all special characters like ä, ö, ü, ß and € got mangled by the encoding. This error was resolved by using

use open qw(:std :utf8);
use Encode;
use utf8;

to load the data, by writing

print header(-type=>'text/html',-charset=>'utf-8');
print start_html(-title => "FooBar");

to start the web page and by using

my $text=decode("utf8", $query->param("textbox"));

to get the data back from the page and store it in the DB.
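The decode step above can be checked in isolation. This is only a minimal sketch: the sample bytes are made up, and it uses the strict 'UTF-8' encoding name rather than the lax "utf8" from the line above. It decodes once on the way in, encodes once on the way out, and compares lengths to show what Perl thinks the string contains.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw form value: the UTF-8 bytes for "ä€" as a browser would send them.
my $raw = "\xC3\xA4\xE2\x82\xAC";

# Decode once on the way in: bytes become Perl characters.
my $text = decode('UTF-8', $raw);

# Encode once on the way out: characters become bytes again.
my $bytes_out = encode('UTF-8', $text);

printf "raw bytes: %d, characters: %d, round trip intact: %s\n",
    length($raw), length($text), ($bytes_out eq $raw ? 'yes' : 'no');
```

If the character count equals the byte count for a value containing umlauts, the value was never decoded; if the output shows doubled garbage characters, it was probably encoded twice.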

Today I realized that it is also no longer possible to load strings from the database that contain double quotes, because the HTML textboxes see the double quote as an end marker for the textbox value. (EDIT: Saving content with double quotes is possible. I verified this directly on database level.)

There are probably even a few more characters that cause issues that I haven't stumbled upon yet.

I found that I can resolve the issue by using HTML::Entities and calling encode_entities() on the variable I want to see in my textbox, but I am loading hundreds of variables and several arrays of strings in one page visit and there is no way I can revisit every single script in my application to see which variables I maybe need to encode.

Is there any way I can encode the whole HTML part of my script or a way to circumvent this mess?

Thanks a lot in advance

Re: Escaping double quotes in complete document
by haukex (Archbishop) on Jun 26, 2017 at 17:46 UTC

    I haven't yet had much experience with CGI.pm and UTF-8 issues, so I can't comment much on that, except that using Devel::Peek has been a useful tool for me a few times when dealing with UTF-8 issues, as it helps you see what Perl thinks the string contains. Usually, if you get the encoding right at those points where the data enters and leaves the script, Perl should handle Unicode just fine. Anyway,

    the HTML textboxes see the double quote as an end marker for the textbox value

    This sounds to me like you might be building your HTML by interpolation, as in print qq{<input type="text" name="foo" value="$val">};? If so, that is certainly the source of the problem and you should use one of the available APIs to write your HTML instead, as they will do the escaping for you. Back before CGI.pm was discouraged, one way to do it was with its HTML generation functions (which are now deprecated). So nowadays the following is not recommended for new scripts, but note how the attribute is properly escaped:

    use CGI qw/:html :form/;
    my $val = q{ "Hello" <world> &amp; };
    print textfield('foo',$val), "\n";
    __END__
    <input type="text" name="foo" value=" &quot;Hello&quot; &lt;world&gt; &amp;amp; " />

    Currently, CGI::HTML::Functions recommends HTML::Tiny, which I haven't yet had the chance to try, and of course there are frameworks like Template::Toolkit or even Mojolicious, although the latter is meant to replace everything that CGI.pm does.

      Hi haukex,

      The site was already standing when I joined the company, so please don't facepalm on me over this, but here's how my HTML site is largely generated:

      print <<"EndOfText";
      <html>
      <!--Foo-->
      <input type="text" name="mytext" id="mytext" value="$SOAPResult"/>
      <!--Bar-->
      </html>
      EndOfText

      The whole site is based on this system and I'm pretty sure my boss is going to kill me when I go to him saying "Yeah, we gotta change the whole thing... Gonna take about two weeks."

      If I have to stay with my running code, I get that I have to manually escape every HTML entity by hand, right?

        please don't facepalm on me over this

        No, I understand, but this is a very old style of generating HTML - probably my very first attempts at CGI scripts from over 20 years ago looked like this :-) But also, the issues with double quotes would have existed the entire time, even without the Perl upgrade. Also, I agree with huck that it's possible that maybe something has changed in the way the data gets handed to your script.

        my boss is going to kill me

        Well, if he needs further convincing, then tell him that HTML generation code like this exposes your customers to a Cross-site scripting (XSS) attack (longer explanation).
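        To make the XSS point concrete, here is a minimal sketch of what interpolating a hostile value does to the markup. The stored value and the escape_html helper are made up for illustration; in real code HTML::Entities does the same job.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: escape the characters that are special in HTML attribute values.
        sub escape_html {
            my ($s) = @_;
            my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&#39;');
            $s =~ s/([&<>"'])/$ent{$1}/g;
            return $s;
        }

        # Hypothetical hostile value that somebody saved to the database.
        my $value = q{"><script>alert(1)</script>};

        # Plain interpolation: the leading quote closes the attribute and the script tag becomes live markup.
        my $unsafe = qq{<input type="text" name="mytext" value="$value"/>};

        # Escaped: the same characters arrive in the browser as harmless text inside the textbox.
        my $safe = qq{<input type="text" name="mytext" value="} . escape_html($value) . qq{"/>};

        print "$unsafe\n$safe\n";
        ```

        The first printed line contains a real script element; the second keeps the whole value inside the attribute. The same mechanism is what cuts off your textbox values at the first double quote.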

        I get that I have to manually escape every HTML entity by hand, right?

        I'm sorry to say yes. The minimal change needed to the code you showed is the following (encode_entities), keeping in mind that it encodes $SOAPResult once and then the value stays that way, so if you need the value for something else later you should modify a copy instead, like e.g. encode_entities(my $copy=$SOAPResult);

        use HTML::Entities qw/encode_entities/;
        my $SOAPResult = q{ "Hello" <world> &amp; };
        encode_entities($SOAPResult);
        print <<"EndOfText";
        <input type="text" name="mytext" id="mytext" value="$SOAPResult"/>
        EndOfText
        __END__
        <input type="text" name="mytext" id="mytext" value=" &quot;Hello&quot; &lt;world&gt; &amp;amp; "/>
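        A small variation, assuming HTML::Entities behaves the way I understand its documentation: when encode_entities is called in non-void context it returns the encoded string and leaves its argument alone, so the original value stays usable. The variable name $escaped is just for illustration.

        ```perl
        use strict;
        use warnings;
        use HTML::Entities qw(encode_entities);

        my $SOAPResult = q{ "Hello" <world> & };

        # Non-void context: the return value is the escaped copy, $SOAPResult itself is not modified.
        my $escaped = encode_entities($SOAPResult);

        # Only the escaped copy is interpolated into the markup.
        print qq{<input type="text" name="mytext" id="mytext" value="$escaped"/>\n};
        ```

        This keeps the raw value around for anything that is not HTML, such as writing it back to the database.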

        What you show above would never have escaped double quotes, no matter what Perl version you used; something else has changed.

        Was something keeping double quotes out of the database before that no longer checks? Was something encoding them when they were read from the database that no longer does so?

        Hi again haukex and huck

        I share the suspicion that something other than Perl has changed, but I have already dug fairly deep into this.

        Java (which runs the JBoss server from which I get my input) hasn't been updated, because it is not a package available in my package manager; neither is JBoss itself nor the database host.
        Apache has been updated, but as I understand it, it would have interpreted double quotes wrongly regardless of its version, right?
        That leaves Perl and my changes in the scripts as possible culprits. As I said, the update messed up the page encoding, because none of the scripts were explicitly using UTF-8 and most of them started the HTML page encoded as ISO-8859-1.

        I can confirm with 100% certainty that saving and loading UTF-8 characters and double quotes worked before the update.
        However, I do not know whether it stopped working right after the update or only started to malfunction this way after I enabled the scripts to use UTF-8 (though I think these errors are not necessarily connected).

        I appreciate your answers a lot and I will definitely propose updating our infrastructure according to your advice.

Re: Escaping double quotes in complete document
by thanos1983 (Parson) on Jun 26, 2017 at 13:02 UTC

    Hello MeinName,

    I am not familiar with the problem that you are having, but I searched online and found the module HTML::Entities. Give it a try.

    Example from documentation:

    use HTML::Entities;
    $a = "V&aring;re norske tegn b&oslash;r &#230res";
    decode_entities($a);
    encode_entities($a, "\200-\377");

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      Hi thanos,

      thank you for your reply. Unfortunately it is not exactly what I was looking for. As I wrote, I came across HTML::Entities myself, but there's just too many scripts with waaay too many variables and arrays and whatnot to go along and decode/encode every single one by hand.

      I am looking for a method to tell Perl "Go ahead and just encode everything you get so that HTML entities are correctly loaded." Kind of like

      use open qw(:std :utf8);

      is doing for UTF-8 encoding.

      Best regards

      MeinName

        Hello again MeinName,

        Apologies, I did not notice that.

        The only alternative way that I found is to force your whole script to use UTF-8 by default. For example:

        $ PERL_UNICODE=S perl script.pl

        You can read further in perlrun, under Command Switches.

        Give it a try.

        Hope this helps.

Re: Escaping double quotes in complete document
by holli (Abbot) on Jun 26, 2017 at 17:29 UTC
    How do you output the html? print statements to STDOUT or some kind of templating system? Also what is the source of the input? A file or a database? Maybe you should reverse the problem and write a one off script that converts the input data once and for good.


    holli

    You can lead your users to water, but alas, you cannot drown them.

      Hi holli,

      for the output method I'm using, please refer to my reply to haukex.

      My input comes from multiple Java functions running on a JBoss server that is connected to a database. I call them via SOAP::Lite and read the returned data objects.

      I will look into changing those to automatically encode HTML entities. That may be a quicker and more thorough solution than sifting through every script I have.
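      On the Perl side, escaping at the data boundary could look roughly like the sketch below: walk whatever structure the SOAP call returned and escape every string in it before any heredoc sees it. escape_html and escape_deep are hypothetical helpers, and the sample hash merely stands in for a real SOAP::Lite result.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: escape the characters that are special in HTML text and attributes.
      sub escape_html {
          my ($s) = @_;
          my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&#39;');
          $s =~ s/([&<>"'])/$ent{$1}/g;
          return $s;
      }

      # Hypothetical helper: return a deep copy of plain arrays and hashes with every string escaped.
      sub escape_deep {
          my ($data) = @_;
          my $type = ref $data;
          if ($type eq 'ARRAY') {
              return [ map { escape_deep($_) } @$data ];
          }
          if ($type eq 'HASH') {
              my %copy;
              $copy{$_} = escape_deep($data->{$_}) for keys %$data;
              return \%copy;
          }
          # undef, blessed objects and other references are returned unchanged.
          return $data if $type || !defined $data;
          return escape_html($data);
      }

      # Stand-in for a structure returned by a SOAP call.
      my $result = { title => 'He said "hi"', tags => [ '<b>', 'R&D' ] };
      my $safe   = escape_deep($result);

      print qq{<input type="text" name="title" value="$safe->{title}"/>\n};
      ```

      The catch is that the escaped copy is only right for HTML output. Running it through the helper twice double-escapes it, and values that go into JavaScript or back to the database need the raw data. Saving from the form should still work, because the browser sends back the unescaped characters.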

Re: Escaping double quotes in complete document
by holli (Abbot) on Jun 27, 2017 at 14:32 UTC
    This is the minimal code to use a templating engine named Template::Toolkit.
    use Template;
    my $tt = Template->new();
    $tt->process(
        \q[<html>
    <!--Foo-->
    <input type="text" name="mytext" id="mytext" value="[% soap_result %]"/>
    <!--Bar-->
    </html>],
        { soap_result => $soap_result }
    ) || die $tt->error(), "\n";
    This protects you from Cross-Site-Scripting attacks and handles the double quote issue.


    holli

      This protects you from Cross-Site-Scripting attacks and handles the double quote issue.

      Not quite, you're missing the html filter, e.g.:

      use Template;
      my $tt = Template->new();
      my $soap = ' "foo" <bar> &amp; ';
      $tt->process(\<<END, {soap=>$soap}) || die $tt->error();
      <html>
      <input type="text" name="mytext" value="[% soap %]"/>
      </html>
      END
      $tt->process(\<<END, {soap=>$soap}) || die $tt->error();
      <html>
      <input type="text" name="mytext" value="[% soap | html %]"/>
      </html>
      END
      __END__
      <html>
      <input type="text" name="mytext" value=" "foo" <bar> &amp; "/>
      </html>
      <html>
      <input type="text" name="mytext" value=" &quot;foo&quot; &lt;bar&gt; &amp;amp; "/>
      </html>
        Well, it's been a while :-)


        holli

Re: Escaping double quotes in complete document
by MeinName (Novice) on Jun 29, 2017 at 07:16 UTC

    Thank you all for your ideas and help on this matter!

    I talked with my boss and he talked with his boss, and it seems we are either revamping the entire site to use toolkits for creating HTML code or changing our infrastructure to get away from completely web-based appliances.

    Again, thank you all for your time and help!