in reply to Escaping double quotes in complete document

I haven't yet had much experience with CGI.pm and UTF-8 issues, so I can't comment much on that, except that Devel::Peek has been a useful tool for me a few times when dealing with UTF-8 issues, as it helps you see what Perl thinks the string contains. Usually, if you get the encoding right at the points where the data enters and leaves the script, Perl should handle Unicode just fine. Anyway,

the HTML textboxes see the double quote as an end marker for the textbox value

This sounds to me like you might be building your HTML by interpolation, as in print qq{<input type="text" name="foo" value="$val">};? If so, that is certainly the source of the problem and you should use one of the available APIs to write your HTML instead, as they will do the escaping for you. Back before CGI.pm was discouraged, one way to do it was with its HTML generation functions (which are now deprecated). So nowadays the following is not recommended for new scripts, but note how the attribute is properly escaped:

    use CGI qw/:html :form/;
    my $val = q{ "Hello" <world> &amp; };
    print textfield('foo',$val), "\n";
    __END__
    <input type="text" name="foo" value=" &quot;Hello&quot; &lt;world&gt; &amp;amp; " />

Currently, CGI::HTML::Functions recommends HTML::Tiny, which I haven't yet had the chance to try, and of course there are frameworks like Template::Toolkit or even Mojolicious, although the latter is meant to replace everything that CGI.pm does.

Re^2: Escaping double quotes in complete document
by MeinName (Novice) on Jun 27, 2017 at 06:34 UTC

    Hi haukex,

    The site already existed when I joined the company, so please don't facepalm on me over this, but here's how my HTML site is largely generated:

        print <<"EndOfText";
        <html>
        <!--Foo-->
        <input type="text" name="mytext" id="mytext" value="$SOAPResult"/>
        <!--Bar-->
        </html>
        EndOfText

    The whole site is based on this system and I'm pretty sure my boss is going to kill me when I go to him saying "Yeah, we gotta change the whole thing... Gonna take about two weeks."

    If I have to stay with my running code, I get that I have to manually escape every HTML entity by hand, right?

      please don't facepalm on me over this

      No, I understand, but this is a very old style of generating HTML - probably my very first attempts at CGI scripts from over 20 years ago looked like this :-) But also, the issues with double quotes would have existed the entire time, even without the Perl upgrade. Also, I agree with huck that it's possible that something has changed in the way the data gets handed to your script.

      my boss is going to kill me

      Well, if he needs further convincing, then tell him that HTML generation code like this exposes your customers to a Cross-site scripting (XSS) attack (longer explanation).

      I get that I have to manually escape every HTML entity by hand, right?

      I'm sorry to say yes. The minimal change needed to the code you showed is the following (encode_entities). Keep in mind that it encodes $SOAPResult in place, once, and the value then stays that way, so if you need the original value for something else later you should modify a copy instead, e.g. encode_entities(my $copy=$SOAPResult);

          use HTML::Entities qw/encode_entities/;
          my $SOAPResult = q{ "Hello" <world> &amp; };
          encode_entities($SOAPResult);
          print <<"EndOfText";
          <input type="text" name="mytext" id="mytext" value="$SOAPResult"/>
          EndOfText
          __END__
          <input type="text" name="mytext" id="mytext" value=" &quot;Hello&quot; &lt;world&gt; &amp;amp; "/>
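      If the site has many heredocs like this, one way to keep the change small might be a tiny helper that returns an escaped copy and is called right inside the heredoc. This is only a sketch: the helper name h() is made up for illustration, and it relies on encode_entities returning a modified copy when it is not called in void context. The second argument restricts escaping to the characters that matter in HTML text and attribute values.

```perl
use strict;
use warnings;
use HTML::Entities qw/encode_entities/;

# Hypothetical helper (not part of any module): returns an escaped copy
# and leaves the original variable untouched.
sub h {
    my ($text) = @_;
    # Only escape < > & " ' so that other characters pass through unchanged.
    return encode_entities( $text // '', q{<>&"'} );
}

# Sample value; in the real script this would come from the SOAP call.
my $SOAPResult = q{"Hello" <world> & "friends"};

# @{[ ... ]} interpolates the result of an expression inside a heredoc.
my $html = <<"EndOfText";
<input type="text" name="mytext" id="mytext" value="@{[ h($SOAPResult) ]}"/>
EndOfText

print $html;
```

      With the sample value above, the value attribute comes out as &quot;Hello&quot; &lt;world&gt; &amp; &quot;friends&quot; and $SOAPResult itself is unchanged, so it can still be used elsewhere in the script.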

      What you show above would never have escaped double quotes, no matter what perl version you used. Something else has changed.

      Was something keeping double quotes out of the database before that no longer checks? Was something encoding them as they were read from the database that is no longer doing so?

      Hi again haukex and huck

      I share the suspicion that something other than Perl has changed, but I have already dug fairly deep into this.

      Java (which runs the JBoss server from which I get my input) hasn't been updated, because it is not a package available in my package manager, and neither is JBoss itself nor the database host.
      Apache has been updated, but as I understand it, it would have interpreted double quotes wrongly regardless of its version, right?
      That leaves Perl and my changes in the scripts as possible culprits. As I said, the update messed up the page encoding, because none of the scripts were explicitly using UTF-8 and most of them started the HTML page encoded as ISO-8859-1.

      I can confirm with 100% certainty that saving and loading UTF-8 characters and double quotes worked before the update.
      However, I do not know whether it stopped working right after the update or whether it started to malfunction this way after I enabled the scripts to use UTF-8 (though I think these errors are not necessarily connected).

      I appreciate your answers a lot and I will definitely propose updating our infrastructure according to your advice.

        I can confirm with 100% certainty that saving and loading UTF-8 characters and double quotes worked before the update.

        I am unclear on this for several reasons. First, does this mean that you have both the new and the old software environment available in parallel? If so, one obvious approach to debugging it would be to pepper the script with plenty of debugging output until the differences and the origin of those differences become clear. Second, can you be more specific on the double quotes issue: in what way did it previously "work" compared to now? Again, showing a short but representative code snippet (SSCCE) along with sample input, the expected output and the actual output including exact error messages is most helpful here - if we can reproduce the issue locally we will be able to help much better.
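        As a rough sketch of what such debugging output could look like: Data::Dumper with Useqq turned on prints strings with quotes and non-printable characters escaped, and Devel::Peek's Dump shows Perl's internal view of the string, including the UTF8 flag. The helper name debug_string and the sample value are made up for illustration; in the real script you would dump whatever comes back from the SOAP call, in both environments, and compare.

```perl
use strict;
use warnings;
use Data::Dumper;
use Devel::Peek;

# Hypothetical helper: returns a one-line, escaped view of a value.
sub debug_string {
    my ($label, $value) = @_;
    local $Data::Dumper::Useqq  = 1;  # escape quotes, control and non-ASCII characters
    local $Data::Dumper::Terse  = 1;  # omit the "$VAR1 = " prefix
    local $Data::Dumper::Indent = 0;  # keep the output on one line
    return "$label: " . Dumper($value);
}

# Sample value only; assume the real one comes from the SOAP call.
my $SOAPResult = qq{say "hi" \x{e4}};

# Under CGI, STDERR usually ends up in the web server's error log.
print STDERR debug_string('SOAPResult', $SOAPResult), "\n";

# Dump writes Perl's internal representation (flags, bytes) to STDERR.
Dump($SOAPResult);
```

        Running the same lines on the old and the new system and diffing the error logs should show fairly quickly whether the strings already differ when they arrive in the script.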

        While I can't say much about how Java, JBoss, or Apache play a role in all this, I can say with a high degree of certainty that there have been no changes in Perl itself between v5.10.1 and v5.16.3 that would explain different handling of double quotes in strings interpolated into the HTML (while there has been a lot of work done on Unicode handling over the years). I would agree that based on your descriptions so far it is unlikely that the issue of Unicode/UTF-8 handling is connected to the issue of double quotes.

        Code like print qq{<input type="text" name="mytext" value="$SOAPResult"/>}; would always have been a problem if $SOAPResult contained double quotes.
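        To see this without a browser, here is a small sketch that needs no modules. The function attribute_value_as_parsed is a made-up, very rough stand-in for how a browser reads a double-quoted attribute (everything up to the next double quote); real HTML parsing is more involved, so treat it as an illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a browser: a double-quoted attribute value
# ends at the next double quote, wherever that quote came from.
sub attribute_value_as_parsed {
    my ($html, $name) = @_;
    my ($value) = $html =~ /\b\Q$name\E="([^"]*)"/;
    return $value;
}

# Sample value containing double quotes.
my $SOAPResult = q{He said "hello" to me};

# Plain interpolation, as in the code under discussion: no escaping happens.
my $html = qq{<input type="text" name="mytext" value="$SOAPResult"/>};

print "sent:   $SOAPResult\n";
print "parsed: ", attribute_value_as_parsed($html, 'value'), "\n";
```

        The "parsed" line stops right after "He said ", because the first double quote inside the data ends the attribute. Nothing in this depends on the Perl version.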

        Since you say the HTML is output with print and not a module, and it sounds like there was previously no module being used to do escaping (like HTML::Entities or CGI.pm's escapeHTML), the only other possible culprit might be whatever Perl modules you are using to fetch and store the data (SOAP::Lite, or perhaps DBI), or the source of the data itself. For example, as huck said, perhaps there was previously a validation of data going into the database that kept double quotes out of the database entries, or there was some layer somewhere doing escaping of the values coming out of the database. Or it's something simpler like there never was any validation, and over time you've simply accumulated more and more records in the DB that bring to light this issue which was always present.

        Without more information (= code to reproduce), this is a bit of a stab in the dark, but my best guess at the moment is that you could start with looking at what SOAP::Lite is doing, i.e. what strings it was giving your code before and after the upgrade.