PerlMonks  

UTF-8 and PSGI/Starman vs. CGI

by dsheroh (Monsignor)
on Mar 21, 2018 at 15:34 UTC

dsheroh has asked for the wisdom of the Perl Monks concerning the following question:

I have two originally-identical webapp servers, both connecting to the same back end database. When they were first set up, both of them ran the webapp through apache and plain old CGI. Since CGI was being extremely slow (who would've thought?), I changed one of the servers over to nginx/PSGI/starman.

Now I have triple- or quadruple-encoding (I'm not sure which) issues with UTF-8 characters on the starman box, but UTF-8 works perfectly on the apache box.

I get the exact same results whether I talk to nginx or go to the starman server directly, so I'm confident that nginx is not the issue. It's also plainly not the database, because both servers connect to the same one. The webapp source directory is on the same git branch and commit on both machines, with no untracked files, so they should be running identical code, and their config files have the same md5sum. As far as I can tell, apache/CGI vs. starman/PSGI is the only difference between them.

As an example, doing a search in the webapp for "ångbåten" returns 10 hits for "ångbåten" on the apache server, and 0 hits for "Ã¥ngbÃ¥ten" via starman. Since it's triple/quadruple-encoded and returns no hits, I figure it must be getting double-encoded on the way in (when it reads the search terms), then double-encoded again on the way out (when displaying the results).
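To make the symptom concrete, here is a minimal sketch of how one extra layer of UTF-8 encoding turns "ångbåten" into the mangled form quoted above. It uses only the core Encode module; the variable names are illustrative and nothing here is taken from the webapp itself. Note that the mangled string as quoted matches a single extra layer, so the number of layers actually involved may be lower than three or four.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The search term as characters; \x{e5} is "å" (U+00E5).
my $term = "\x{e5}ngb\x{e5}ten";

# Correct: encode once when the text leaves the program.
my $once = encode('UTF-8', $term);    # "å" becomes the bytes C3 A5

# The bug: the already-encoded bytes are treated as Latin-1 characters
# and encoded a second time.
my $twice = encode('UTF-8', $once);   # "å" is now C3 83 C2 A5

# A browser decodes UTF-8 only once, so it shows the intermediate form.
my $seen = decode('UTF-8', $twice);

binmode STDOUT, ':encoding(UTF-8)';
printf "%-14s %s\n", 'intended:',     $term;
printf "%-14s %s\n", 'as displayed:', $seen;
printf "%-14s %s\n", 'bytes once:',   unpack('H*', $once);
printf "%-14s %s\n", 'bytes twice:',  unpack('H*', $twice);
```

Each further layer makes every non-ASCII character longer again, so comparing byte lengths between the two servers is a quick way to count layers.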

Any ideas for what I might have messed up in either the starman startup or the PSGI file to cause this issue?

Bonus Question: Is there any way to get starman to put timestamps in its error log? It would be very useful to be able to tell whether an error just happened or if it's three days old.

Replies are listed 'Best First'.
Re: UTF-8 and PSGI/Starman vs. CGI
by Your Mother (Archbishop) on Mar 21, 2018 at 16:39 UTC

    Probably your setup is broken all the way through and it only acts correctly in the CGI/Apache level because it's broken in the same ways in the same places. For example, maybe you're not reading or writing UTF-8 in your DB but stuffing the bytes into Latin-1 or something, and they are read back the same way they were written, so the breakage is transparent and seems correct.

    Every step has to be right for UTF-8 to be robust/correct. That means the HTTP headers, the webserver's output of the code's output, the forms and the decoding of their input, the DB, the code that reads and writes the DB all have to be properly declared/configured and encode and decode in agreement.

    I would start with the DB because that's usually the root of the problem in my experience. Google for "check table charset {DB type}" or something like that to make sure the tables are using UTF-8. Then work backwards: DBI+yourDBD calls with proper UTF-8 setup, then decoding form input and matching output-level encoding in your code, then the app server (FCGI/nginx), then the webserver. Making errors fatal at all levels is helpful too.
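As a rough illustration of the "decode form input, encode output" step, here is a minimal PSGI-style sketch that decodes exactly once on the way in and encodes exactly once on the way out. It is not how Koha does it; the `query_param` helper and the `q` parameter name are hypothetical, and the hand-rolled query parsing only exists to keep the example free of non-core modules.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: fetch one parameter from a raw query string.
sub query_param {
    my ($query, $name) = @_;
    for my $pair (split /[&;]/, $query // '') {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && $key eq $name;
        $value //= '';
        $value =~ tr/+/ /;
        $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;  # percent-decoding yields bytes
        return decode('UTF-8', $value);               # bytes -> characters, once
    }
    return;
}

# A PSGI application is a code reference that takes the environment hash.
my $app = sub {
    my ($env) = @_;
    my $term = query_param($env->{QUERY_STRING}, 'q') // '';

    # Work with characters internally; encode only at the output boundary.
    my $body = encode('UTF-8', "You searched for: $term\n");

    return [
        200,
        [ 'Content-Type'   => 'text/plain; charset=utf-8',
          'Content-Length' => length $body ],
        [ $body ],
    ];
};
```

In a real `.psgi` file the last statement would be `$app;` so that the server receives the code reference. If some other layer (a framework, a middleware, an I/O layer on a filehandle) also encodes or decodes, the second pass is where the extra layer comes from.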

    Tangent: uWSGI is a better choice than Starman.

      Tangent: uWSGI is a better choice than Starman.

      May I request your reasoning/opinion on this? Curious as I use Starman for one of my larger projects, and haven't looked at any other options since day one as it just worked.

        The controls and options are deeper and it is much more robust. Starman starts dropping requests and such under load. I suspect my problems were largely an edge case caused by legacy code and EOL'd Linux but I had nothing but straight up segfaults and mysterious socket failures pointing to ancient unconfirmed tickets trying to get Starman working at work. Here is one of many benchmarks out there. I really wanted to like Starman better. I'm gung-ho for Perl even when it's not the best option but in this case, for me at least, there was nothing at all to recommend the Perl side.

      Probably your setup is broken all the way through and it only acts correctly in the CGI/Apache level because it's broken in the same ways in the same places
      Ugh. That's a possibility I really don't want to think about, because the webapp in question isn't a small, in-house project, it's a very large open source project (Koha, to be specific) and we really don't have the time or manpower to do a thorough audit of how it handles character encodings.

      Still, double-checking the database settings and taking a look at uWSGI are low-hanging fruit which can easily fit into the schedule, so I'll at least cross my fingers and try those before doing anything drastic. Thanks!

        Yeah. No fun at all if so. It was so for us and took a lot of work to fix. I don't know if this is the right place, but in case you haven't seen it -> Charsets/Encoding in Koha. I still recommend uWSGI but I don't think it will help with encoding problems, just performance and stability.

Re: UTF-8 and PSGI/Starman vs. CGI
by 1nickt (Canon) on Mar 21, 2018 at 15:55 UTC

    Is there any way to get starman to put timestamps in its error log? It would be very useful to be able to tell whether an error just happened or if it's three days old.

    You can run your starman script under daemontools as a service and get the log with multilog, which prepends a hi-res timestamp to each line. (https://cr.yp.to/daemontools.html)

    (In fact, you can probably use multilog to capture the output of any program, see https://cr.yp.to/daemontools/multilog.html.)
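If adding daemontools is not an option, a different and more limited technique is to stamp warnings from inside the application. The sketch below assumes it is placed in the `.psgi` file; `stamp_line` is a hypothetical helper. It only affects messages that go through Perl's `warn`; anything written straight to STDERR bypasses it, and a framework may install its own handler that replaces this one.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: prefix one log line with a UTC timestamp.
# The epoch argument is optional and defaults to the current time.
sub stamp_line {
    my ($line, $epoch) = @_;
    $epoch //= time;
    return strftime('[%Y-%m-%dT%H:%M:%SZ] ', gmtime($epoch)) . $line;
}

# Route every warn() through the helper before it reaches the error log.
$SIG{__WARN__} = sub { print STDERR stamp_line($_[0]) };
```

With this in place, `warn "search failed\n"` would show up in the log with a leading UTC timestamp in square brackets.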


    The way forward always starts with a minimal test.
Re: UTF-8 and PSGI/Starman vs. CGI
by Anonymous Monk on Mar 22, 2018 at 12:34 UTC
    Is it remotely possible that one of them thinks it is UTF-16? A UTF data-stream should not be "double encoded." The multibyte character sequences are self-identifying as such – when you look at any byte (or 16-bit word) you can tell if it is part of a sequence, and those characters won't be "encoded" a second time. Do all the players in this game know to expect UTF? A hex-dump program can be very helpful here ... you need to see the actual bytes.
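On the hex-dump suggestion: valid UTF-8 that is read as UTF-8 is indeed not encoded again, but if the encoded bytes are handed to an encoder as though they were Latin-1 characters, a second layer does get added. A small sketch for looking at the actual bytes and estimating the number of layers follows. `hexdump` and `utf8_layers` are hypothetical helper names, and the layer count is a heuristic that can be fooled by text that legitimately contains sequences such as "Ã¥".

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hexdump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Hypothetical helper: count how many times a byte string can be strictly
# decoded as UTF-8 before it stops being valid or stops fitting in bytes.
sub utf8_layers {
    my ($bytes) = @_;
    my $layers = 0;
    while (1) {
        my $copy  = $bytes;                 # FB_CROAK may modify its input
        my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
        last unless defined $chars;         # not valid UTF-8: stop
        last if $chars eq $bytes;           # pure ASCII: nothing to peel off
        $layers++;
        last if $chars =~ /[^\x00-\xFF]/;   # wide characters cannot be bytes
        utf8::downgrade($chars);            # treat the result as bytes again
        $bytes = $chars;
    }
    return $layers;
}

print hexdump("\xc3\xa5"), "\n";            # "å" encoded once
print hexdump("\xc3\x83\xc2\xa5"), "\n";    # "å" encoded twice
print utf8_layers("\xc3\x83\xc2\xa5"), "\n";
```

Running the same check on the raw request bytes, the value stored in the database, and the response body on each server should show at which step the byte sequences start to differ.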
