UTF-8 and PSGI/Starman vs. CGI

dsheroh has asked for the wisdom of the Perl Monks concerning the following question:

I have two originally-identical webapp servers, both connecting to the same back end database. When they were first set up, both of them ran the webapp through apache and plain old CGI. Since CGI was being extremely slow (who would've thought?), I changed one of the servers over to nginx/PSGI/starman.

Now I have triple- or quadruple-encoding (I'm not sure which) issues with UTF-8 characters on the starman box, but UTF-8 works perfectly on the apache box.

I get the exact same results regardless of whether I talk to nginx or go to the starman server directly, so I'm confident that nginx is not the issue. It's also plainly not the database because both of them are connecting to the same database. The webapp source directory is on the same git branch and commit with no untracked files on either machine, so they should be running identical code and their config files have the same md5sum. As far as I can tell, apache/CGI vs. starman/PSGI is the only difference between them.

As an example, doing a search in the webapp for "ångbåten" returns 10 hits for "ångbåten" on the apache server, and 0 hits for "ÃƒÂ¥ngbÃƒÂ¥ten" via starman. Since it's triple/quadruple-encoded and returns no hits, I figure it must be getting double-encoded on the way in (when it reads the search terms), then double-encoded again on the way out (when displaying the results).
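For reference, the garbled string above can be reproduced with a few lines of Perl. This is only a sketch of one possible mechanism: it assumes each extra round consists of UTF-8 bytes being misread as Windows-1252 and then encoded again, which has not been confirmed for this setup. The helper name `misread_round` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One faulty round trip: encode characters to UTF-8 bytes, then
# wrongly decode those bytes as Windows-1252 (the assumed culprit).
sub misread_round {
    my ($text) = @_;
    return decode('cp1252', encode('UTF-8', $text));
}

my $original = "\x{E5}ngb\x{E5}ten";      # "ångbåten", written with escapes
my $once     = misread_round($original);  # one bad round
my $twice    = misread_round($once);      # two bad rounds

binmode STDOUT, ':encoding(UTF-8)';
print "$original\n$once\n$twice\n";
```

Under that assumption the third line printed is the same string the starman box searched for, so two misreadings (plus the final, legitimate encoding for display) are enough to explain it.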

Any ideas for what I might have messed up in either the starman startup or the PSGI file to cause this issue?

Bonus Question: Is there any way to get starman to put timestamps in its error log? It would be very useful to be able to tell whether an error just happened or if it's three days old.

Replies are listed 'Best First'.
Re: UTF-8 and PSGI/Starman vs. CGI by Your Mother (Archbishop) on Mar 21, 2018 at 16:39 UTC
Probably your setup is broken all the way through and it only acts correctly in the CGI/Apache level because it's broken in the same ways in the same places; e.g., maybe you're not reading or writing UTF-8 in your DB but stuffing the bytes into Latin-1 or something and they are read the same way they are written so the breakage is transparent and seems correct. Every step has to be right for UTF-8 to be robust/correct. That means the HTTP headers, the webserver's output of the code's output, the forms and the decoding of their input, the DB, the code that reads and writes the DB all have to be properly declared/configured and encode and decode in agreement. I would start with the DB because that's usually the root of the problem in my experience. Google for "check table charset {DB type}" or something like that to make sure the tables are using UTF-8. Then work backwards. DBI+yourDBD calls with proper UTF-8 setup. Then decoding form input and matching output level encoding in your code. Then app server (FCGI/nginx). Then webserver. Making errors fatal at all levels is helpful too. Tangent: uWSGI is a better choice than Starman.
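The "decoding form input and matching output level encoding" step could look something like this in a bare PSGI app. It is a sketch under assumptions, not Koha's code: the parameter name `q` and the helper `query_param` are invented for illustration, and a real app would normally rely on Plack::Request or a framework instead of hand-parsing the query string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: pull one parameter out of a raw query string
# and return it as decoded characters (not bytes).
sub query_param {
    my ($query_string, $name) = @_;
    for my $pair (split /[&;]/, $query_string // '') {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && $key eq $name;
        $value //= '';
        $value =~ tr/+/ /;                              # form encoding: + means space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # percent-decoding yields bytes
        return decode('UTF-8', $value);                 # bytes -> characters, exactly once
    }
    return;
}

# A PSGI app is a code ref that takes the environment hash.
my $app = sub {
    my ($env) = @_;
    my $term  = query_param($env->{QUERY_STRING}, 'q') // '';
    my $body  = "You searched for: $term (" . length($term) . " characters)\n";
    return [
        200,
        [ 'Content-Type' => 'text/plain; charset=utf-8' ],  # declare what is sent
        [ encode('UTF-8', $body) ],                         # characters -> bytes, exactly once
    ];
};

# Simulate a request for ?q=ångbåten as a browser would percent-encode it.
my $res = $app->({ QUERY_STRING => 'q=%C3%A5ngb%C3%A5ten' });
print $res->[2][0];
```

If the reported length is 8, the input was decoded once. A length of 10 would mean the app is still counting bytes, and anything larger points at an extra encoding round somewhere upstream. In an actual .psgi file the last statement would be `$app;` instead of the simulated request.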
Re^2: UTF-8 and PSGI/Starman vs. CGI by stevieb (Canon) on Mar 21, 2018 at 16:49 UTC
"Tangent: uWSGI is a better choice than Starman." May I request your reasoning/opinion on this? Curious as I use Starman for one of my larger projects, and haven't looked at any other options since day one as it just worked.
Re^3: UTF-8 and PSGI/Starman vs. CGI by Your Mother (Archbishop) on Mar 21, 2018 at 17:14 UTC
The controls and options are deeper and it is much more robust. Starman starts dropping requests and such under load. I suspect my problems were largely an edge case caused by legacy code and EOL'd Linux but I had nothing but straight up segfaults and mysterious socket failures pointing to ancient unconfirmed tickets trying to get Starman working at work. Here is one of many benchmarks out there. I really wanted to like Starman better. I'm gung-ho for Perl even when it's not the best option but in this case, for me at least, there was nothing at all to recommend the Perl side.
Re^4: UTF-8 and PSGI/Starman vs. CGI by karlgoethebier (Abbot) on Mar 22, 2018 at 11:39 UTC
Re^5: UTF-8 and PSGI/Starman vs. CGI by dsheroh (Monsignor) on Mar 22, 2018 at 12:34 UTC
Re^4: UTF-8 and PSGI/Starman vs. CGI by stevieb (Canon) on Mar 21, 2018 at 17:33 UTC
Re^2: UTF-8 and PSGI/Starman vs. CGI by dsheroh (Monsignor) on Mar 22, 2018 at 08:06 UTC
"Probably your setup is broken all the way through and it only acts correctly in the CGI/Apache level because it's broken in the same ways in the same places" Ugh. That's a possibility I really don't want to think about, because the webapp in question isn't a small, in-house project, it's a very large open source project (Koha, to be specific) and we really don't have the time or manpower to do a thorough audit of how it handles character encodings. Still, double-checking the database settings and taking a look at uWSGI are low-hanging fruit which can easily fit into the schedule, so I'll at least cross my fingers and try those before doing anything drastic. Thanks!
Re^3: UTF-8 and PSGI/Starman vs. CGI by Your Mother (Archbishop) on Mar 22, 2018 at 14:42 UTC
Yeah. No fun at all if so. It was so for us and took a lot of work to fix. I don't know if this is the right place, but in case you haven't seen it -> Charsets/Encoding in Koha. I still recommend uWSGI but I don't think it will help with encoding problems, just performance and stability.
Re^4: UTF-8 and PSGI/Starman vs. CGI by dsheroh (Monsignor) on Mar 23, 2018 at 11:07 UTC
Re: UTF-8 and PSGI/Starman vs. CGI by 1nickt (Canon) on Mar 21, 2018 at 15:55 UTC
"Is there any way to get starman to put timestamps in its error log? It would be very useful to be able to tell whether an error just happened or if it's three days old." You can run your `starman` script under `daemontools` as a service and get the log with `multilog`, which prepends a hi-res timestamp to each line. (https://cr.yp.to/daemontools.html) (In fact, you can probably use `multilog` to capture the output of any program, see https://cr.yp.to/daemontools/multilog.html.) The way forward always starts with a minimal test.
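Where installing daemontools is not an option, a small Perl filter can do roughly the same job. The following is a minimal sketch, not a replacement for `multilog`: the helper name `stamp_stream` is hypothetical, the timestamp is plain local time instead of multilog's own format, and how starman's stderr reaches the filter (for example a shell pipe) depends on the setup.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: copy lines from $in to $out, prefixing each
# with the local time at which it was read.
sub stamp_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} strftime('[%Y-%m-%d %H:%M:%S] ', localtime), $line;
    }
}

# Demo on in-memory handles; a real filter would instead call
# stamp_stream(\*STDIN, \*STDOUT) and have the server's stderr piped to it.
my $fake_log = "first error\nsecond error\n";
open my $in,  '<', \$fake_log or die "open: $!";
my $stamped = '';
open my $out, '>', \$stamped  or die "open: $!";
stamp_stream($in, $out);
close $out;
print $stamped;
```

Because the time is taken when the line is read, the stamps are only as accurate as the pipe is prompt; output that sits in a buffer before reaching the filter will be stamped late.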
Re: UTF-8 and PSGI/Starman vs. CGI by Anonymous Monk on Mar 22, 2018 at 12:34 UTC
Is it remotely possible that one of them thinks it is UTF-16? A UTF data-stream should not be "double encoded." The multibyte character sequences are self-identifying as such – when you look at any byte (or 16-bit word) you can tell if it is part of a sequence, and those characters won't be "encoded" a second time. Do all the players in this game know to expect UTF? A hex-dump program can be very helpful here ... you need to see the actual bytes.
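Following the hex-dump suggestion, here is a small sketch of what the bytes for a single "å" look like in the cases discussed in this thread. The helper `hex_bytes` is a made-up name; the "UTF-8 twice" case assumes the already-encoded bytes were mistaken for Latin-1 characters and encoded again.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $char    = "\x{E5}";                     # U+00E5, the letter "å"
my $utf8    = encode('UTF-8', $char);       # correct: encoded once
my $doubled = encode('UTF-8', $utf8);       # mistake: bytes encoded again
my $utf16   = encode('UTF-16BE', $char);    # what UTF-16 would look like

printf "%-13s %s\n", 'UTF-8:',       hex_bytes($utf8);     # c3 a5
printf "%-13s %s\n", 'UTF-8 twice:', hex_bytes($doubled);  # c3 83 c2 a5
printf "%-13s %s\n", 'UTF-16BE:',    hex_bytes($utf16);    # 00 e5
```

Comparing a dump of the real request or response body against these patterns shows quickly which of the cases applies.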

