Re: Storing UTF-8 data into database from scraped web page
by haj (Vicar) on Jun 14, 2018 at 19:08 UTC
The DBD::mysql documentation states that setting mysql_enable_utf8 only has an effect for incoming data if it is passed as part of the call to connect(). Your snippet $dbh->{mysql_enable_utf8} = 1 indicates that you've set the attribute on an already connected database handle - this is too late.
So either specify the attribute in the call to connect(), or execute the statement SET NAMES utf8.
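A minimal sketch of the first option (DSN, credentials and table are placeholders):
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Sketch only: DSN, user and password are placeholders.
# The attribute goes into the hash passed to connect(), not onto $dbh afterwards.
my $dbh = DBI->connect(
    'dbi:mysql:database=scrape;host=localhost',
    'user', 'password',
    { RaiseError => 1, mysql_enable_utf8 => 1 },
);

# ... or, on an existing connection:
# $dbh->do('SET NAMES utf8');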
Re: Storing UTF-8 data into database from scraped web page
by Your Mother (Archbishop) on Jun 14, 2018 at 20:46 UTC
Every part of the handling must be correct, in and out. Scraper is probably doing the right thing, returning decoded utf-8 strings into your Perl, so your DB level is the most likely problem. You can enable UTF-8 handling on the connection, but is the table or the table.column actually using that charset? If not, you may be stuffing binary UTF-8 into Latin-1. Look at the table in question; drop the LIMIT from the query below or add a WHERE for your schema:
mysql> SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM information_schema.SCHEMATA limit 1;
+--------------------+----------------------------+------------------------+
| SCHEMA_NAME        | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+--------------------+----------------------------+------------------------+
| information_schema | utf8                       | utf8_general_ci        |
+--------------------+----------------------------+------------------------+
If you get latin1 and latin1_swedish_ci (one of mysql's many awful defaults) then you should update the table, the column, or both to CHARACTER SET utf8 COLLATE utf8_general_ci. This is potentially dangerous for the data already there, so back up your DB first.
Update: s/encoded utf-8/decoded utf-8/ to maintain pedantic semantics which are ultimately less confusing.
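Something along these lines, with made-up DSN, table and column names (again: back up first):
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Sketch only -- DSN, credentials, table and column names are all made up.
my $dbh = DBI->connect(
    'dbi:mysql:database=scrape;host=localhost',
    'user', 'password',
    { RaiseError => 1, mysql_enable_utf8 => 1 },
);

# Convert the whole table and its text columns in one go ...
$dbh->do(q{
    ALTER TABLE scraped_pages
    CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci
});

# ... or just the one column that holds the scraped text.
$dbh->do(q{
    ALTER TABLE scraped_pages
    MODIFY body TEXT CHARACTER SET utf8 COLLATE utf8_general_ci
});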
Re: Storing UTF-8 data into database from scraped web page
by choroba (Cardinal) on Jun 14, 2018 at 20:32 UTC
Note that DBD::mysql is broken, especially its encoding support. You might try DBD::MariaDB instead - it's not yet production ready, but it tries to fix all the old problems. See the mailing list for the announcement and links to details.
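A rough sketch of what switching looks like (DSN and credentials are placeholders); DBD::MariaDB is supposed to handle Unicode by default, so there is no mysql_enable_utf8-style knob to remember:
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Sketch only -- DSN and credentials are placeholders.
my $dbh = DBI->connect(
    'dbi:MariaDB:database=scrape;host=localhost',
    'user', 'password',
    { RaiseError => 1 },
);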
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
I have no illusions that MySQL or its clients are top shelf tools but I have been using DBD::mysql for almost 20 years at this point without anything I would call broken behavior, just lots of terrible default behavior that acts broken until you sort it out. So could you elaborate on the broken encoding support?
See for example #20, #23, #35, or #47. See also dbdimp.c:5933 and below on how numbers were recognized in DBD::mysql. For more information, read any other pull request.
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Storing UTF-8 data into database from scraped web page
by thanos1983 (Parson) on Jun 14, 2018 at 18:54 UTC
#!/usr/bin/env perl
use Encode;
use strict;
use warnings;
# encode_utf8() turns the decoded string (with a literal U+2019) into UTF-8 bytes for output
print encode_utf8("<p>What\x{2019}s up with the water ??</p>"), $/;
__END__
$ perl test.pl
<p>What’s up with the water ??</p>
Hope this helps, BR
Seeking for Perl wisdom...on the process of learning...not there...yet!
Hello nysus,
Did you see the updated sample of code? Did it not work for you?
I just ran one more test with the HTML::Entities module, and it worked in both cases, whether including the HTML entity or the code tag.
See sample of code:
BR / Thanos
Seeking for Perl wisdom...on the process of learning...not there...yet!
Yeah, I tried that. It just makes things even uglier. Output from mysql looks like this:
<p><p>Whatâ s up with the water ??</p>
Output when I dump the scraped content looks like this:
<p>Whatâ~@~Ys up with the water ??</p>
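For what it's worth, that pattern is what UTF-8 bytes tend to look like when they get re-read as Latin-1 somewhere along the way. A minimal sketch that reproduces it, assuming the same test string as above:
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed scenario: the scraper hands back a properly decoded string,
# but somewhere downstream its UTF-8 bytes get re-read as Latin-1.
my $str   = "What\x{2019}s up with the water ??";
my $bytes = encode('UTF-8', $str);          # U+2019 becomes the bytes E2 80 99
print unpack('H*', $bytes), "\n";           # dump the raw bytes in hex

# Misreading those three bytes as Latin-1 turns one character into three
# (a-circumflex plus two control characters), which is the kind of
# "â ..." / "â~@~Y" junk shown above.
my $mojibake = decode('latin1', $bytes);
print encode('UTF-8', $mojibake), "\n";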
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use Text::Unidecode;
# unidecode() transliterates non-ASCII characters to their closest ASCII approximations
my $encode = unidecode("What\x{2019}s up with the water ??");
print $encode . "\n";
__END__
$ perl test.pl
What's up with the water ??
Update: Don't forget that you also need to define the column in your table as:
`Column` VARCHAR(150) CHARACTER SET utf8 NOT NULL UNIQUE,
You do not need the NOT NULL UNIQUE parts; I just usually add them to my columns to avoid duplicates and the like.
Update2: Sample of the whole code that I tested:
The conf.ini file:
Hope this helps, BR.
Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Storing UTF-8 data into database from scraped web page
by nysus (Parson) on Jun 14, 2018 at 22:12 UTC
OK, not exactly sure how or why, but switching my database from UTF8 to utf8mb4 worked. See this guide for details.
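For reference, the switch boils down to something like this sketch (database/table names and credentials are made up, and the mysql_enable_utf8mb4 attribute needs a reasonably recent DBD::mysql):
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Sketch only -- database/table names and credentials are made up.
my $dbh = DBI->connect(
    'dbi:mysql:database=scrape;host=localhost',
    'user', 'password',
    { RaiseError => 1, mysql_enable_utf8mb4 => 1 },
);

# Move the default charset for the schema and an existing table to utf8mb4.
$dbh->do('ALTER DATABASE scrape CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci');
$dbh->do(q{
    ALTER TABLE scraped_pages
    CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
});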
Re: SOLVED: Storing UTF-8 data into database from scraped web page
by Anonymous Monk on Jun 15, 2018 at 13:14 UTC
Another trick that can help in diagnosing problems like these is to write a script that dumps the hexadecimal bytes. There are just too many places where encoding can be attempted by different participating pieces of software, all of which are trying their best to be helpful. I have even been known to go to the actual disk files where the information is stored and display some of the pages with the hexdump command. (Likewise for the incoming data from the site: "wget" it with all language features turned off and hexdump that file.) Determine what bytes are coming in from the site and what bytes are getting written to disk. From this you can piece together where conversion is happening and where you want it to be happening.
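A minimal sketch of such a dumper (the filename is just a placeholder):
#!/usr/bin/env perl
use strict;
use warnings;

# Minimal hex dumper: read a file as raw bytes and print offset plus hex,
# 16 bytes per line, so the scraped bytes and the stored bytes can be
# compared side by side.
my $file = shift // 'scraped.html';   # placeholder filename
open my $fh, '<:raw', $file or die "Cannot open $file: $!";
my $offset = 0;
while (read($fh, my $chunk, 16)) {
    printf "%08x  %s\n", $offset, join ' ', map { sprintf '%02x', $_ } unpack 'C*', $chunk;
    $offset += length $chunk;
}
close $fh;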