Garbage Text Discovered
I was recently forced to upgrade a client's MySQL server from 5.7 to 8.0, as Amazon was about to start charging extra for running an outdated version. The automated AWS upgrade went smoothly for the production server, but it refused to upgrade the staging server because the pre-patch compatibility check failed. I decided that the easiest fix would be to just create a new server already running MySQL 8.0 and import the database via SQL dump.
Doing so seemed to go smoothly, but while testing the staged upgrade I noticed that multi-byte characters were showing up as garbage text. When I first joined the project there were already a lot of mis-encoded “garbage” characters in the records, which I had assumed came from users copy-pasting funky OCR text, but when I took a closer look I found that recently created records were garbled too. This was easy to miss, as most of the multi-byte characters were just angled single quote marks and em dashes input by macOS users.
The short of it is that the MySQL server had originally been set up to use the utf8mb3 character encoding (recently upgraded to utf8mb4), while the PHP application that used it did not have an encoding explicitly configured anywhere, apparently defaulting to latin1. The result was that the text was being stored incorrectly, but the database and the application were synchronized such that two wrongs actually made a right: while new text was translated into garbage for storage, which showed up as garbage in database clients, it was translated back into non-garbage when viewed on the website.
I will now show you how we figured this out, and how we ultimately fixed it.
The Investigation
When investigating the character encoding of your MySQL database, one of the first things you should look at is its character set configurations. Use this query to do so:
mysql> SELECT @@character_set_client, @@character_set_connection, @@character_set_results;
Using DBeaver as my desktop SQL client, the results I got were utf8mb4, utf8mb4, NULL. This is somewhat correct, as we want all three results to be utf8mb4. It was a bit misleading, however, as I got different results when I queried from a MySQL command-line terminal: latin1, latin1, latin1. This should have been my first clue, but unfortunately it wasn't until late in the investigation that I looked into this.
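Session variables only tell part of the story, and as I was seeing, different clients may set them differently. For a wider view that also includes the server-wide and per-database defaults, these companion queries are handy:
mysql> SHOW VARIABLES LIKE 'character_set_%';
mysql> SHOW VARIABLES LIKE 'collation_%';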
For a control case, I used “Nihongo: 日本語”, as I could quickly locate “Nihongo” and see whether “日本語” remained intact. When this text was garbled, 日本語 displayed as the garbage characters 日本語. It looked fine in the web application but looked like garbage in desktop clients. I tried exporting the database various ways, but the dumped queries seemed to always be corrupt: either they looked correct at export but turned into garbage at import, or they exported as garbage and remained garbage when imported.
Much thanks to my co-worker T.R., who focused on 語 to figure out the specifics of the bungled character encoding:
- PHP had been unknowingly storing UTF-8 strings as latin1 in MySQL. So PHP would send 語 (UTF-8 E8 AA 9E) . . .
- . . . and MySQL would store it in latin1 as 語.
- When the columns were changed to utf8mb4, 語 was transcoded again into UTF-8: C3 A8 C2 AA C5 BE. (MySQL's latin1 is actually Windows cp1252, in which byte 9E is ž, U+017E; hence C5 BE rather than the C2 9E that strict ISO-8859-1 would suggest.)
Therefore:
- the database was storing 語 as C3 A8 C2 AA C5 BE (the one-line check below reproduces this) . . .
- . . . and because PHP did not specify any preferred character set for the MySQL connection, MySQL would transcode it back to latin1 when sending it to PHP as 語 (E8 AA 9E) . . .
- . . . and then PHP would interpret it as UTF-8: 語 (E8 AA 9E).
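That byte arithmetic can be reproduced with nothing but MySQL built-ins. Starting from the raw bytes E8 AA 9E, one latin1-to-utf8mb4 conversion pass yields the stored garbage:
mysql> SELECT HEX(CONVERT(CONVERT(X'E8AA9E' USING latin1) USING utf8mb4));
This returns C3A8C2AAC5BE, matching what the proof-of-concept below reports.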
T.R. came up with this proof-of-concept query (and I cleaned it up to correct deprecated syntax) with “mojibake” to stand in as the “character monster”:
mysql> SET @a := CONVERT((
SELECT mojibake
FROM
(
SELECT _utf8mb4 '語' AS mojibake
) AS _1
) USING latin1);
SET @b = CONVERT(CAST(@a AS BINARY) USING utf8mb4);
SET NAMES utf8mb4;
SELECT @@character_set_client "client",
@@character_set_connection "connection",
@@character_set_results "results",
_1.mojibake,
HEX(_1.mojibake) "mojibake_as_hex",
@a "as_latin1",
HEX(@a) "latin1_hex",
@b "latin1_as_utf8",
HEX(@b) "utf8_hex",
CONVERT(CAST(CONVERT(_1.mojibake USING latin1) AS BINARY) USING utf8mb4) "condensed"
FROM (
SELECT _utf8mb4 '語' AS mojibake
) AS _1;
Which resulted in:
Name | Value | As Hex
---|---|---
client | utf8mb4 |
connection | utf8mb4 |
results | utf8mb4 |
mojibake | 語 | C3A8C2AAC5BE
as_latin1 | 語 | E8AA9E
latin1_as_utf8 | 語 | E8AA9E
condensed | 語 | E8AA9E
Solutions
To repair the characters, I adapted T.R.'s encoding-test query into an UPDATE query. This was the general form:
mysql> UPDATE `contacts`
SET `notes` = CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)
WHERE HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)) IS NOT NULL;
This re-encodes the value a few times, generating the “repaired” text, then overwrites the stored value. If the conversion fails (because the value is already good and therefore can't be “more fixed”), it generates NULL, so the WHERE clause filters out any records where this failure might occur.
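Since the UPDATE is destructive, it is worth a dry run first. This isn't from the original migration script, just a cautionary sketch (assuming InnoDB tables, where DML is transactional):
mysql> START TRANSACTION;
UPDATE `contacts`
SET `notes` = CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)
WHERE HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)) IS NOT NULL;
-- eyeball some repaired rows before making it permanent
SELECT `id`, `notes` FROM `contacts` LIMIT 20;
COMMIT; -- or ROLLBACK; if anything looks wrong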
Of course, a variation of this query had to be created for every column of every table which might be allowed to store multi-byte characters. (Fields like email addresses ought to be restricted to ASCII characters.)
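Rather than hunting for those columns by hand, the candidates can be enumerated from the catalog. A sketch, assuming the schema is named webapp_prod as in the dump command further down:
mysql> SELECT table_name, column_name, column_type
FROM information_schema.columns
WHERE table_schema = 'webapp_prod'
AND data_type IN ('char', 'varchar', 'tinytext', 'text', 'mediumtext', 'longtext');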
The migration script worked fine for the production database, but after repairing the characters I still needed to migrate the staging database to its new server.
Another thing I figured out is that the database dump text file was being generated as utf8mb4, encoded as latin1, encoded as utf8mb4 again (yet another layer of transcoding), which is why the text imported as garbage no matter what the content of the queries looked like. The workaround I found was to force the database dump to output latin1:
> mysqldump --defaults-extra-file=db_connection.cnf --databases webapp_prod --no-tablespaces --hex-blob --default-character-set=latin1 > dbdump.sql
(I used db_connection.cnf as a configuration file containing connection parameters, e.g. host, username, and password. --no-tablespaces was used because my user didn't have permission to dump tablespaces (which weren't needed anyway), and --hex-blob was used to force any binary data to hexadecimal.)
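For reference, the db_connection.cnf file is nothing exotic; a minimal sketch with placeholder values looks like:
[client]
host=staging-db.example.com
user=dumper
password=********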
Running this dump's queries on the new server successfully imported records without any garbage text.
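The import itself was just as plain; something along these lines works. Because the dump was taken with --databases, it carries its own CREATE DATABASE and USE statements, so no database argument is needed:
> mysql --defaults-extra-file=db_connection.cnf < dbdump.sql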
Note that if your MySQL server resides on AWS, restrictions on your user permissions may require you to comment out or delete the following lines from your SQL dump, depending on whether your database client will halt on the errors they generate or continue through them:
SET @MYSQLDUMP_TEMP_LOG_BIN = @@SESSION.SQL_LOG_BIN;
SET @@SESSION.SQL_LOG_BIN= 0;
SET @@GLOBAL.GTID_PURGED=/*!80000 '+'*/ '';
SET @@SESSION.SQL_LOG_BIN = @MYSQLDUMP_TEMP_LOG_BIN;
It gets tricky with MariaDB
Now what if your database uses MariaDB, which is a derivative of MySQL? Well, my development database[1] just happened to be using MariaDB, and it too had a bunch of garbage. Unfortunately I could not use the repair queries without modification, as MariaDB will just power through those failed conversions and turn unencodable characters into ? instead of generating NULL. Well, it mostly powers through . . .
Because a failed conversion generating NULL wasn't reliable in MariaDB, I built a SELECT query that tested for the original value containing more characters than the converted value. As the garbled text was a multi-byte character broken out into the character representations of each of those bytes, logically the conversion would shrink the character length:
mysql> SELECT id,
CHAR_LENGTH(`notes`) AS L1,
CHAR_LENGTH(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)) AS L2,
`notes`,
CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4) "new_text"
FROM `contacts`
WHERE CHAR_LENGTH(`notes`) > CHAR_LENGTH(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4));
That worked great! It listed only records which needed correction. So I adapted my UPDATE query to use the same filter:
mysql> UPDATE `contacts`
SET `notes` = CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)
WHERE CHAR_LENGTH(`notes`) > CHAR_LENGTH(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4));
Here's where “mostly powers through” comes in. It seems as if MariaDB processes and test-writes the generated values in the SET clause before fully applying the filter from the WHERE clause, as I got errors like “Cannot convert 'utf8mb4' character 0xE382B3 to 'latin1'” on records that should have been filtered out. I found that if I added AND id > X AND id < Y to the WHERE clause so that the range of IDs included only applicable records, it worked fine (see the sketch below). If I instead tried to use a subquery to get only the applicable IDs and put them in AND id IN([subquery fetching only applicable IDs]), that failed as well.
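So the working pattern on MariaDB was the length-based filter plus a manually chosen ID range. With hypothetical bounds (picked by eyeballing the SELECT's results; they are not from the real migration), a batch looked like:
mysql> UPDATE `contacts`
SET `notes` = CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)
WHERE CHAR_LENGTH(`notes`) > CHAR_LENGTH(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4))
AND id > 100 AND id < 250;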
The Final Step
After I built and ran a migration script to repair the garbled text in my databases, there was still one final fix to be made: the PHP application needed to be configured to explicitly communicate with the database using utf8mb4. The project used Phalcon as its framework, and the solution turned out to be as simple as adding the following line to the configuration array wherever the database connection was initialized (which was two places in that project):
'charset' => 'utf8mb4'
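In context, the connection setup ends up looking roughly like this (a sketch, not the project's actual code; the host and credentials are placeholders, and only the charset line is the fix):
$di->setShared('db', function () {
    return new \Phalcon\Db\Adapter\Pdo\Mysql([
        'host'     => 'localhost',
        'username' => 'webapp',
        'password' => 'secret',
        'dbname'   => 'webapp_prod',
        'charset'  => 'utf8mb4', // the one line that matters
    ]);
});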
Before applying that fix, however, I queried the character sets in-application:
$result = $this->di->get('db')->query("select @@character_set_client, @@character_set_connection, @@character_set_results;")->fetchAll();
which returned latin1 in all three cases. After adding that one line to the connection setup, I got the expected utf8mb4.
Further tests found the web application to be working correctly, without any more garbage generation.
Additional Test Queries
Here are additional queries I found useful for testing in this situation.
-- Count all rows where field has damaged encoding
SELECT COUNT(*) FROM `contacts`
WHERE HEX(`notes`) <> HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4));

-- Show all rows that have damaged encoding
SELECT `id`, `notes`, CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4) "new_text"
FROM `contacts`
WHERE HEX(`notes`) <> HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4));

-- Count potentially damaged rows that cannot be repaired
SELECT COUNT(*) FROM `contacts`
WHERE HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)) IS NULL;

-- Show rows where field cannot be repaired (because probably not actually damaged)
SELECT `id`, `notes`, CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4) "new_text"
FROM `contacts`
WHERE HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)) IS NULL;

-- Count all REPAIRABLE rows that are damaged
SELECT COUNT(*) FROM `contacts`
WHERE HEX(`notes`) <> HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4))
AND HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)) IS NOT NULL;

-- Show all REPAIRABLE rows that are damaged
SELECT `id`, `notes`, CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4) "new_text"
FROM `contacts`
WHERE HEX(`notes`) <> HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4))
AND HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)) IS NOT NULL;

-- View constraints and how much storage is being spent on each
SELECT col.column_type, col.character_octet_length, kcu.*
FROM information_schema.key_column_usage AS kcu
JOIN information_schema.columns AS col
ON kcu.table_schema = col.table_schema
AND kcu.table_name = col.table_name
AND kcu.column_name = col.column_name
WHERE col.column_type LIKE '%char%';
[1] The fact that the database servers were not all identical systems may stick out like a sore thumb to you. The fact of the matter is that some issues were caused by shady developers the client had hired in the past, including a Russian who was impersonating an American, and some dude who didn't understand how to use git and was making file changes directly on the repository. I'm still ironing out the wrinkles.