Sorry, but this article is not currently available in Japanese.

MySQL Character Encoding Repair

Posted:

Garbage Text Discovered

I was recently forced to upgrade a client's MySQL server from 5.7 to 8.0, as Amazon was about to start charging extra for running an outdated version.  The automated AWS upgrade went smoothly for the production server, but it failed on the staging server because the pre-patch compatibility check failed.  I decided that the easiest fix would be to create a new server already running MySQL 8.0 and import the database via SQL dump.

Doing so seemed to go smoothly, but while testing the staged upgrade we noticed that multi-byte characters were showing up as garbage text.  When I first joined the project there were already a lot of mis-encoded “garbage” characters in the records, which I had assumed came from users copy-pasting funky OCR text, but when I took a closer look I found that recently created records were garbled too.  This was easy to miss, as most of the multi-byte characters were just curly single quotes and em dashes entered by macOS users.

The short of it is that the MySQL server had originally been set up to use utf8mb3 character encoding (then recently upgraded to utf8mb4), while the PHP application that used it did not have encoding explicitly configured anywhere, apparently defaulting to latin1.  The result was that the text was being stored incorrectly, but the database and the application were synchronized such that two wrongs actually made a right:  While new text was translated into garbage for storage, which showed up as garbage in database clients, it was translated back into non-garbage when viewed in the website.

I will now show you how we figured this out, and how we ultimately fixed it.

The Investigation

When investigating the character encoding of your MySQL database, one of the first things you should look at is its character set configurations.  Use this query to do so:

mysql> SELECT @@character_set_client, @@character_set_connection, @@character_set_results;

Using DBeaver as my desktop SQL client, the results I got were utf8mb4, utf8mb4, and NULL.  This is somewhat correct, as we want all three results to be utf8mb4.  It was a bit misleading, however, as I got different results when I queried from a MySQL command-line terminal:  latin1, latin1, latin1.  This should have been my first clue, but unfortunately it wasn't until late into the investigation that I looked into it.
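If your command-line session reports latin1, you can bring that session in line using MySQL's standard SET NAMES statement, which sets all three variables at once.  This is just a quick sanity check for the session, not a fix for the underlying application configuration:

mysql> SET NAMES utf8mb4;
mysql> SELECT @@character_set_client, @@character_set_connection, @@character_set_results;
       -- all three should now report utf8mb4, for this session only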

For a control case, I used “Nihongo: 日本語” as I could quickly locate “Nihongo” and see whether “日本語” remained intact.  When this text was garbled, 日本語 displayed as the garbage characters æ—¥æœ¬èªž.  It looked fine in the web application but looked like garbage in desktop clients.  I tried exporting the database various ways, but the dumped queries always seemed to be corrupt:  Either they looked correct at export but turned into garbage at import, or they exported as garbage and remained garbage when imported.
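To spot-check the control case, a query along these lines did the trick.  The contacts table and notes column are the example names used throughout this article; substitute your own.  Comparing the text against its raw bytes via HEX() is handy because the hex is unaffected by whatever transcoding your client performs:

mysql> SELECT  `id`, `notes`, HEX(`notes`)
       FROM    `contacts`
       WHERE   `notes` LIKE '%Nihongo%';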

Many thanks to my co-worker T.R., who dug in and figured out the specifics of the bungled character encoding:

  • PHP had been unknowingly storing UTF-8 strings as latin1 in MySQL.  So PHP would send 語 (UTF-8 bytes E8 AA 9E) . . .
  • . . . and MySQL, interpreting those bytes as latin1, would store them as the three characters èªž.
  • When the columns were changed to utf8mb4, èªž was transcoded again into UTF-8:  C3 A8 C2 AA C5 BE.

Therefore:

  • the database was storing the text as èªž (C3 A8 C2 AA C5 BE) . . .
  • . . . and because PHP did not specify any preferred character set for the MySQL connection, MySQL would transcode it back to latin1 when sending it to PHP, producing the bytes E8 AA 9E . . .
  • . . . and then PHP would interpret those bytes as UTF-8, displaying them as 語 (E8 AA 9E).

T.R. came up with this proof-of-concept query (and I cleaned it up to correct deprecated syntax) with “mojibake” to stand in as the “character monster”:

mysql>  SET @a := CONVERT((
            SELECT mojibake
            FROM
            (
              SELECT _utf8mb4 'èªž' AS mojibake
            ) AS _1
          ) USING latin1);
        SET @b := CONVERT(CAST(@a AS BINARY) USING utf8mb4);
        SET NAMES utf8mb4;
        SELECT  @@character_set_client "client",
                @@character_set_connection "connection",
                @@character_set_results "results",
                _1.mojibake,
                HEX(_1.mojibake) "mojibake_as_hex",
                
                @a "as_latin1",
                HEX(@a) "latin1_hex",
                
                @b "latin1_as_utf8",
                HEX(@b) "utf8_hex",
                CONVERT(CAST(CONVERT(_1.mojibake using latin1) AS BINARY) USING utf8mb4) "condensed"
        FROM (
          SELECT _utf8mb4 'èªž' AS mojibake
        ) AS _1

Which resulted in:

Name            Value     As Hex
client          utf8mb4
connection      utf8mb4
results         utf8mb4
mojibake        èªž       C3A8C2AAC5BE
as_latin1       èªž       E8AA9E
latin1_as_utf8  語        E8AA9E
condensed       語        E8AA9E

Solutions

To repair the characters I adapted T.R.'s encoding test query into an UPDATE query.  This was the general form:

mysql> UPDATE `contacts`
       SET `notes` = CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)
       WHERE HEX(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)) IS NOT NULL;

This re-encodes the value a few times, generating the “repaired” text, then overwrites the stored value.  If the conversion fails, because the value is already good and therefore can't be “more fixed,” it generates NULL, so the WHERE clause filters out any records where this failure might occur.

Of course, a variation of this query had to be created for every column of every table which might be allowed to store multi-byte characters.  (Fields like email address ought to be restricted to ASCII characters.)
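Rather than hand-writing each variation, the repair statements can be generated from information_schema.  This is a minimal sketch under a couple of assumptions:  webapp_prod is the schema name from the dump command further below, and the list of data types and ASCII-only columns to exclude are placeholders to adjust for your own schema.  It prints one UPDATE per character column, which you can then review and run:

mysql> SELECT CONCAT(
           'UPDATE `', table_name, '` ',
           'SET `', column_name, '` = CONVERT(CAST(CONVERT(`', column_name, '` USING latin1) AS BINARY) USING utf8mb4) ',
           'WHERE HEX(CONVERT(CAST(CONVERT(`', column_name, '` USING latin1) AS BINARY) USING utf8mb4)) IS NOT NULL;'
       ) AS repair_statement
       FROM  information_schema.columns
       WHERE table_schema = 'webapp_prod'                                              -- your schema name
         AND data_type IN ('char', 'varchar', 'tinytext', 'text', 'mediumtext', 'longtext')
         AND column_name NOT IN ('email');                                             -- placeholder: skip ASCII-only columns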

The migration script worked fine for the production database, but after repairing the characters I still needed to migrate the staging database to its new server.

Another thing I figured out is that the database dump text file was effectively being double-encoded (utf8mb4 data read as latin1 and re-encoded as utf8mb4), which is why the text imported as garbage no matter what the content of the queries looked like.  The workaround I found was to force the database dump to output latin1:

> mysqldump --defaults-extra-file=db_connection.cnf --databases webapp_prod --no-tablespaces --hex-blob --default-character-set=latin1 > dbdump.sql

(I used db_connection.cnf as a configuration file containing connection parameters, e.g. host, username, and password.  --no-tablespaces was used because my user didn't have permission to dump tablespaces (which weren't needed anyway), and --hex-blob was used to force any binary data to hexadecimal.)
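For reference, the file passed to --defaults-extra-file is just a standard MySQL option file.  A minimal sketch of what db_connection.cnf might contain (host and credentials here are placeholders):

[client]
host=db.example.com
user=webapp_user
password=your_password_here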

Running this dump's queries on the new server successfully imported records without any garbage text.

Note that if your MySQL server resides on AWS, restrictions on your user permissions may require you to comment out or delete the following lines from your SQL dump, depending on whether your database client halts on the errors they generate or continues through them:

  • SET @MYSQLDUMP_TEMP_LOG_BIN = @@SESSION.SQL_LOG_BIN;
  • SET @@SESSION.SQL_LOG_BIN= 0;
  • SET @@GLOBAL.GTID_PURGED=/*!80000 '+'*/ '';
  • SET @@SESSION.SQL_LOG_BIN = @MYSQLDUMP_TEMP_LOG_BIN;

It gets tricky with MariaDB

Now what if your database uses MariaDB, which is a derivative of MySQL?  Well, my development database[1] just happened to be using MariaDB, and it too had a bunch of garbage.  Unfortunately I could not use the repair queries without modification, as MariaDB will just power through those failed conversions and turn unencodable characters into ? instead of generating NULL.  Well, it mostly powers through . . .

Because a failed conversion generating NULL wasn't reliable in MariaDB, I built a SELECT query that tested whether the original value contained more characters than the converted value.  Since the garbled text broke each multi-byte character out into the character representations of its individual bytes, the conversion would logically shrink the character length:

mysql>  SELECT  id,
                CHAR_LENGTH(`notes`) AS L1,
                CHAR_LENGTH(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4)) AS L2,
                `notes`,
                CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4) "new_text"
        FROM    `contacts`
        WHERE   CHAR_LENGTH(`notes`) > CHAR_LENGTH(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4));

That worked great!  It listed only records which needed correction.  So I adapted my UPDATE query to use the same filter:

mysql> UPDATE `contacts`
       SET `notes` = CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)
       WHERE CHAR_LENGTH(`notes`) > CHAR_LENGTH(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4));

Here's where “mostly powers through” comes in.  It seems as if MariaDB evaluates the generated values in the SET clause before fully applying the filter from the WHERE clause, as I got errors like “Cannot convert 'utf8mb4' character 0xE382B3 to 'latin1'” on records that should have been filtered out.  I found that if I added AND id > X AND id < Y to the WHERE clause so that the range of IDs included only applicable records, it worked fine.  Trying instead to use a subquery, as in AND id IN([subquery fetching only applicable IDs]), also failed.
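For reference, this is what the id-range workaround looked like in practice.  The bounds 1000 and 2000 are placeholders; I determined the applicable range from the SELECT query above:

mysql> UPDATE `contacts`
       SET `notes` = CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4)
       WHERE CHAR_LENGTH(`notes`) > CHAR_LENGTH(CONVERT(CAST(CONVERT(`notes` USING latin1) AS BINARY) USING utf8mb4))
         AND id > 1000 AND id < 2000;  -- placeholder bounds covering only the damaged records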

The Final Step

After I built and ran a migration script to repair the garbled text in my databases, there was still one final fix to be made:  The PHP application needed to be configured to explicitly communicate with the database using utf8mb4.  The project used Phalcon as its framework, and the solution turned out to be as simple as adding the following line to the configuration array wherever the database connection was initialized (which was two places in that project):

'charset' => 'utf8mb4'

Before applying that fix, however, I queried the character sets in-application:

$result = $this->di->get('db')->query("select  @@character_set_client, @@character_set_connection, @@character_set_results;")->fetchAll();

which returned latin1 in all three cases.  After adding that one line to the connection setup, I got the expected utf8mb4.

Further tests found the web application to be working correctly, without any more garbage being generated.

Additional Test Queries

Here are additional queries I found useful for testing in this situation.

-- Count all rows where field has damaged encoding
SELECT  COUNT(*)
from    `contacts`
where   HEX(`notes`) <> HEX(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4))

-- Show all rows that have damaged encoding
SELECT  `id`,
        `notes`,
        CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4) "new_text"
from    `contacts`
where   HEX(`notes`) <> HEX(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4))



-- Count potentially damaged rows that cannot be repaired
SELECT  COUNT(*)
from    `contacts`
where   HEX(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4)) IS NULL

-- Show rows where field cannot be repaired (because probably not actually damaged)
SELECT  `id`,
        `notes`,
        CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4) "new_text"
from    `contacts`
where   HEX(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4)) IS NULL



-- Count all REPAIRABLE rows that are damaged
SELECT  COUNT(*)
from    `contacts`
where   HEX(`notes`) <> HEX(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4))
        AND HEX(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4)) IS NOT NULL

-- Show all REPAIRABLE rows that are damaged
SELECT  `id`,
        `notes`,
        CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4) "new_text"
from    `contacts`
where   HEX(`notes`) <> HEX(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4))
        AND HEX(CONVERT(CAST(CONVERT(`notes` using latin1) AS BINARY) using utf8mb4)) IS NOT NULL


-- View constraints and how much storage is being spent on each
SELECT  col.column_type,
        col.character_octet_length,
        kcu.*
FROM    information_schema.key_column_usage AS kcu
        JOIN information_schema.columns AS col
        ON   kcu.table_schema = col.table_schema
        AND  kcu.table_name = col.table_name
        AND  kcu.column_name = col.column_name
WHERE   col.column_type LIKE '%char%'

  1. The fact that the database servers were not all identical systems may stick out like a sore thumb to you.  The fact of the matter is that various issues were caused by some shady developers the client had hired in the past, including a Russian who was impersonating an American, and some dude who didn't understand how to use git and was making file changes directly on the repository.  I'm still ironing out the wrinkles.