WordPress MySQL Database Encoding Fun
Apr 8th, 2007 by thesamet
I’ve finished moving thesamet.com to a new web host. The most tedious thing to do was to move the WordPress SQL database to the new server.
It turns out that the database was created with latin1 character set while WordPress pushed utf8 encoded data to it. For those who got stuck in the same problem, here is what I did to transfer the data to a new server.
First, get a sql dump of your database. Use the encoding of the database when doing it (i.e. latin1). Here is how to do it from the command line:
After inspecting the dump, it turned out that this file could be considered as UTF8 encoded. this was to be expected as we had UTF8 data in the database considered as latin1 and we dumped it as latin1.
Python is known to have great support for handling Unicode, and is very comfortable to inspect strings with it. The next thing I did was to load the file with Python, and tried to decode the data as utf8.
>>> s.decode(’utf8′)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: utf8 codec can’t decode byte \x92 in position 561129: unexpected code byte</stdin>
Great, so the first trouble comes at index 561129. Let’s see what’s in there:
‘ points of view, it\x92s good’
That \x92 looks suspicious. It turns out that this one is still encoded in latin1, and \x92 in unicode is some kind of ‘ (slanted single-quote or something). To translate it to UTF8, I did:
‘\xc2\x92′
Now changing all occurrences of \x92 to \xc2\x92, can change it in places we would not want. So before we do that, we check what the other occurrences of \x92 looks like:
[‘\xd7\x9c\xd7\x95\xd7\x92… ‘, ‘w, it\x92s go’]
We see that the first occurrence of \x92 should not be changed, it is within a UTF8 string. So we are safe to do
>>> u8=s.decode(‘utf8′)
And this time the decoding succeeded. This means that our data contains valid UTF8. My case was easy, but you might have to repeat the above process several times until you get rid of all latin1 characters in your data. It might be possible that you’ll have latin1 data that looks like utf8, and you’ll need to do even more manual search/replace.
Now write the utf8-encoded string (s, not u8) to a new file:
After that, I’ve changed all the CREATE statements in the SQL file from CHARSET=latin1 to CHARSET=latin1.
It is important to add SET NAMES=utf8; at the top of the file.
On the destination server I created the database with UTF8 default encoding:
and imported the new SQL file into the database:
and voila, it works.
No comments yet.