WordPress MySQL Database Encoding Fun

I’ve finished moving thesamet.com to a new web host. The most tedious thing to do was to move the WordPress SQL database to the new server, that is why I decided to get a best wordpress hosting to improve the performance.

It turns out that the database was created with latin1 character set while WordPress pushed utf8 encoded data to it. Visit collectiveray.com for more info. For those who got stuck in the same problem, here is what I did to transfer the data to a new server.

First, get a sql dump of your database. Use the encoding of the database when doing it (i.e. latin1). Here is how to do it from the command line:

$ mysqldump --default-character-set=latin1 -u db_user -p db_name > mydb.sql

After inspecting the dump, it turned out that this file could be considered as UTF8 encoded. this was to be expected as we had UTF8 data in the database considered as latin1 and we dumped it as latin1.

Python is known to have great support for handling Unicode, and is very comfortable to inspect strings with it. The next thing I did was to load the file with Python, and tried to decode the data as utf8.

>>> s = open('mydb.sql').read()
>>> s.decode('utf8')
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: utf8 codec can't decode byte \x92 in position 561129: unexpected code byte

Great, so the first trouble comes at index 561129. Let’s see what’s in there:

>>> s[561110:561136]
' points of view, it\x92s good'

That \x92 looks suspicious. It turns out that this one is still encoded in latin1, and \x92 in unicode is some kind of ‘ (slanted single-quote or something). To translate it to UTF8, I did:

>>> '\x92'.decode('latin1').encode('utf8')
'\xc2\x92'

Now changing all occurrences of \x92 to \xc2\x92, can change it in places we would not want. So before we do that, we check what the other occurrences of \x92 looks like:

>>> [s[i-5:i+5] for i,ch in enumerate(s) if ch=='\x92']
['\xd7\x9c\xd7\x95\xd7\x92... ', 'w, it\x92s go']

We see that the first occurrence of \x92 should not be changed, it is within a UTF8 string. So we are safe to do

>>> s=s.replace('it\x92s', 'it\xc2\x92s')
>>> u8=s.decode('utf8')

And this time the decoding succeeded. This means that our data contains valid UTF8. My case was easy, but you might have to repeat the above process several times until you get rid of all latin1 characters in your data. It might be possible that you’ll have latin1 data that looks like utf8, and you’ll need to do even more manual search/replace.

Now write the utf8-encoded string (s, not u8) to a new file:

>>> open('/tmp/out.sql', 'w').write(s)

After that, I’ve changed all the CREATE statements in the SQL file from CHARSET=latin1 to CHARSET=latin1.

It is important to add SET NAMES=utf8; at the top of the file.

On the destination server I created the database with UTF8 default encoding:

CREATE DATABASE db_name DEFAULT CHARSET utf8;

and imported the new SQL file into the database:

mysql --default-character-set=utf8 -u db_user -p db_name < /tmp/out.sql

and voila, it works.

I also start searching more details about different databases including NoSQL database.

This entry was posted in daily. Bookmark the permalink.

99 Responses to WordPress MySQL Database Encoding Fun