WordPress MySQL Database Encoding Fun

I’ve finished moving thesamet.com to a new web host. The most tedious thing to do was to move the WordPress SQL database to the new server.

It turns out that the database was created with latin1 character set while WordPress pushed utf8 encoded data to it. For those who got stuck in the same problem, here is what I did to transfer the data to a new server.

First, get a sql dump of your database. Use the encoding of the database when doing it (i.e. latin1). Here is how to do it from the command line:

$ mysqldump –default-character-set=latin1 -u db_user -p db_name > mydb.sql

After inspecting the dump, it turned out that this file could be considered as UTF8 encoded. this was to be expected as we had UTF8 data in the database considered as latin1 and we dumped it as latin1.

Python is known to have great support for handling Unicode, and is very comfortable to inspect strings with it. The next thing I did was to load the file with Python, and tried to decode the data as utf8.

>>> s = open(‘mydb.sql’).read()
>>> s.decode(‘utf8’)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: utf8 codec can’t decode byte \x92 in position 561129: unexpected code byte</stdin>

Great, so the first trouble comes at index 561129. Let’s see what’s in there:

>>> s[561110:561136]
‘ points of view, it\x92s good’

That \x92 looks suspicious. It turns out that this one is still encoded in latin1, and \x92 in unicode is some kind of ‘ (slanted single-quote or something). To translate it to UTF8, I did:

>>> \x92′.decode(‘latin1’).encode(‘utf8’)
\xc2\x92′

Now changing all occurrences of \x92 to \xc2\x92, can change it in places we would not want. So before we do that, we check what the other occurrences of \x92 looks like:

>>> [s[i-5:i+5] for i,ch in enumerate(s) if ch==\x92′]
[\xd7\x9c\xd7\x95\xd7\x92… ‘, ‘w, it\x92s go’]

We see that the first occurrence of \x92 should not be changed, it is within a UTF8 string. So we are safe to do

>>> s=s.replace(‘it\x92s’, ‘it\xc2\x92s’)
>>> u8=s.decode(‘utf8’)

And this time the decoding succeeded. This means that our data contains valid UTF8. My case was easy, but you might have to repeat the above process several times until you get rid of all latin1 characters in your data. It might be possible that you’ll have latin1 data that looks like utf8, and you’ll need to do even more manual search/replace.

Now write the utf8-encoded string (s, not u8) to a new file:

>>> open(‘/tmp/out.sql’, ‘w’).write(s)

After that, I’ve changed all the CREATE statements in the SQL file from CHARSET=latin1 to CHARSET=latin1.

It is important to add SET NAMES=utf8; at the top of the file.

On the destination server I created the database with UTF8 default encoding:

CREATE DATABASE db_name DEFAULT CHARSET utf8;

and imported the new SQL file into the database:

mysql –default-character-set=utf8 -u db_user -p db_name < /tmp/out.sql

and voila, it works.

This entry was posted in daily. Bookmark the permalink.

6 Responses to WordPress MySQL Database Encoding Fun

  1. Rashad says:

    Expect that the psychic will make the decisions for you.
    It is a gift and ought to be used properly and only for the good, so usually use
    your present in a positive way. Keep that in mind if your feelings
    are harm.

    my website :: online psychic (Rashad)

  2. If you aren’t home much, have children under the age of ten,
    don’t enjoy chaos, or have a busy lifestyle which would
    prevent a ferret from roaming freely, then these pet are not for you.

    Not only that, but what happens is what is supposed to happen for that ferret.
    Spa Presents – Take care of her to a day at a beauty spa exactly where she can loosen up and rejuvenate.

  3. Pingback: Python:Compile Syntax Error: non ASCII letters in a string – IT Sprite

  4. benidorm says:

    Wonderful beat ! I would like to apprentice while you amend your website, how
    can i subscribe for a weblog website? The account aided me a appropriate deal.

    I have been a little bit familiar of this your broadcast offered shiny
    transparent idea

  5. discos ssd says:

    he estado navegando en línea más de 2 horas hoy, pero nunca encontré ningún artículo interesante como el tuyo.
    Es Vale bastante lo suficiente para mí. En mi opinión, si todos propietarios
    de sitios y bloggers hacen un buen contenido como tú,
    Hey ¿Te importaría decirme qué hosting company estás trabajando con ?
    He cargado tu blog en 3 completamente diferentes navegadores y debo decir que este blog carga mucho más rápido
    que la mayoría. ¿Puede sugerir un buen proveedor de
    hospedaje de Internet a un precio razonable ?

    Felicitaciones , ¡lo aprecio!
    Me encanta cuando personas se reúnan y compartan pensamientos .
    Estupendo sitio, continúen ! mantener el fantástico trabajo !

  6. Wang Zirui led Xinjiang with 22.As much as 1,700 millimeters of rain was recorded at one weather station in the week to Monday. He also said that lithium batteries were environmentally friendly. The most effective plan for achieving it’s to get boxes of the same size if possible. TIANJIN,saucony shadow uomo offerte, Sept.Authorities were releasing water from seven reservoirs in line with a plan to limit flooding,scarpe saucony offerta, the disaster committee said.By Jon Day, Zhu Chao, Liu

Leave a Reply

Your email address will not be published. Required fields are marked *