mysql utf8 latin2 utf8 weirdness

Author: seven September 10, 2007

This one is a keeper, have to write it down somewhere in case I forget it by tomorrow morning. :) Today I was doing routine deploy of one of our new websites to production server. Everything went smoothly, except the fact I couldn't see any of Croatian letters on the site texts (which were stored in mysql)!! We did migration to UTF8 some time ago, and this was highly unexpected. So I started to dig into the dump. :)

Looking at the mysqldump of a database I was a bit surprised to see Croatian letters screwed up beyond recognition. For example letter "š" was represented as "ÄąÄO„" (4 bytes), "č" was "čO" (two bytes)... etc. OMG! All our tables are "CHARSET=utf8", we force "AddDefaultCharset utf-8" in Apache and all of our markup uses "charset=utf-8" for content type encoding. BUT, we unintentionally left one very bad call in our config file which got triggered if site was running on development server - "SET NAMES 'latin2' COLLATE 'latin2_croatian_ci". The production config had SET NAMES utf8, but development didn't. Oh boy.

Ok so now I had a real mess - utf8 tables with utf8 data in them stored as re-encoded latin2 but in utf8!? I tried converting the dump from ISO8859-2 to UTF8 but that just made things worse, some of characters now used 6 bytes which isn't good. The solution was pretty straight forward. I converted the dump from UTF8 to ISO-8859-2 and I got real UTF8 again. I imported the converted dump and changed for good config files to "SET NAMES 'utf8' COLLATE 'utf8_general_ci".
iconv to the rescue:

mysqldump -u root -p db > weirdo-dump.sql
iconv -f UTF-8 -t ISO_8859-2//TRANSLIT weirdo-dump.sql > dump-latin2-aka-realUTF8.sql
mysql -u root -p db < dump-latin2-aka-realUTF8.sql

CEO/CTO at Nivas®
Neven Jacmenović has been passionately involved with computers since late 80s, the age of Atari and Commodore Amiga. As one of internet industry pioneers in Croatia, since 90s, he has been involved in making of many award winning, innovative and successful online project. He is experienced full stack web developer, analyst and system engineer. In his spare time, Neven is transforming retro-futuristic passion into various golang, Adobe Flash and JavaScript/WebGL projects.

    3 thoughts on “mysql utf8 latin2 utf8 weirdness”

  • This weirdness is definitely a bug which was also filed as S2 severity bug at MySQL but closed ( see ) and unfortunately we can see the same happening in MySQL 5 too.

    It seems to me that MySQL is not paying much attention to the multibyte support.

  • The good thing is – I didn’t loose any high bit data. My data got encoded twice, so Instead of 2 bytes some chars used 4. :)

    mysql4 was a nightmare tough. we are waiting for 5.1 and ndb engine.

  • Your post saved my ass. THX ;)

  • Leave a Reply

    Your email address will not be published. Required fields are marked *