This one is a keeper, have to write it down somewhere in case I forget it by tomorrow morning. :) Today I was doing routine deploy of one of our new websites to production server. Everything went smoothly, except the fact I couldn't see any of Croatian letters on the site texts (which were stored in mysql)!! We did migration to UTF8 some time ago, and this was highly unexpected. So I started to dig into the dump. :)
Looking at the mysqldump of a database I was a bit surprised to see Croatian letters screwed up beyond recognition. For example letter "š" was represented as "ÄąÄO„" (4 bytes), "č" was "ÄO" (two bytes)... etc. OMG! All our tables are "CHARSET=utf8", we force "AddDefaultCharset utf-8" in Apache and all of our markup uses "charset=utf-8" for content type encoding. BUT, we unintentionally left one very bad call in our config file which got triggered if site was running on development server - "SET NAMES 'latin2' COLLATE 'latin2_croatian_ci". The production config had SET NAMES utf8, but development didn't. Oh boy.
Ok so now I had a real mess - utf8 tables with utf8 data in them stored as re-encoded latin2 but in utf8!? I tried converting the dump from ISO8859-2 to UTF8 but that just made things worse, some of characters now used 6 bytes which isn't good. The solution was pretty straight forward. I converted the dump from UTF8 to ISO-8859-2 and I got real UTF8 again. I imported the converted dump and changed for good config files to "SET NAMES 'utf8' COLLATE 'utf8_general_ci".
iconv to the rescue:
mysqldump -u root -p db > weirdo-dump.sql iconv -f UTF-8 -t ISO_8859-2//TRANSLIT weirdo-dump.sql > dump-latin2-aka-realUTF8.sql mysql -u root -p db < dump-latin2-aka-realUTF8.sql