We spent couple of hours trying to figure out this one, so we decided to share it with our fellow readers and hopefully, save some time and unnecessary stress for them. UTF-8 is defacto standard encoding for multilanguage websites. PHP5 doesn’t have native support for it, but you can integrate it. MYSQL supports utf8, but not utf8_croatian_collation. But you can also hack your way around that for sorting. If you are not dealing with Croatian letters and UTF-8, you probably won’t encounter this problem. However, if you ARE using UTF-8 read on!
We are redesigning one large website and naturally, some content from old website has to be migrated to new site. Since we don’t have access to their database, we have to manually copy content from old website’s HTML source. Client’s old site is in UTF-8. Imagine following entry in old HTML source:
[ftf w=”480″ h=”105″][/ftf]
This is a select tag with list of cities in Croatia, the city in example is Đakovo. So as I said, we are copy-ing large amounts of data from old HTML’s, and we do not care why old site displays this city as “Ðakovo” and not as “Đakovo”. After we pasted “Ðakovo” into our engine, we momentarily started to get all sorts of different errors. The most bizare one was comming from our utf8_strtolower function which had to return string “đakovo”. Instead it was returning string “ðakovo”. Wicked!
After some digging, we found out that there was nothing wrong with our utf8 php methods nor our mysql database. We found out that the letter Ð (hex d0 00) we pasted from HTML was just not the letter Đ (hex 10 01) we needed. Somebody (or something) very brilliant, used letter Eth instead of standard Croatian letter D with stroke.
Eth (Ð, ð; also spelled edh or eð) is a letter used in Old English, Icelandic, Faroese (in which it is called edd). In the Unicode universal character encoding standard, upper and lower case eth are represented by U+00D0 and U+00F0. These code points are inherited from the older ISO 8859-1 standard. In HTML, eth is represented by the Latin character entities Ð and ð.
Đ (lowercase đ) is a letter of the Latin alphabet, formed from D with the addition of a bar or stroke through the letter. This is the same modification that was used to create eth (ð), but eth is based on an insular variant of d while đ is based on its usual upright shape. In Unicode the letter is represented as U+0110 LATIN CAPITAL LETTER D WITH STROKE and U+0111 LATIN SMALL LETTER D WITH STROKE.
So there you have it. No matter the letters look the same, beware what you copy/paste. :)