“Imamo Hrvatsku!” – MySQL patch which implements full Croatian ordering in utf8_croatian_ci and ucs2_croatian_ci collations

Great news, great news indeed. Couple of months ago, I started an open initiative to finally add support to MySQL for proper ordering using Croatian alphabet. We tried doing it on our own, but we needed to rewrite MySQL's Unicode Collation Algorithm, and for that we really needed help from MySQL development team. How we managed to get it? Using good old "Balkan way" - the schnapps aka. rakija black vodka. :)

My mate who was working with me on our initial implementation - Ante 'Ivoks' Karamatić (Chief executive at Init) got drunk with Kurt von Finck (Chief Community and Communications Officer for Monty Program Ab) in Dallas last week, who passed a good word to Michael ("Monty") Widenius (MySQL's original author and co-founder of MySQL AB) to listen our cries for help. Monty convinced Alexander Barkov (Lead software developer at Sun Microsystems working on MySQL) to give us little help on whole Croatian ordering issue. As a result, utf8_croatian_ci and ucs2_croatian_ci collations were created and added to MySQL 6.

After a pleasant chat with Monty and Bar, they were good enough to help us with a MySQL 5.1 patch which implements full Croatian ordering in utf8_croatian_ci and ucs2_croatian_ci collations. Woohoo! :)

But the bad news is that it will take fair amount of time before MySQL-5.6 (or 6.0 for that matter) will go GA, so one have to wait before it will be possible to download a production version of MySQL with "real" Croatian support.

If you really need Croatian support, you can try patching MySQL server as we did.

More details about the patch can be found here:

Since Alexander Barkov was so kind and provided a patch for MySQL 5.1, Ante created packages for Ubuntu. He also slightly (needs further testing) modified that patch so it works with MySQL 5.0. If you need this feature, go add this PPA to your sources.list: https://edge.launchpad.net/~ivoks/+archive/mysql-hr/.

After you apply the patch, you can try it out using my test database dump. If everything went ok "use croatian; SET NAMES 'utf8' COLLATE 'utf8_croatian_ci'; select rijec from test_croatian order by rijec;", should produce output like this (switch browser view to utf8).

Any feedback from the Croatian MySQL community is greatly welcomed. Please write your comments to <Alexander.Barkov[at]Sun.COM>. Thanks!

Proof of conecpet:

mysql> select version();
| version()       |
| 5.0.51a-hr1-log | 
mysql> use croatian; SET NAMES 'utf8' COLLATE 'utf8_croatian_ci'; select rijec from test_croatian order by rijec;
|  rijec       |
| Aboriđin     |
| Aboriđini    |
| Ante         |
| Branimir     |
| Cipela       |
| Čazma        |
| Ćevapčići    |
| Džak         |
| džak         |
| Džamija      |
| džamija      |
| Đak          |
| đak          |
| Đevđelija    |
| Inat         |
| Init         |
| Inozemstvo   |
| Interes      |
| Injekcija    |
| Ipsilon      |
| Kutina       |
| Livno        |
| Lovor        |
| Ljubav       |
| Ljubljana    |
| Neven        |
| Nivas        |
| Nosorog      |
| Njivice      |
| Onomatopeja  |
| Šišmiš       |
| Zagreb       |
| Žaba         |

How to disable mysql fulltext stopwords?

Easy! :) A while ago, I ranted about fulltext stopwords in mysql.

To repeat - the stopword list is a list of most used words in english language. Common words such as “some”, "little", "let" or “then” are stopwords and do not match if present in the search string. Basically, fulltext searching for any of the stopwords would return (almost) all the entries, so MySQL ignores those words to reduce result pollution and for efficiency. If the word starts with stopword, that word is returned, however if your word IS stopword, you are out of luck. Here is a list of build in stopwords in MySQL 5.0/5.1. Yeah, well this is so very exciting, but whole universe DOES NOT speak and write just english.

Last time we had problems with stopwords, there was no way to disable them without recompiling the mysql engine from the source. On production servers where you need to stay away from recompiling custom builds for maximum compatibility that was no go.

Today we again encountered stopword problem. Client complained that he cannot find video from artist "Let 3" or artist "Little Boots" (Victoria Hesket). To my surprise, in no time - I've found out a prefectly legal and easy to apply solution. Just add this to your .cnf file, restart mysql engine and rebuild indexes:

ft_stopword_file = ""

I addition to that, if you haven't allready, lower the min word to 3 (ft_min_word_len=3) to be able to search for 3 letter words.

Happy searching! :)

You have an error in your SQL syntax … USING BTREE

If you see error similar to this one on your MySQL 5.0 while trying to import mysql dump exported on MySQL 5.1 (with mysqldump):
ERROR 1064 (42000) at line 417: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'USING BTREE

...you should upgrade to latest mysql. Bummer!

This is old and confirmed bug #25162. The problem was in incompatible syntax for key definition in CREATE TABLE. MySQL 5.0 supports only this syntax:
{INDEX|KEY} [index_name] [index_type] (index_col_name,...)
while MySQL 5.1 preferrs this one :
{INDEX|KEY} [index_name] (index_col_name,...) [index_type]
..which lead to incompatible dump.

We have this situation - workstations run MySQL v5.1.33 and our development and production servers are still on 5.0.51a-3ubuntu5.4 (Ubuntu LTS) version. We can't upgrade MySQL on servers that easily so I first tried updating my workstation to latest windows version 5.1.37. mysqldump which comes with it is v10.13 Distrib 5.1.37, and it produced exactly the same incompatible output. I even tried x86 and x64 versions - with same incompatible results. I submited a bug report to MySQL because it seems that this bug was fixed in Windows installer version but not in without-installer version which I used to upgrade my local installation (bug #46598).

This was going nowhere so I downgraded my workstation to MySQL 5.0.84. I was then able to do proper export from my workstation and import on our development server. I have no idea why mysqldump in 5.1 version didn't get --compatible=50 switch option. If somebody knows a workaround for this, do share!

If you cant' upgrade, try exporting from 5.1 with --compatible=mysql40 switch but doing so will heavily downgrade your dump so beware.

MySQL feature request/proposition: Croatian utf8 collation (utf8_croatian_ci)

We use MySQL database for pretty much everything now days. It's de-facto standard for horizontally scaled web sites and it's used by biggest players in the industry. But one thing that is lacking, and which is very important for our regional market is proper Croatian collation support for utf8 charsets. Without it, MySQL server can't be considered choice for eg. government migration to open-source platform in near future.

We tried implementing it on our own for couple of times, but without any luck. The problem lies in fact that Croatian language (Serbian and Bosnian too) have digraph characters (single characters consisted of two characters - lj, nj and dž). And without proper support for those, we will never be able to sort things right (a-b-c-č-ć-d-dž-đ-...i-j-k-l-lj-m-n-nj-...u-v-z-ž)

What does it take to implement Croatian utf8 collation? It takes modifying source code beyond our knowledge (we tried creating new collation with Vietnamese as a base for digraphs as a pair of basic latin letter + accented Latin letter).

AFAIK the countries which would benefit from the same implementation (alongside Croatia) are: Bosnia, Serbia (for latin charset) and Monte Negro (for latin charset). So please, if you can - spread the word! I think that support for this would be appreciated by thousands of MySQL developers in our region who are now forced to use hacks from '90 to get correct sort order. :)

I've submitted S4 feature request to MySQL - http://bugs.mysql.com/44523 and
I've posted a feature request/proposition on official MySQL dev forum, so we will see what happens. It certainly wouldn't harm if you would sign in to bugs.mysql.com and MySQL dev forum and reply to my feature request and topic with "Yes please" or something similar. It's free, and it can make difference. :)

Oracle buys Sun for $7.4 billion! Hello scott/tiger!


On 26 February 2008, Sun acquired MySQL AB. Today, Oracle Corporation announced acquisition of Sun Microsystems. All that acquiring is nice, but what's in it for us - MySQL developers? Well, we will have to wait and see. I wouldn't have anything against switching to Oracle. :) Oracle was my first real database I used back in '90s (Oracle 7), and after reading 2 meters of Oracle books, I was pretty happy with PL/SQL, triggers, views and stored procedures. Switching to MySQL was hard. :)

What is certain - Oracle does not plan to shutdown MySQL. According to Oracle FAQ about the acquisition: "MySQL will be an addition to Oracle’s existing suite of database products, which already includes Oracle Database 11g, TimesTen, Berkeley DB open source database, and the open source transactional storage engine, InnoDB,".

What do you think will happen to MySQL?

Online tickets for U2 concert in Zagreb = fail

U2 is (aside from Lepa Brena) - the most anticipated concert of the year in Croatia. Marketing campaign was pretty strong (so I guess organizers invested some money) and commercials strongly advertised online ticket sale which started just couple of minutes ago (at 00:01AM).


Unfortunately for U2 fans, online ticket sales maybe did start, but they sure did end pretty fast. But not because web shop run out of tickets, but because web shop and organizers of the concert ran out of professionalism.

Look guys, if you decided to run so strong advertising campaign across couple of countries which loudly and clearly says: "Ticket sales start online on Friday at 00:01", I suppose one who does all that, would also build a website which can sustain such number of customers?

Well, I guess not.

Exactly the same thing happened some time ago when Croatian Football Association launched online reservation of world cup tickets (if I can recall correctly). I am not big sports fan but many of my friends are and they were very pissed.

Servers are so dirty cheap, and downtime costs a lot of money (nerves, reputation...), Especially if downtime happens a minute you launch your service. My personal preference/advice regarding server configuration in time of special events or service launch is - tend to over engineer your network! EC2, EBS, S3 (Amazon Web services) and other cloud computing solutions do their job pretty good in situations like this.

Otherwise you get this beauty...

Centrala.hr – edited by human not a robot

We just released your new homepage - Centrala.hr (eng. plānt - A building or group of buildings for the manufacture of a product; a factory.). :) Centrala is a human powered news aggregation site, where editors hand-pick best information from across the region just for you. In the avalanche of completely automatized rss aggregation news sites in Croatian cyberspace, we think a human touch can make a significant difference. What do you think?


Thanks to everybody for such a nice feedback:

Letter Đ mistery – Ð!=Đ and ð != đ

We spent couple of hours trying to figure out this one, so we decided to share it with our fellow readers and hopefully, save some time and unnecessary stress for them. UTF-8 is defacto standard encoding for multilanguage websites. PHP5 doesn't have native support for it, but you can integrate it. MYSQL supports utf8, but not utf8_croatian_collation. But you can also hack your way around that for sorting. If you are not dealing with Croatian letters and UTF-8, you probably won't encounter this problem. However, if you ARE using UTF-8 read on!

We are redesigning one large website and naturally, some content from old website has to be migrated to new site. Since we don't have access to their database, we have to manually copy content from old website's HTML source. Client's old site is in UTF-8. Imagine following entry in old HTML source:


This is a select tag with list of cities in Croatia, the city in example is Đakovo. So as I said, we are copy-ing large amounts of data from old HTML's, and we do not care why old site displays this city as "&#208;akovo" and not as "Đakovo". After we pasted "Ðakovo" into our engine, we momentarily started to get all sorts of different errors. The most bizare one was comming from our utf8_strtolower function which had to return string "đakovo". Instead it was returning string "ðakovo". Wicked!

After some digging, we found out that there was nothing wrong with our utf8 php methods nor our mysql database. We found out that the letter Ð (hex d0 00) we pasted from HTML was just not the letter Đ (hex 10 01) we needed. Somebody (or something) very brilliant, used letter Eth instead of standard Croatian letter D with stroke.

Eth (Ð, ð; also spelled edh or eð) is a letter used in Old English, Icelandic, Faroese (in which it is called edd). In the Unicode universal character encoding standard, upper and lower case eth are represented by U+00D0 and U+00F0. These code points are inherited from the older ISO 8859-1 standard. In HTML, eth is represented by the Latin character entities Ð and ð.

Đ (lowercase đ) is a letter of the Latin alphabet, formed from D with the addition of a bar or stroke through the letter. This is the same modification that was used to create eth (ð), but eth is based on an insular variant of d while đ is based on its usual upright shape. In Unicode the letter is represented as U+0110 LATIN CAPITAL LETTER D WITH STROKE and U+0111 LATIN SMALL LETTER D WITH STROKE.

So there you have it. No matter the letters look the same, beware what you copy/paste. :)


MySQL 5.0 introduced NDB Cluster, a storage engine which enables running several MySQL servers in a cluster. It uses high-availability, high-redundancy version of MySQL adapted for the distributed computing environment. This all sounds really nice, and in theory, depending on your application and scalability planning, it could spare you MANY hours of developing (squeezing) database logic inside your application logic which then gets highly dependable on master/slave (active-pasive) configuration. And of course best reason of them all - instant failover. So I decided to give it a try. I used MySQL 5.0.45-0, because 5.1 sounded too much bleeding edge.

MySQL Cluster is currently not supported on Microsoft Windows, you'll have to install it somewhere else. Once you have it running, it's time to import some data. Database schema of a my test dump mostly used MyIsam storage engine and few InnoDB tables. Before the import, change Engine=NDB in the dump and be aware, import takes ages even on a really fast servers (200 kb dump = cca 1 min ??).

My best guess is, you will see a lot of HY00 errors when you start your import. Be patient, check one by one. Most of them are caused by non supported features of NDB but can be avoided.

Here is a brief coverage of reasons for the problems which I encountered:

  • NDB does not support foreign keys or FULLTEXT indexes or indexes on text columns (indexes on char/varchar are ok).
  • Attribute names are automatically truncated to 31 characters. Database names and table names can total a maximum of 122 characters - the maximum length for an NDB table name is 122 characters, less the number of characters in the name of the database of which that table is a part.
  • The maximum number columns and indexes per table is limited to 128.
  • Temporary tables are not supported.
  • Every table using the NDBCluster storage engine requires a primary key; if no primary key is defined by the user, then a “hidden” primary key will be created by NDB. This hidden primary key consumes 31-35 bytes per table record.
  • The maximum permitted size of any one row is 8KB. Note that each BLOB or TEXT column adds 256 + 8 = 264 bytes towards this total.

The last error is really nasty one and will cause you to rethink your database schema and make a lot of changes to your application):

ERROR 1118 (42000) at line 794: Row size too large. The maximum row size for the used table type, not counting BLOBs, is 8052. You have to change some columns to TEXT or BLOBs

Some help:

  • The lack of foreign keys, can be resolved by using triggers (Enforcing Foreign Keys Programmatically in MySQL).
  • You need a lot of RAM in your severs because all data is in the RAM, unless you are using the 'DATA ON DISK' feature of MySQL 5.1.x. In that case you can have the non-indexed data on disk.
  • When calculating Cluster memory requirements, very useful is ndb_size.pl utility. This Perl script connects to a current MySQL (non-Cluster) database and creates a report on how much space that database would require if it used the NDBCluster storage engine. When you run the script, compare the suggested parameter values reported for your database to the default values for each shown here. If the default is higher than the recommended value from ndb_size.pl do not adjust the value. If the recommended vales are lower than the defaults use slightly larger numbers.

NDBCluster is something I'll definitively have my four eyes on and watch it grow. Bugtracker on mysql is crawling with bugs, but this is something to be expected. I've tested 5.0, and there is a lot of stuff allready fixed in 5.1. My biggest disappointment was the fact that new nodes can't be added to the cluster without shutting it down first. My conclusion is, that It is not yet ready for prime time. When it will be? when 5.1 is released into production I guess. Until then... We are on our own. :)

mysql utf8 latin2 utf8 weirdness

This one is a keeper, have to write it down somewhere in case I forget it by tomorrow morning. :) Today I was doing routine deploy of one of our new websites to production server. Everything went smoothly, except the fact I couldn't see any of Croatian letters on the site texts (which were stored in mysql)!! We did migration to UTF8 some time ago, and this was highly unexpected. So I started to dig into the dump. :)

Looking at the mysqldump of a database I was a bit surprised to see Croatian letters screwed up beyond recognition. For example letter "š" was represented as "ÄąÄO„" (4 bytes), "č" was "čO" (two bytes)... etc. OMG! All our tables are "CHARSET=utf8", we force "AddDefaultCharset utf-8" in Apache and all of our markup uses "charset=utf-8" for content type encoding. BUT, we unintentionally left one very bad call in our config file which got triggered if site was running on development server - "SET NAMES 'latin2' COLLATE 'latin2_croatian_ci". The production config had SET NAMES utf8, but development didn't. Oh boy.

Ok so now I had a real mess - utf8 tables with utf8 data in them stored as re-encoded latin2 but in utf8!? I tried converting the dump from ISO8859-2 to UTF8 but that just made things worse, some of characters now used 6 bytes which isn't good. The solution was pretty straight forward. I converted the dump from UTF8 to ISO-8859-2 and I got real UTF8 again. I imported the converted dump and changed for good config files to "SET NAMES 'utf8' COLLATE 'utf8_general_ci".
iconv to the rescue:

mysqldump -u root -p db > weirdo-dump.sql
iconv -f UTF-8 -t ISO_8859-2//TRANSLIT weirdo-dump.sql > dump-latin2-aka-realUTF8.sql
mysql -u root -p db < dump-latin2-aka-realUTF8.sql