Shamed and inspired by Joel Spolsky’s decade-old article on character encoding, and warned by another blog post (which I can’t find) that described a three-day ordeal fighting the issue, I cleared some time and sat down to make sure all the pages on my site were using UTF-8, start to finish.

It wasn’t nearly as bad as I thought it would be. In fact, it probably took one or two hours. Here is what I did:

  • For MySQL: make a backup, edit the backup file to replace “latin1” with “utf8”, then restore from the edited file.
  • For browsers: add the UTF-8 meta tag (below) inside my head tag. Luckily, all of my pages use the same template for drawing headers, so this was only a single change for the website, and a single change for my admin pages.
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  • For static text: change my code editor (Eclipse) to use UTF-8, and re-save some files that had hard-coded em-dashes. (I should have used HTML entities for those anyway.) The setting is under Window > Preferences > General > Workspace > Text file encoding.
  • For PHP/MySQL: this was the least obvious change. You have to specify the character set for the connection between PHP and MySQL. If you have a newer version of PHP (>= 5.2.3), you can use mysql_set_charset. Otherwise, you need to send a SQL command directly:
    // for PHP >= 5.2.3
    mysql_set_charset('utf8');
    // for older versions (see also MySQL documentation for SET NAMES):
    mysql_query('SET NAMES "utf8"');
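
The MySQL step at the top of the list (backup, edit, restore) can be sketched on the command line roughly as follows. This is a minimal sketch, not the exact commands I ran: the database name "mysite", the user "me", the file names, and the convert_dump helper are all hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only. "mysite", "me", and the file names are hypothetical placeholders.

# Replace every occurrence of "latin1" in a dump file with "utf8",
# writing the result to a new file so the original backup stays intact.
# Caveat: this is a blanket text replace, so it would also change the
# word "latin1" if it happens to appear inside row data.
convert_dump() {
    sed 's/latin1/utf8/g' "$1" > "$2"
}

# Typical round trip, run by hand:
#   mysqldump -u me -p mysite > backup.sql
#   convert_dump backup.sql backup_utf8.sql
#   mysql -u me -p mysite < backup_utf8.sql
```

Writing the edited dump to a second file keeps the untouched backup around in case the restore goes wrong.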

I don’t believe how easy it was – and I mean that. I expect to get burned somewhere along the way. But for now, it looks good and was pretty painless.

Since then I have found broken special characters here and there, but I’ve been able to find them and fix them as I go. Let’s hear it for incremental improvements!


