Shamed and inspired by Joel Spolsky’s decade-old article on character encoding, and warned by another blog post (which I can’t find) that described a three-day ordeal fighting the issue, I cleared some time and sat down to make sure all the pages on my site were using UTF-8, start to finish.
It wasn’t nearly as bad as I thought it would be. In fact, it probably took one or two hours. Here is what I did:
- For MySQL: make a backup, edit the backup file to replace “latin1” with “utf8”, then restore it.
- For browsers: add the UTF-8 meta tag (below) inside the head element. Luckily, all of my pages use the same template for drawing headers, so this was only a single change for the web site, and a single change for my admin pages.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
- For static text: change my code editor (Eclipse) to use UTF-8, and re-save some files that had hard-coded em-dashes. (I should have used HTML entities for those anyway.) The setting is under Window > Preferences > General > Workspace > Text file encoding.
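- For PHP/MySQL: this was the least obvious change. You have to specify the character set for the connection between PHP and MySQL; otherwise the two sides can disagree about the encoding of the text they exchange, even when the tables and the pages are both UTF-8. (The mysql_* functions used here were removed in PHP 7.0; the mysqli extension offers mysqli_set_charset for the same purpose.) If you have a newer version of PHP (>= 5.2.3), you can use mysql_set_charset. Otherwise, you need to send a SQL command directly: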
// for PHP >= 5.2.3
mysql_set_charset('utf8', $cn);

// for older versions (see also MySQL documentation for SET NAMES):
mysql_query('SET NAMES "utf8"');
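Going back to the first item in the list: a blind search-and-replace of “latin1” in the backup file would also touch any row data that happens to contain that word. A minimal sketch of a narrower replace, assuming the backup is a plain mysqldump-style SQL file, could look like this. The function name `latin1_to_utf8_declarations` and the choice of `utf8_general_ci` as the new collation are my own assumptions for illustration, not something taken from the steps above.

```php
<?php
// Hypothetical helper: rewrite only the charset/collation declarations in a
// SQL dump, leaving other occurrences of the word "latin1" alone.
function latin1_to_utf8_declarations(string $sql): string
{
    $rules = [
        // Table-level: ... DEFAULT CHARSET=latin1
        '/\bCHARSET=latin1\b/i' => 'CHARSET=utf8',
        // Column-level: ... CHARACTER SET latin1
        '/\bCHARACTER SET latin1\b/i' => 'CHARACTER SET utf8',
        // Any latin1 collation (assumed replacement: utf8_general_ci)
        '/\bCOLLATE(=|\s+)latin1_\w+/i' => 'COLLATE${1}utf8_general_ci',
    ];
    return preg_replace(array_keys($rules), array_values($rules), $sql);
}

// Small example; the table definition is made up for the demonstration.
$dump = "CREATE TABLE posts (\n"
      . "  title varchar(100) CHARACTER SET latin1 COLLATE latin1_swedish_ci\n"
      . ") ENGINE=MyISAM DEFAULT CHARSET=latin1;\n"
      . "INSERT INTO posts VALUES ('notes on latin1');\n";

echo latin1_to_utf8_declarations($dump);
```

In the example, the two declarations are rewritten while the word inside the INSERT statement is left as it was. This only reduces the risk; a row whose text literally contains something like “CHARSET=latin1” would still be changed, so it is worth diffing the edited file against the original before restoring.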
I don’t believe how easy it was – and I mean that. I expect to get burned somewhere along the way. But for now, it looks good and was pretty painless.
Since then I have found broken special characters here and there, but I’ve been able to find them and fix them as I go. Let’s hear it for incremental improvements!
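For hunting down the stragglers, one option is to scan text for bytes that are not valid UTF-8, which is what a file or field still saved in the old encoding tends to contain. This is only a sketch under a few assumptions: the mbstring extension is available, and the helper name `find_invalid_utf8_lines` is hypothetical rather than anything from my site's code.

```php
<?php
// Hypothetical helper: return the 1-based numbers of the lines in $text
// that are not valid UTF-8 (requires the mbstring extension).
function find_invalid_utf8_lines(string $text): array
{
    $bad = [];
    foreach (explode("\n", $text) as $i => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $bad[] = $i + 1;
        }
    }
    return $bad;
}

// Line 2 holds an em dash as the single byte 0x97 (Windows-1252),
// line 3 holds the same character correctly encoded as UTF-8.
$sample = "plain ascii\nold \x97 dash\nnew \xE2\x80\x94 dash";

print_r(find_invalid_utf8_lines($sample));
```

Only line 2 is reported here. A check like this catches text that was never converted; it will not catch text that was converted twice, since double-encoded characters are still valid UTF-8 and just look wrong on the page.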