Enquire Now

Unicode’s massive growth: Will mangled characters soon be a thing of the past?

How long has it been since you last received an email that came out as absolute gibberish? I do not exactly remember myself, but it has been a long time. There used to be hundreds of different encodings for storing and indexing text on the internet. Since these encodings were different, they were not as such compatible with one another. “For example, on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs”, Joel Spolsky explains on his software blog.

Unicode was invented to solve this problem by encoding all human languages and making it universal as its name implies. So basically, according to Senior International Software Architect at Google, Mark Davis: “The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you are surfing the web.”

The latest figures from Google’s annual survey of the percentage of the webpages in their index that are in different encodings reveal another significant upward move for Unicode. It now accounts for more than 60% of all web encodings, which is indeed good news to the many of us who enjoy translation services that allow us to find information in almost any language. Subsequently, this helps content-providers such as marketers expand their potential audiences.

Unicode 60 Percent

*Your mileage may vary: these figures may vary somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example.

Google has long been using Unicode as the company’s internal format for all the text it searches and processes and will soon be updating to the newest Version 6.1 with over 110,000 individual characters. According to Mark Davis, the search giant’s unified index probably would not exist had it not been to Unicode. Or as he puts it himself: “(…) it’d be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult.”

The following two tabs change content below.

Immanuel Simonsen

Research Manager at Webcertain
As the Research Manager of multilingual web marketing agency Webcertain, Immanuel heads up the company’s global market research activities and large content marketing projects. He is the author of several reports and guides, including ‘The Essential Guide to Rel-Alternate-Hreflang’ and ‘The Webcertain Global Search and Social Report 2013’. Apart from being a tutor at the International Marketing School -- teaching online marketing professionals on business opportunities around the globe -- Immanuel is a regular speaker at the International Search Summit, a leading event series dedicated to multilingual search and social media marketing.

Leave a Reply