Digital content doesn’t always look the same when displayed in different languages. Without proper encoding, characters can appear garbled, leading to confusion, data corruption, or even site failure in global markets.
Whether you're building a multilingual app or localizing your website, character encoding is one of the processes that helps you deliver accurate text and seamless user experiences. Understanding its importance and different formats—like ASCII and Unicode encoding—lets you easily translate and display your messaging for multilingual audiences.
In this guide, we’ll explain what character encoding is, key standards in use today, and why encoding is crucial for internationalization.
What is character encoding, and how does it affect successful website localization?
Character encoding is the process of mapping characters to numeric values so machines can process and display text across websites, apps, and operating systems. Since computers operate on binary digits (0s and 1s), character encoding provides a way to translate text from different languages into a digital format that machines can interpret.
Here’s a basic encoding example: In one encoding standard, “A” might be represented by the binary sequence “01000001.” The computer reads this binary data and the encoding standard translates it back to the corresponding human-readable character: “A.”
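To make this concrete, here's a minimal Python sketch of that round trip, using the built-in ord, format, and chr functions:

```python
# Map the character "A" to its numeric code and back.
code = ord("A")                # 65
binary = format(code, "08b")   # "01000001"
print(code, binary)            # 65 01000001
print(chr(int(binary, 2)))     # "A": decoding reverses the mapping
```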
Character encoding plays a critical role in localization for global companies because it lets them accurately display a wide range of characters from various writing systems, including Chinese, Arabic, and Cyrillic. The result is faster, more accurate translation.
Businesses can also use Smartling’s Global Delivery Network (GDN), a translation web proxy, to quickly launch translations for websites and apps. Once set up, content can flow smoothly from your website into Smartling for translation. Smartling even provides a menu of special characters for different languages and formatting options that you can easily insert into your translations.
What are the main types of character encoding?
Each character encoding format supports different languages and technical needs for content localization.
| Encoding type | Key characteristics | Supported languages | Use cases |
| --- | --- | --- | --- |
| ASCII | 7-bit; represents 128 characters, including English letters, numerals, and basic symbols | English | Early computing, simple text files, legacy systems |
| ISO-8859-1 | 8-bit; represents Western European characters | Western European languages, such as German, French, and Spanish | Legacy web content, international documents |
| UTF-8 | Variable-length (1-4 bytes); backward compatible with ASCII; handles all Unicode characters; no BOM required | Almost all languages | Web content, modern applications, cross-platform data exchange |
| UTF-16 | One or two 16-bit code units | Almost all languages, plus special characters like emojis | Web and international content, modern text files with special characters |
| Windows-1252 | 8-bit; includes Western European characters; superset of ASCII | Western European languages, such as English, French, and Spanish | Legacy Windows-based applications |
Here’s a closer look at the most common types of character encoding, including how they work and examples of how they’re used across localized apps and websites.
ASCII
The American Standard Code for Information Interchange (ASCII) is one of the earliest and simplest character encoding systems. ASCII uses 7 bits to encode 128 characters, covering basic English letters, numerals, and punctuation marks.
With ASCII, each character is mapped to a unique numeric code. For example, the capital letter “A” is represented as 65, and the exclamation mark “!” is 33. In binary, “A” becomes “01000001.” ASCII is foundational for modern encoding schemes but is limited to English and lacks support for accented or non-Latin characters.
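For illustration, a short Python sketch shows both the mapping and the limitation:

```python
# ASCII maps each character to a code from 0 to 127.
print(ord("A"), ord("!"))      # 65 33
print("Hi!".encode("ascii"))   # b'Hi!'

# Anything outside those 128 characters fails to encode.
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\xe9' ...
```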
ISO-8859-1
ISO-8859-1 is an 8-bit single-byte character encoding format developed by the International Organization for Standardization (ISO) to extend ASCII’s capabilities to Western European languages.
ISO-8859-1 encodes 256 characters, including additional letters with accents like “é” and “ö”, and a host of special punctuation and symbol characters. For example, the character “é” is represented by the code 233, and “ñ” is 241. However, ISO-8859-1 can’t encode characters outside the Western European set, limiting its use for international applications.
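A quick Python sketch (using "latin-1", Python's codec name for ISO-8859-1) illustrates both the mappings above and the format's limits:

```python
# "latin-1" is Python's name for ISO-8859-1: one byte per character.
print("é".encode("latin-1")[0])   # 233
print("ñ".encode("latin-1")[0])   # 241

# Characters outside the Western European set can't be encoded.
try:
    "€".encode("latin-1")   # the euro sign is not in ISO-8859-1
except UnicodeEncodeError as err:
    print(err)
```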
UTF-8
Unicode Transformation Format – 8 bit (UTF-8) is a variable-length character encoding format designed for efficient, global language compatibility. With UTF-8, all ASCII characters are encoded using one byte and maintain the same binary values. Non-ASCII characters, such as “€”, use multi-byte sequences, like “11100010 10000010 10101100” for “€” in binary.
UTF-8 can encode every character in the Unicode standard for seamless multilingual support. It’s become the dominant encoding format for the web and modern applications due to its ability to handle vast language sets.
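Here's a brief Python sketch of UTF-8's variable-length behavior, including the "€" byte sequence mentioned above:

```python
# ASCII characters keep their single-byte values in UTF-8.
print("A".encode("utf-8"))    # b'A' (1 byte)
print("€".encode("utf-8"))    # b'\xe2\x82\xac' (3 bytes)
print("😊".encode("utf-8"))   # b'\xf0\x9f\x98\x8a' (4 bytes)

# The euro sign's bit pattern: 11100010 10000010 10101100
print(" ".join(format(b, "08b") for b in "€".encode("utf-8")))
```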
UTF-16
UTF-16 represents each character in the Unicode standard using one or two 16-bit code units. Characters in the Basic Multilingual Plane, which covers most modern scripts, are stored as a single 16-bit unit; other characters (like emojis and rare historical scripts) appear as surrogate pairs, which are two 16-bit code units combined.
For example, “A” (U+0041) in UTF-16 is 0x0041, while the smiling face emoji “😊” (U+1F60A) is stored as the surrogate pair 0xD83D 0xDE0A. This allows UTF-16 to encode over a million different characters, though it uses more space than UTF-8 for some text.
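A short Python sketch shows both cases (using the big-endian "utf-16-be" codec, which adds no byte order mark):

```python
# "A" fits in one 16-bit code unit; the emoji needs a surrogate pair.
print("A".encode("utf-16-be").hex())    # 0041
print("😊".encode("utf-16-be").hex())   # d83dde0a

# Two bytes per code unit: one unit for "A", two for the emoji.
print(len("A".encode("utf-16-be")) // 2)    # 1
print(len("😊".encode("utf-16-be")) // 2)   # 2
```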
Windows-1252
Windows-1252 is an 8-bit character encoding type developed by Microsoft for Western European languages, commonly called “ANSI” encoding in Windows environments. It’s similar to ISO-8859-1 but includes extra characters like curly quotes (“ ”) and the Euro sign (€).
Windows-1252 was widely used in legacy Windows documents and applications but has largely been replaced by UTF-8, especially for international compatibility.
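For illustration, this Python sketch (using "cp1252", Python's name for Windows-1252) shows characters that Windows-1252 supports but ISO-8859-1 doesn't:

```python
# "cp1252" is Python's name for Windows-1252.
print("€".encode("cp1252"))               # b'\x80'
print("\u201cHi\u201d".encode("cp1252"))  # b'\x93Hi\x94' (curly quotes)

# The same euro sign fails in ISO-8859-1.
try:
    "€".encode("latin-1")
except UnicodeEncodeError as err:
    print(err)
```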
Character encoding in a multilingual world: 5 common challenges and best practices
Character encoding is technical and nuanced, posing certain challenges for businesses trying to reach multilingual audiences. As you localize content at scale, be aware of these common encoding issues and best practices for addressing them:
1. Garbled text
When software interprets text using the wrong character encoding, it can result in garbled, unreadable characters like question marks or random symbols. This usually happens when text encoded in one format (like UTF-8) is opened or rendered using a different encoding (like ISO-8859-1), leading to mismatched byte interpretations. Garbled text creates confusion for users and can diminish a site’s credibility.
Best practice: Use one encoding consistently across your content, and declare it explicitly (for example, with <meta charset="UTF-8"> in your HTML) so browsers process text as intended.
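You can reproduce this mismatch in a few lines of Python: encode text as UTF-8, then decode it with the wrong encoding:

```python
# "café" encoded as UTF-8, then decoded with the wrong encoding.
data = "café".encode("utf-8")   # b'caf\xc3\xa9'
print(data.decode("latin-1"))   # cafÃ© <- garbled (mojibake)
print(data.decode("utf-8"))     # café  <- correct decoding
```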
2. Data loss during conversion
When converting content between different encodings—such as from Windows-1252 to UTF-8—some characters may be lost or incorrectly substituted. This can severely impact the readability of multilingual and localized content, and if you don’t catch it early, it could lead to data corruption.
Best practice: Use Unicode-compatible translation tools that streamline conversion between encoding formats. Also, always back up data before performing encoding changes and test results for accuracy.
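Here's a minimal Python sketch of both the failure mode and a round-trip check; the sample string is hypothetical:

```python
# Converting to a narrower encoding can silently replace characters.
text = "Temperature: 25 ℃"   # U+2103 is not in Windows-1252
converted = text.encode("cp1252", errors="replace")
print(converted)              # b'Temperature: 25 ?'

# Round-trip check: verify nothing was lost before shipping the result.
if converted.decode("cp1252") != text:
    print("Warning: characters were lost during conversion")
```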
3. Lack of multilingual font or glyph support
Even with the correct character encoding format, text may appear as blank squares if the necessary fonts or glyphs aren’t available on the user’s device. This issue is especially common when using accented letters or emojis that older or basic font sets might not support.
Best practice: Use web-safe fonts that support a wide range of characters across languages. Define a font stack with fallback fonts so the system can automatically substitute a typeface that contains any missing glyphs.
4. Bidirectional text
Languages such as Arabic and Hebrew use a right-to-left (RTL) writing system, which can lead to layout and alignment issues—especially when mixed with left-to-right (LTR) scripts like English. If these languages aren’t properly encoded, they can disrupt the reading flow or break formatting altogether.
Best practice: Apply the correct language and direction markup, such as dir="rtl" and lang="ar" in HTML, to help browsers display the text accurately. Choose frameworks that support RTL layouts and test content to make sure it displays correctly for all audiences.
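As a supplement to proper markup, here's a small Python sketch using the standard library's unicodedata module to detect RTL characters; the has_rtl helper is a naive illustration, not a full bidirectional algorithm:

```python
import unicodedata

# Bidirectional classes: "L" = left-to-right, "AL" = Arabic letter (RTL).
print(unicodedata.bidirectional("A"))   # L
print(unicodedata.bidirectional("م"))   # AL

def has_rtl(text: str) -> bool:
    """Naive check for right-to-left characters in a string."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)

print(has_rtl("Hello"))   # False
print(has_rtl("مرحبا"))   # True
```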
5. Byte Order Mark issues
A Byte Order Mark (BOM) is a special marker at the beginning of a text file that signifies byte order and encoding. BOMs can cause problems such as:
- Unexpected invisible characters at the start of files (e.g., "" when a UTF-8 BOM is misread as Windows-1252)
- Incompatibility with some systems, like older web browsers
- Interoperability problems between applications
Best practice: When processing text files, use tools that can auto-detect and handle BOMs appropriately. For web content, omit the BOM in UTF-8 to avoid display issues.
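A short Python sketch shows a BOM in practice: the "utf-8-sig" codec strips it on read, while plain "utf-8" does not:

```python
# The UTF-8 BOM is the byte sequence EF BB BF.
data = b"\xef\xbb\xbfHello"

print(data.decode("utf-8"))       # '\ufeffHello' (BOM kept as a character)
print(data.decode("utf-8-sig"))   # 'Hello' (BOM stripped automatically)

# When writing web content, encode with plain "utf-8", which adds no BOM.
print("Hello".encode("utf-8"))    # b'Hello'
```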
Make encoding easy with Smartling’s localization tools
Character encoding is often invisible to users, but it’s vital to how they experience content across devices and languages—from localized apps to multilingual websites. By understanding the basics of encoding, translation and development teams can prevent costly errors and deliver a smoother experience for global audiences.
Smartling is built to help businesses grow globally while simplifying translation for developers with a powerful web proxy and robust API. Smartling integrates directly with your existing tech stacks, eliminating manual tasks and streamlining the translation process. In fact, Smartling lets you add new languages and trigger translation workflows on every code push, so you can expand internationally without missing a beat.
British Airways, for example, leveraged Smartling’s web translation proxy to localize its website for Korean customers. With Smartling’s centralized platform, British Airways was able to translate more than 500,000 words and launch twice as fast as usual. Read the case study to see how they made it happen.