March 2 2009

Character Sets and Your Web Site (Simplified)

by Jordan Sandford

Every day, I work with a bunch of characters while I work on clients’ web sites. Happy characters, popular characters, important characters. The characters I’m talking about, though, are not people but digital representations of letters and numbers inside the computer. Yes, the sentence you are reading right now is made of letters, punctuation and spaces, but to a computer, they are all called characters.

In the early days of computers, there was a small set of characters that computers were designed to show. These included the 26 Latin/English upper-case letters, numbers and punctuation. Eventually, lower-case letters were thrown in as well. This basic set totaled 128 characters and included ones you couldn’t see (like a space, a tab, a line feed/enter). This was called the ASCII (pronounced: as-key) character set. It worked great for a while, particularly if you only needed to show English characters. But what if you wanted to show special symbols? If you wanted to show 150 unique symbols, the computer would not know what to do once you got past 128. For symbol #140, it would probably show its own interpretation of #140: a weird symbol that looked nothing like the one you had in mind. Or what if you wanted to use Greek letters (or any other non-Latin characters) in combination with Latin characters? To fix that, new character sets were created that included all the ASCII characters and added to them. Other character sets contained a completely different set of letters and symbols for other languages, for example, Cyrillic and Arabic.
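To make the "symbol #140" problem concrete, here is a minimal Python sketch. The helper name `describe_byte` and the three extended character sets I picked (Windows-1252, Windows-1251 and the old IBM PC set, cp437) are my own choices for illustration; the point is only that the same number maps to a different character in each set, and to nothing at all in plain ASCII.

```python
import unicodedata

def describe_byte(value, encoding):
    """Decode one byte value (0-255) with the given character set and
    return the official Unicode name of the character it maps to."""
    character = bytes([value]).decode(encoding)
    return unicodedata.name(character)

# Byte 65 is inside ASCII's 128 characters, so every set agrees on it.
for encoding in ("ascii", "cp1252", "cp1251", "cp437"):
    print(65, encoding, describe_byte(65, encoding))

# Byte 140 is past ASCII; each extended set assigns it a different symbol.
for encoding in ("cp1252", "cp1251", "cp437"):
    print(140, encoding, describe_byte(140, encoding))

# Plain ASCII has no character #140 at all, so decoding fails outright.
try:
    describe_byte(140, "ascii")
except UnicodeDecodeError:
    print(140, "ascii", "no such character")
```

Running it should show a Latin ligature, a Cyrillic letter and an accented Latin letter for the same byte, depending on which character set the computer was told to use.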

Herein lies a potential problem. When the computer is told to use a particular method to decode the binary data in order to find the expected matching character, it may run into binary data for which it can’t find the matching character you were expecting. This often shows up, for example, when Microsoft Word automatically converts straight double quotes to curly quotes and then someone copies that text into a web page editing program. If the web page is not set to display the correct set of characters, the curly quotes could be out of the range of characters the web browser can display and things like question marks in black diamonds may show instead. Or worse, a group of two or three symbols may show instead of your right or left curly quotes.
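Both symptoms described above can be reproduced in a few lines of Python. This is only a sketch: the function name `save_then_misread` is hypothetical, and I am assuming the two character sets involved are UTF-8 and Windows-1252, which is a common pairing but not the only one that causes this.

```python
def save_then_misread(text, saved_as, read_as):
    """Encode text with one character set, then decode the resulting bytes
    with another, replacing anything undecodable with U+FFFD."""
    return text.encode(saved_as).decode(read_as, errors="replace")

LEFT_CURLY_QUOTE = "\u201c"  # the opening curly quote Word produces

# Saved as UTF-8 but read as Windows-1252: one quote becomes three symbols.
three_symbols = save_then_misread(LEFT_CURLY_QUOTE, "utf-8", "cp1252")

# Saved as Windows-1252 but read as UTF-8: the byte is invalid, so the decoder
# substitutes U+FFFD, which browsers draw as a question mark in a black diamond.
diamond = save_then_misread(LEFT_CURLY_QUOTE, "cp1252", "utf-8")

# ascii() prints escape codes so the output is safe on any terminal.
print(len(three_symbols), ascii(three_symbols))
print(len(diamond), ascii(diamond))
```

The text itself is never damaged in either case; the bytes are fine, and only the decoding method is wrong.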

One of the simplest ways to fix this is to find an equivalent character and replace the bad character with the one you found. You can erase the curly double quotes and type a set of straight double quotes. If you didn’t create the web page you’re reading and some of the characters look wrong, you can check the setting that your browser calls Character Encoding to make sure it matches the encoding (and so the language) the web page was written in. If you want a real demonstration of this, go to http://en.wikipedia.org/wiki/Cyrillic_alphabet, review the page, especially the list of languages on the left side bar, and note what your browser currently has set for its Character Encoding. (Go to the View menu and click Character Encoding (Firefox) or Encoding (Internet Explorer).) If you change your character encoding (just remember what it was originally), you’ll see many characters that display weirdly.
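If you have a lot of pasted text to clean up, the "replace with an equivalent character" fix can be automated. The sketch below is one possible approach in Python; the name `straighten_quotes` and the choice of which four characters to swap are my own assumptions, and you may want to extend the table for dashes or ellipses.

```python
# Map each curly quote to its plain ASCII equivalent.
STRAIGHT_EQUIVALENTS = str.maketrans({
    "\u201c": '"',  # left double curly quote
    "\u201d": '"',  # right double curly quote
    "\u2018": "'",  # left single curly quote
    "\u2019": "'",  # right single curly quote / apostrophe
})

def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(STRAIGHT_EQUIVALENTS)

sample = "\u201cIt\u2019s fine,\u201d she said."
print(straighten_quotes(sample))
```

The result contains only ASCII characters, so it displays the same way under practically any character encoding.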

Another way to fix this is perhaps one of the best solutions: use a character set that includes as many characters as possible. This is known as Unicode. Technically, Unicode is the character set, and encodings such as UTF-8 are the ways of storing it as binary data, but for the sake of this post the two terms are synonymous. You should use a Unicode encoding as much as possible, as Unicode is designed to replace many character sets all at once.
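Here is a small Python sketch of why one Unicode encoding can stand in for many older character sets. The sample characters and the helper names `utf8_length` and `fits_in` are hypothetical, chosen only to illustrate the idea.

```python
# One sample character from several scripts, written as escape codes.
samples = {
    "Latin": "A",
    "Greek": "\u03a9",        # capital omega
    "Cyrillic": "\u042f",     # capital ya
    "Arabic": "\u0628",       # beh
    "curly quote": "\u201c",  # left double curly quote
}

def utf8_length(character):
    """Number of bytes UTF-8 needs to store one character."""
    return len(character.encode("utf-8"))

def fits_in(text, encoding):
    """True if every character in text exists in the given character set."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

mixed = "".join(samples.values())

for name, character in samples.items():
    print(name, "takes", utf8_length(character), "byte(s) in UTF-8")

# Only UTF-8 can hold all five characters in the same piece of text.
for encoding in ("utf-8", "cp1252", "cp1251"):
    print(encoding, fits_in(mixed, encoding))
```

ASCII characters still take a single byte in UTF-8, which is why plain English text looks identical in ASCII and UTF-8.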

The character encoding that your browser uses by default is generally set by some code in the web page itself using a special meta tag (it can be set using other methods as well, such as an HTTP header sent by the web server). A tag of this kind typically looks like this:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This meta tag specifies the UTF-8 encoding, which is a form of Unicode. Make sure web pages you create use this meta tag, and place it as early in the HTML code as possible, because the browser interprets all the remaining characters in the HTML code according to it.
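If you want to check which encoding a page declares, a short script can read the meta tag for you. The Python sketch below uses the standard library's `html.parser`; the class name `CharsetFinder`, the function `declared_charset` and the sample HTML are all hypothetical, and it only looks at meta tags (it ignores HTTP headers, which can also set the encoding).

```python
from html.parser import HTMLParser

class CharsetFinder(HTMLParser):
    """Remember the first character encoding declared in a meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        # Short form: <meta charset="utf-8">
        if attributes.get("charset"):
            self.charset = attributes["charset"].strip().lower()
            return
        # Long form: <meta http-equiv="Content-Type" content="...; charset=utf-8">
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = (attributes.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            value = content.split("charset=", 1)[1]
            self.charset = value.split(";")[0].strip()

def declared_charset(html_text):
    """Return the encoding named in the page's meta tag, or None."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset

# Hypothetical sample page, for illustration only.
page = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    "<title>Example</title></head><body></body></html>"
)
print(declared_charset(page))
```

A result of `None` means the page declares no encoding in a meta tag, so the browser has to fall back on other hints or guess.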

If you’re wondering why you shouldn’t just use an image to display your troublesome text, remember that you probably want that text indexed by the search engines, and they most likely can’t read the text in the image. If you attempt to use an alt attribute on the image instead, you’ll find that the engines will generally not give text in the alt attribute the same importance as real text. In addition, www.google.com as well as Google’s search engine results pages (SERPs) use Unicode. This means that Google is using Unicode as a catch-all character set to maximize the likelihood that all characters shown on the SERPs (content that, in a sense, Google does not control) will display correctly.
