A few weeks ago we noticed that one of our clients had a single web page showing up multiple times in Google. In the SEO world, this is known as duplicate content and is generally frowned upon. The URLs for the two entries in question looked something like this:
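The actual client URLs aren't reproduced here, but the pattern looked something like the following (with example.com standing in for the real domain):

```
http://www.example.com/Products.html
http://www.example.com/products.html
```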
As you can see, the difference in case is what makes the search engines treat these as two separate pages, even though they point to the same physical page. On a Linux server this would matter less, because URLs there are case sensitive: one of the above URLs would throw a 404 (Page Not Found) error, so in effect only one page would get indexed. On a Windows server, however, URLs are not case sensitive, so both URLs serve up the same web page. Upon further examination of the internal linking structure, we noticed that the same case (lower) was being used for all of the internal URLs. The problem came from the outside world pointing to the same page in various case combinations. People linking to the website from elsewhere could potentially trigger duplicate content penalties from the search engines, which is not good.
The solution to this problem is quite simple and elegant. We added one rewrite rule to the website’s .htaccess file, which permanently redirects (301) any URL containing upper-case characters to its all-lower-case equivalent. Since you cannot predict or enforce how people link to your website, we strongly suggest you use this simple solution to prevent this duplicate content penalty.
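A sketch of this kind of rule, assuming Apache with mod_rewrite enabled (note that the RewriteMap directive itself must live in the main server config, since .htaccess files cannot define maps):

```apache
# In httpd.conf (RewriteMap cannot be declared inside .htaccess):
#   RewriteMap lc int:tolower

# In the site's .htaccess:
RewriteEngine On
# If the requested path contains any upper-case letter...
RewriteCond %{REQUEST_URI} [A-Z]
# ...issue a 301 redirect to the all-lower-case equivalent.
RewriteRule (.*) ${lc:$1} [R=301,L]
```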
Every day, I work with a bunch of characters while working on clients’ websites. Happy characters, popular characters, important characters. The characters I’m talking about, though, are not people but digital representations of letters and numbers inside the computer. Yes, the sentence you are reading right now is made of letters, punctuation and spaces, but to a computer, they are all called characters.
In the early days of computers, there was a small set of characters that computers were designed to show. These included the 26 Latin/English upper-case letters, the digits and punctuation; eventually, lower-case letters were thrown in as well. This basic set totaled 128 characters and included characters you couldn’t see (like a space, a tab and a line feed/enter). It was called the ASCII (pronounced: as-key) character set. This worked great for a while, particularly if you only needed to show English text. But what if you wanted to show special symbols? If you needed 150 unique symbols, once you got past 128 the computer would not know what to do: for symbol #140 it would probably show some weird character that was its own interpretation of #140, not the symbol you had in mind. Or what if you wanted to use Greek letters (or any other non-Latin characters) in combination with Latin ones? To fix that, new character sets were created that included all the ASCII characters and added more to them. Other character sets contained a completely different set of letters and symbols for other languages, for example Cyrillic and Arabic.
Herein lies a potential problem. When the computer is told to use a particular method to decode binary data into characters, it may run into bytes for which it can’t find the character you were expecting. This often shows up, for example, when Microsoft Word automatically converts straight double quotes to curly quotes and someone then copies that text into a web page editing program. If the web page is not set to display the correct set of characters, the curly quotes may be out of the range of characters the browser can display, and things like question marks in black diamonds may show instead. Or worse, a group of two or three symbols may appear in place of your left or right curly quotes.
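You can reproduce this mismatch directly. The sketch below (in Python, purely for illustration) encodes a left curly quote as UTF-8 bytes and then decodes those same bytes as if they were Windows-1252, which is exactly how that familiar two-or-three-symbol garbage appears:

```python
# A left curly double quote (U+201C) stored as UTF-8 occupies three bytes.
utf8_bytes = "\u201c".encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x9c'

# A browser (or editor) told to treat those bytes as Windows-1252
# maps each individual byte to its own character instead:
mojibake = utf8_bytes.decode("windows-1252")
print(mojibake)  # â€œ
```

The same three bytes are interpreted one way under UTF-8 and another way under Windows-1252; neither interpretation is "wrong" on its own, which is why the page must declare which one to use.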
One of the simplest ways to fix this is to replace the bad character with an equivalent one: erase the curly double quotes and type straight double quotes in their place. If you didn’t create the web page you’re reading and some of the characters look wrong, you can check the setting your browser calls Character Encoding to make sure it matches the language the page was written in. If you want a real demonstration of this, go to http://en.wikipedia.org/wiki/Cyrillic_alphabet, review the page, especially the list of languages on the left side bar, and note what your browser currently has set for its Character Encoding. (Go to the View menu and click Character Encoding (Firefox) or Encoding (Internet Explorer).) If you change your character encoding (just remember what it was originally), you’ll see many characters display weirdly.
Another way to fix this is perhaps one of the best solutions: use a character set that includes as many characters as possible. This is Unicode. Technically, Unicode is a character set and UTF-8 is one of its encodings, but for the sake of this post the terms are used interchangeably. You should use Unicode as much as possible, as it is designed to replace many separate character encodings all at once.
The character encoding that your browser uses by default is generally set by some code in the web page itself using a special meta tag (it can be set using other methods as well):
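The tag in question is the standard charset declaration placed in the page’s head; a typical form is:

```html
<head>
  <!-- Tell the browser to decode this page's bytes as UTF-8 -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
```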
This meta tag specifies the UTF-8 encoding, which is a form of Unicode. Make sure web pages you create use this meta tag and place it as early in the HTML code as possible because the browser interprets all the remaining characters in the HTML code according to it.
If you’re wondering why you shouldn’t just use an image to display your troublesome text, remember that you probably want that text indexed by the search engines, and they most likely can’t read text inside an image. If you attempt to use an alt attribute on the image instead, you’ll find that the engines generally don’t give text in the alt attribute the same weight as real text. In addition, both www.google.com and Google’s SERPs use Unicode. Google uses Unicode as a catch-all character set to maximize the likelihood that all characters shown on the SERPs (content that, in a sense, Google has no control over) will display correctly.
Last week, in a rare unified move, all three major search engines announced support for a new “canonical URL tag” designed to help search engines understand a website with multiple URLs displaying the same content. Basically, all a site owner needs to do is add this tag to the head section of all versions of a duplicated page. So, for example, this tag:
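The tag itself is a link element; for instance, for a page whose preferred URL is http://www.example.com/page.html (a placeholder here, since the original example URL isn’t shown), it would look like this:

```html
<link rel="canonical" href="http://www.example.com/page.html" />
```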
would be added to the head section of all the versions of the same page shown below:
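For example, using example.com as a placeholder domain, all of the following URLs might serve the same content and would each carry the canonical tag:

```
http://www.example.com/page.html
http://example.com/page.html
http://www.example.com/page.html?sessionid=1234
http://www.example.com/page.html?ref=homepage
```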
Adding the canonical tag to all these potential versions of the page tells search engines that these URLs are essentially the same page and should be treated as such. This lets them easily determine which page should be listed while ensuring that all the linking value for these pages is preserved and combined under one URL.
The introduction of this new tag gives site owners an alternate way to address duplicate content issues created by the way their site is designed. Until now, the only solution that worked for all three search engines was to restrict robot access to duplicate pages using instructions in the robots.txt file, robots meta tags or both. Website owners who have been using those methods and decide to switch to the new tag will need to remove the instructions restricting access to duplicated pages from their robots.txt files and/or remove the robots meta tags, so that search engines can find the new canonical URL tags.
Unfortunately, for some websites, the robots meta tags and robots.txt file may remain the only viable solution to duplicate content, because although this new tag addresses which page should be indexed, it does not resolve the crawling problem associated with duplicate URLs. Since search engine robots do not realize that these pages are all the same until after they have been crawled and indexed, they may still waste valuable crawling time accessing the same content, potentially delaying the indexing of unique content. Furthermore, all three search engines have indicated that they will treat the canonical URL tag as a “suggestion” and will still use alternate means to determine which URL should be displayed in duplicate content situations. This is why the best course of action is not to give search engines duplicate URLs in the first place; robots.txt, robots meta tags or the canonical URL tag should be used only when there is no way to program the site to be search engine friendly.
More details about this new tag can be found here: