It’s been a while since we’ve talked about URL canonicalization on our blog, so I’ll quickly review what it is and then talk about file name canonicalization and how it can affect your SEO endeavors.
Canonicalization is the work you do on your site to ensure that content from one specific URL does not also show up under another URL. Without it, you have a type of duplicate content issue. If the engines index more than one copy of the same content from separate, full URLs (those containing all necessary parts), they will be forced to divide the “strength” of that content between the URLs. That reduces the “strength” of every URL involved and reduces the chance that any one of them will show up in a search results page for a given search term. Most of the time, the issue arises when www.example.com and example.com show exactly the same content, both on the home page and on every other page of the site.
Having un-canonical host names is one way that duplicate content can become a problem for your site: a search engine indexes content from your home page at www.example.com and indexes exactly the same content at example.com. While on the subject of un-canonical host names, you should also know that it is possible to have un-canonical protocols in your URLs. In a complete URL like http://www.example.com/about-us.html, the “http” is known as the protocol (also called the scheme). If search engines happen to index the same content from https://www.example.com/about-us.html as well, you have just stepped into a duplicate content issue due to un-canonical protocols.
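To make the host name and protocol cases concrete, here is a minimal Python sketch that rewrites any variant of a URL to one preferred form. The preferred protocol, the preferred host name, the alias list, and the function name are all assumptions chosen for illustration; your site may well prefer https or the bare host name instead.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this sketch; use whichever form your site has chosen.
PREFERRED_PROTOCOL = "http"
PREFERRED_HOST = "www.example.com"
# Host names assumed to serve the same content as the preferred host.
HOST_ALIASES = {"example.com", "www.example.com"}


def canonical_host_and_protocol(url):
    """Return url rewritten to the preferred protocol and host name."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is lowercased and excludes any port
    if host not in HOST_ALIASES:
        return url  # not one of our hosts, so leave it untouched
    # Keep the path, query string, and fragment; replace only protocol and host.
    # Note that any explicit port in the original URL is dropped here.
    return urlunsplit(
        (PREFERRED_PROTOCOL, PREFERRED_HOST, parts.path, parts.query, parts.fragment)
    )


print(canonical_host_and_protocol("https://example.com/about-us.html"))
```

Every variant of the about-us page maps to the same string, which is the one URL you would want the engines to index.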
Now let’s say you have a store (or any other web-based application) on your site that is accessed from http://www.example.com/store. Generally, your ecommerce program would be located in a folder on your server called ‘store’ (and usually, the URLs of your store’s products are based on that URL: http://www.example.com/store/blue-widgets.html). When the folder itself is requested, the server typically returns a default file such as index.php, so the same page answers at two addresses. It’s quite possible that a search engine could index content at http://www.example.com/store/index.php and at http://www.example.com/store/ and count them as duplicate content. This could happen if another site links directly to one of those URLs and your site links to the other. This is an example of what I call a file name canonicalization issue.
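A file name canonicalization issue can be illustrated with a short Python sketch that maps the “default file” form of a URL back to the folder form. The list of default file names and the function name are assumptions for illustration; the real list depends on how your server is configured.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of default documents; the real one is set in the server config.
DEFAULT_DOCUMENTS = ("index.php", "index.html", "index.htm")


def strip_default_document(url):
    """Return url with a trailing default file name removed from the path."""
    parts = urlsplit(url)
    # Split the path into the folder part and the last segment (the file name).
    directory, _, filename = parts.path.rpartition("/")
    if filename in DEFAULT_DOCUMENTS:
        path = directory + "/"  # /store/index.php becomes /store/
    else:
        path = parts.path  # ordinary pages are left alone
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(strip_default_document("http://www.example.com/store/index.php"))
print(strip_default_document("http://www.example.com/store/blue-widgets.html"))
```

The first URL collapses to the folder form, while the product page is returned unchanged because its file name is not a default document.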
Those are the main canonicalization issues you are likely to encounter. Often, however, you’ll find a site that has a combination of them. For example, a store home page might be reachable under every combination of the host name, protocol, and file name variants described above; all of those URLs are different, but they all return exactly the same content.
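To see how quickly combined issues multiply, the following Python sketch enumerates the variants for a single page. It assumes a site that answers on both protocols and both host names and serves index.php as the default file for the ‘store’ folder; the variable and function names are illustrative only.

```python
from itertools import product

# Assumed variants, taken from the examples used in this post.
PROTOCOLS = ["http", "https"]
HOSTS = ["www.example.com", "example.com"]
PATHS = ["/store/", "/store/index.php"]


def duplicate_variants():
    """Build every protocol, host name, and file name combination for one page."""
    return [
        f"{protocol}://{host}{path}"
        for protocol, host, path in product(PROTOCOLS, HOSTS, PATHS)
    ]


for url in duplicate_variants():
    print(url)
```

Two protocols, two host names, and two file name forms give eight distinct URLs for what is really one page.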
Perhaps you’re wondering if there would be a duplicate content issue between http://www.example.com/blog and http://www.example.com/blog/, two URLs that differ only by the trailing forward slash.
The answer will most likely depend on whether ‘blog’ is an actual folder on your server or a file. (Your web server, blog, or any other web-based application may be set up in a way that does not require file extensions such as .html.) A web server is the software on the hosting company’s computer that waits for requests and responds when someone asks for a page on your site, and most web servers work the same way in this situation. If ‘blog’ is an actual folder on the web server, then whenever someone asks for http://www.example.com/blog (without the trailing forward slash), the web server automatically issues a 301 redirect to http://www.example.com/blog/. (This is the correct redirect type for this situation, by the way, and it is the reason this should not cause a duplicate content issue.) If someone asks for http://www.example.com/blog/ (with the trailing forward slash), the web server doesn’t redirect at all. I’ll talk more about this in my next post.
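The folder behavior described above can be modeled in a few lines of Python. This is only a sketch of the decision a typical server makes, not any particular server’s code; the set of folders and the function name are hypothetical, and missing pages (404s) are not handled.

```python
# Hypothetical set of paths that are real folders on the server.
FOLDERS = {"/blog", "/store"}


def respond(path):
    """Return (status, redirect_location) for a requested path."""
    if path in FOLDERS:
        # A folder was requested without the trailing slash:
        # send a permanent (301) redirect to the slash form.
        return 301, path + "/"
    # Everything else, including /blog/ itself, is served directly in this sketch.
    return 200, None


print(respond("/blog"))
print(respond("/blog/"))
```

Because the slash-less form only ever answers with a permanent redirect, the engines are pointed at a single URL for the folder’s content.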
Marjory Meechan, in her blog post “How to Resolve the Canonicalization Issue without Access to your Server,” discussed un-canonical host name issues and one way to fix them. Please read my next post, in which I discuss the causes of these canonicalization issues and suggest more dependable methods to eliminate them, provided that you have some level of server access.