Welcome back to my series on fixing canonical URL issues. In my last post, “Fixing un-canonical URLs. Oh joy! Part 3,” I discussed how case-insensitivity and having a default index file could undermine your URL canonicalization efforts. Today, we’ll talk about query strings and how they can do the same. A query string is a grouping of parameters and values at the end of your URLs that looks like “?podID=249&catID=31”. Let’s review the areas where your site can have un-canonical URLs:
1. Protocols (http and https)
2. Domain and subdomain names (sometimes referred to as host names)
3. URL paths
4. File names
5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
6. Query strings
7. Combinations of any or all of the above
Let’s say you have a web page that is used as a landing page and you want to track which of your affiliates referred visitors to your site. You give each affiliate the URL to that page, but add a customized query string that contains the name of that particular affiliate. The landing page, like many web pages, shows the same generic content regardless of the query string in its URL, but may have another section that does change depending on the query string (even if the change is unnoticeable to a visitor or a search engine). Your landing page tallies (in a database) the number of visits from each affiliate. Though the treatment of URLs with query strings may vary from search engine to search engine, query strings can cause different content to show at the same domain and URL path. You should consider query strings in your efforts to rid your site of duplicate content.
In addition to tracking affiliates and referrer sources, query strings are also used by web applications to display dynamic content from different sections of a site using the same page template or layout. In fact, this is the most common use of query strings in URLs. With this scenario, you can see that query strings allow the web page creator to show a variety of content dynamically while maintaining just one file instead of maintaining one static web page file for each set of content he or she wants to show. A good example of this is an ecommerce site.
If you happen to have a web page that shows one set of contents using one URL plus query string and the same set of contents using the same URL but a different query string, you may have an un-canonical URL issue.
To fix this, you probably don’t want to blindly 301-redirect all URLs that have query strings to the same, respective URLs with the query string chopped off for the simple reason that someone or something probably intended for those query strings to do something meaningful.
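As a rough illustration of being selective rather than chopping off every query string, the sketch below removes only parameters that are known not to change the content and leaves the rest alone. The `TRACKING_PARAMS` set and the `canonical_url` function are hypothetical names chosen for this example; which parameters are safe to drop depends entirely on your own site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption for illustration: parameters that only identify the referrer
# and never change what the page shows.
TRACKING_PARAMS = {"affiliateID"}


def canonical_url(url):
    """Return the URL with tracking-only parameters removed.

    All other parameters are kept in their original order, because they
    probably select the content that the page displays.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(canonical_url("http://www.example.com/?affiliateID=1446545"))
print(canonical_url("http://www.example.com/store/?podID=249&catID=31&affiliateID=7"))
```

The function only computes the canonical form. Whether you redirect to it, and at what point, is a separate decision for your site.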
Because, under normal circumstances, you want to use the query string appropriately in your web pages, one safe method to canonicalize URLs with query strings (that show the same content) is to add a robots meta tag that instructs the search engines to not index the page. This will work as long as the same URL without the query string does not have the robots meta tag. Many times, however, there is a low chance that the search engines even know about the URL with a query string, because you would not publish that URL anywhere.
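Here is a minimal sketch of that idea, assuming your page template can call a small helper while it builds the head section. The helper name `robots_meta` and the `TRACKING_PARAMS` set are made up for this example; the tag it returns is the standard robots meta tag with a `noindex` value.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumption for illustration: the query parameters that mark a URL as a
# tracking variant of a page that is also reachable without them.
TRACKING_PARAMS = {"affiliateID"}


def robots_meta(url):
    """Return a noindex robots meta tag for tracking URLs, else an empty string.

    The same URL without the tracking parameter gets no tag, so that
    version stays eligible for indexing.
    """
    query = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    if any(name in TRACKING_PARAMS for name, _ in query):
        return '<meta name="robots" content="noindex">'
    return ""


print(robots_meta("http://www.example.com/?affiliateID=1446545"))
print(repr(robots_meta("http://www.example.com/")))
```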
Another way to canonicalize URLs with query strings is to create SEO-friendly versions of these URLs. Creating SEO-friendly URLs would normally take a URL like http://www.example.com/store/?podID=249&catID=31 and turn it into something like http://www.example.com/store/cellphones/nokia-8851-clam-shell-camera.html. It doesn’t make sense to create SEO-friendly versions of URLs such as http://www.example.com/?affiliateID=1446545 because changing the affiliate ID in the query string would probably not change the content of the page substantially. There are various ways to create SEO-friendly URLs. Your CMS may already offer this. If you’re not using a CMS that supports this, you may have to resort to creating URL rewrite rules for your website. (The first part of this series has several links to resources about URL rewriting for both Microsoft IIS and Apache web servers.)
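To give a feel for how a friendly path like the one above might be built, here is a small sketch. It assumes your application can look up a category name and a product name for each pair of IDs; the names passed in below are guesses based on the example URL, and `slugify` and `friendly_path` are hypothetical helpers, not part of any CMS.

```python
import re


def slugify(text):
    """Lowercase the text and join its runs of letters and digits with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


def friendly_path(category, product_name):
    """Build a /store/<category>/<product>.html path from two display names."""
    return f"/store/{slugify(category)}/{slugify(product_name)}.html"


# Assumed lookup results for podID=249 and catID=31.
print(friendly_path("Cellphones", "Nokia 8851 Clam Shell Camera"))
```

Your rewrite rules or CMS would still need to map the friendly path back to the right product, which this sketch does not cover.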
As the topic of query strings is the last item in the 7-item list above, my next post will wrap up this series and provide some tips in your efforts to rid your site of un-canonical URLs.
Welcome back to my series on fixing canonical URL issues. Here again are the areas of canonical URL issues:
1. Protocols (http and https)
2. Domain and subdomain names (sometimes referred to as host names)
3. URL paths
4. File names
5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
6. Query strings
7. Combinations of any or all of the above
In my last post, I discussed how protocols (http and https) can present un-canonical URLs to the search engines and how it can create duplicate content. Let’s pick up where we left off.
You have two domain names, example.com and example.biz. You want traffic to example.biz to see content at example.com. Your hosting company set up your web hosting account on their servers to be able to show visitors to www.example.com and visitors to example.com using the same files (that way you don’t have to maintain two versions of, say, the Home Page). This is the default way most hosting companies create new accounts.
To fix canonical URL issues related to different top-level domains (e.g. edu, com, org, us, and so on; look out for anything-goes top-level domains as well), different domains, and/or different (or no) subdomains, you have two options. You can set up your server to show content from the non-canonical domain(s) to the visitor while banning that content from being indexed by the engines (using a robots.txt file or the robots meta tag), or you can properly redirect both the visitor and the search engine to the canonical domain. First, choose which subdomain/domain/top-level domain combination you want to be canonical. Then set up the web servers or hosting accounts that host the non-canonical domains to ‘301 redirect’ to the canonical domain using the same rewrite rules or the ‘include’ method I previously discussed. (In a future post, I will discuss URL rewriting on Apache servers and compare it to URL rewriting on Windows servers.) Be aware that if you bought multiple domain names from a registrar, only your canonical domain may actually be hosted, while the other domains may be using the registrar’s ‘forwarding’ service to redirect to your canonical domain. If you use their forwarding service or even their ‘301 redirect’ feature, they may not implement a 301 redirect consistently or properly. I am speaking from first-hand experience with well-known hosting companies.
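The redirect decision itself is simple enough to sketch in a few lines. The example below assumes you picked www.example.com as the canonical host and treats the other host names as aliases; both the choice and the names `CANONICAL_HOST`, `ALIAS_HOSTS` and `redirect_for` are assumptions made for illustration. On a real site this logic would normally live in your server’s rewrite rules rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: the host you chose as canonical, and the
# other host names that serve the same files.
CANONICAL_HOST = "www.example.com"
ALIAS_HOSTS = {"example.com", "example.biz", "www.example.biz"}


def redirect_for(url):
    """Return (301, target URL) when the host is a known alias, otherwise None.

    The path, query string and fragment are carried over unchanged, so a
    deep link on an alias lands on the matching page of the canonical host.
    Any port number in the original URL is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lowercased
    if host in ALIAS_HOSTS:
        target = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
        )
        return 301, target
    return None


print(redirect_for("http://example.biz/about.html?x=1"))
print(redirect_for("http://www.example.com/about.html"))
```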
You were categorizing your pages and realized that you accidentally placed a page in both the /blog/Colors directory and the /blog/shapes directory. This could happen from physically copying a file to another directory or perhaps you are using a blogging application (or any web application for that matter) and categorized a post in two categories. In the latter case, if the blogging application does not handle cross-posts in an SEO friendly way, you might have duplicate content issues.
As far as the URL path goes, it would be a good idea to know which URLs have duplicate content of other URLs on your site. If you don’t know, try the tools offered by Google’s Webmaster Tools. Different web applications and different types of web applications (such as blog software from vendor A vs. vendor B or CMS software from vendor C) handle canonical URL paths differently. Check what web application is powering the pages at those URLs. Also see if there are add-ons or plugins for your software that can handle duplicate content issues created from assigning multiple tags or categories to content. They could add a robots meta tag to one of the duplicate pages.
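If you would rather check for yourself, one rough approach is to fetch your pages and group the URLs whose content is identical. The sketch below assumes you already have the page bodies in a dictionary keyed by URL; `find_duplicates` is a hypothetical helper, and the sample paths and contents are invented. It only catches exact matches, so pages that differ by even one character are treated as different.

```python
import hashlib
from collections import defaultdict


def find_duplicates(pages):
    """Group URLs whose page bodies are exactly identical.

    pages maps each URL (or path) to the text of the page served there.
    Returns a list of groups, each a sorted list of two or more URLs.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Invented sample data: the same post filed under two categories.
pages = {
    "/blog/Colors/my-post.html": "<p>Same post</p>",
    "/blog/shapes/my-post.html": "<p>Same post</p>",
    "/blog/shapes/other.html": "<p>Another post</p>",
}
print(find_duplicates(pages))
```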
Another way to handle this is to 301 redirect one of the duplicate pages to the other. In a pinch, and depending on the URL structure of your site, you may be able to use the robots.txt file to exclude a certain section or an exact URL of your site from the indexes, thereby removing the duplicate content while making sure that any other page in that section is not removed from the index. You want to be careful which URL you redirect from or block using the robots.txt file, because one of those URLs might be more optimized than the other.
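Before you publish a robots.txt change, it can help to confirm that the rule blocks only the URL you meant to block. The sketch below uses Python’s standard `urllib.robotparser` module for that check. The robots.txt contents and the file name my-post.html are invented for the example, and `is_blocked` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Invented rule: block one duplicate page, not the whole /blog/shapes/ section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/shapes/my-post.html
"""


def is_blocked(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text disallows the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


for path in (
    "/blog/shapes/my-post.html",
    "/blog/shapes/other.html",
    "/blog/Colors/my-post.html",
):
    print(path, is_blocked(ROBOTS_TXT, "http://www.example.com" + path))
```

Disallow rules match by prefix, so a shorter rule such as /blog/shapes/ would block every page in that directory. Checking a few neighboring URLs as above is a quick way to catch that.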
Please remember to read my next post in this series about file names, the mysterious forward slash at the end of URLs and case sensitivity in URLs. Check back soon!