Welcome back to my series on fixing canonical URL issues. In my last post, “Fixing un-canonical URLs. Oh joy! Part 2,” I discussed how host names can present un-canonical URLs to the search engines and how to fix those problems. As a review, here again are the areas of canonical URL issues:
1. Protocols (http and https)
2. Domain and subdomain names (sometimes referred to as host names)
3. URL paths
4. File names
5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
6. Query strings
7. Combinations of any or all of the above
Let’s continue expanding on other potential culprits of un-canonical URLs.
Your web server is using a setting called ‘default index file’ or ‘default content file.’ This setting lets your web site visitors see a custom listing or ‘index’ of a directory’s contents by knowing just the name of the directory, without having to know the name of the file that shows this index. The setting is also part of the setup tasks for a new web hosting account, and it provides some security in case you don’t want to show all visitors a list of every file in that directory. The default index file is usually used as an introductory page to the contents of that section of your web site. (Default index files, depending on the kind of server your site is on, have names such as index.html, index.htm, home.html, default.htm, default.html, default.asp, default.aspx, index.php, index.cfm and so on.) With this setting in place, when a visitor goes to a URL that ends in the name of a directory on your web server (including the ‘root’ directory) followed by a trailing forward slash and no file name (something like http://www.example.com/), the web server does not redirect anywhere, but shows the visitor the first file it can find in the list of default index files. If the visitor types the same URL but ends it with the name of the default index file (something like http://www.example.com/index.html), the web server will also not redirect, and will show exactly the same content as without the file name. That gives the search engines two URLs for the same content.
Before trying to canonicalize a default index file to a forward slash (“/blog/index.php” -> “/blog/” for example), make sure that the content you expect to show does in fact show when a visitor leaves off the file name in the URL. Also make sure that the web server responds to that request with a 200 response code and nothing else. After doing so, you can use one of two methods to 301 redirect “/blog/index.php” to “/blog/” or whatever your situation is. One method is using URL rewriting rules and regular expressions. This method generally gives a faster response from your web server (read: less wait time for your visitors) and is probably cleaner than the other method, which is incorporating specific logic into your ‘include files.’ The logic of both methods is pretty simple: if the requested URL ends in a forward slash plus one of the default index file names, send a 301 response code and the location of the redirection to the browser. The redirection location will be the originally requested URL with the default index file name removed, so that the path ends in a forward slash, followed by any query string and/or URL fragment.
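The redirect logic described above can be sketched in a few lines of Python. This is only an illustration, not a drop-in rewrite rule: the function name index_redirect_target and the DEFAULT_INDEX_FILES set are hypothetical, and the list of file names is an assumption you would need to match to your own server’s setting.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of default index file names; match it to your server's setting.
DEFAULT_INDEX_FILES = {
    "index.html", "index.htm", "home.html", "default.htm", "default.html",
    "default.asp", "default.aspx", "index.php", "index.cfm",
}


def index_redirect_target(url):
    """Return the URL to 301 redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    # Split the path into everything before the last slash and the file name.
    directory, slash, filename = parts.path.rpartition("/")
    if not slash or filename.lower() not in DEFAULT_INDEX_FILES:
        return None
    # Drop the file name, keep the trailing slash, and carry over
    # any query string and fragment unchanged.
    return urlunsplit(
        (parts.scheme, parts.netloc, directory + "/", parts.query, parts.fragment)
    )


print(index_redirect_target("http://www.example.com/blog/index.php?page=2"))
```

A request for a URL that already ends in a forward slash returns None here, so the check cannot cause a redirect loop. The caller would send the 301 response code with the returned value as the location.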
When you created a new directory under blog called Colors, you forgot your convention for naming directories was all lower-case. After creating the directory, you tested it to make sure your visitors would have no problem getting to the pages in that directory. You went to www.example.com/blog/colors and everything looked great. You didn’t realize you had made a mistake until you noticed in your traffic logs that many people were looking at a slightly different URL: www.example.com/blog/Colors. Most of the time, this is caused by having your site on a Windows-based server and not having anything in place (such as a CMS that is aware of this issue) that can handle this problem. Windows servers treat URL paths as case-insensitive by default; Linux and Unix servers are typically case-sensitive. If your site were running on a Linux server and a visitor browsed to www.example.com/blog/colors, they would probably get a Page Not Found error because the ‘colors’ directory doesn’t exist. Windows’ case insensitivity makes it easier for visitors to get to pages in your site if you’ve mixed upper and lower case letters in either your directory or file names (or both), but it also means every case variation of a URL shows the same content.
You can resolve canonical URL issues related to case insensitivity with a variety of methods. First, you can try the tools offered by Google’s Webmaster Tools to find these issues if you’re not sure if or where they might lurk in your site. Second, you can use a simple rewrite rule to 301 redirect any case variation of a particular URL to the canonical URL. Third, you can use a CMS or blogging software that will automatically change your new page, category or tag name to all lower-case before that page, category or tag goes live on your site.
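As a rough sketch of the redirect approach, here is the same idea in Python. It assumes your canonical URLs use all lower-case paths, as in the naming convention mentioned earlier; the function name lowercase_redirect_target is hypothetical. Query strings are left untouched because their values may be case-sensitive.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_redirect_target(url):
    """Return the lower-case URL to 301 redirect to, or None if already canonical."""
    parts = urlsplit(url)
    lowered_path = parts.path.lower()
    if lowered_path == parts.path:
        # The path is already all lower-case, so no redirect is needed.
        return None
    # Only the path is lowered; query string and fragment are kept as requested.
    return urlunsplit(
        (parts.scheme, parts.netloc, lowered_path, parts.query, parts.fragment)
    )


print(lowercase_redirect_target("http://www.example.com/blog/Colors"))
```

If your canonical URLs contain upper-case letters on purpose, this blanket lower-casing would be wrong, and you would need a lookup of the canonical spelling for each path instead.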
Also, it is important to know that the paths you enter in your robots.txt file are case-sensitive. If you mean to block access to www.example.com/dontgohere.php by adding “/DontGoHere.php” to your robots.txt file, www.example.com/dontgohere.php will not be blocked.
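You can see this behavior for yourself with the robots.txt parser in Python’s standard library, which also compares paths case-sensitively. This is just a quick check, not a statement about how any particular search engine’s crawler is implemented; the helper name is_blocked and the two-line rule set are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines disallow the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content with the mixed-case path from the example above.
rules = [
    "User-agent: *",
    "Disallow: /DontGoHere.php",
]

print(is_blocked(rules, "http://www.example.com/DontGoHere.php"))
print(is_blocked(rules, "http://www.example.com/dontgohere.php"))
```

The first URL matches the rule exactly and is blocked; the second differs only in case and is not.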
My next post in this series will be about query strings, the stuff at the end of the URL after the question mark. Please stay tuned!