Articles in The 'SEO-theory' Tag


October 16 2008

Fixing un-canonical URLs. Oh joy! Part 5

by Jordan Sandford

Welcome back to my series on fixing un-canonical URLs. To date, we’ve looked at a variety of areas that could potentially cause the same content to be accessible from multiple URLs on your site, which is very problematic from a search engine perspective:

  1. Protocols (http and https)
  2. Domain and subdomain names (sometimes referred to as host names)
  3. URL paths
  4. File names
  5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
  6. Query strings

Let’s talk about the last item in my list:

  7. Combinations of any or all of the above

It is possible, for example, to have the same content accessible from both protocols (http and https) as well as both the www- version and the non-www- version. This scenario provides four URLs that display the same exact content:

http://www.example.com
https://www.example.com
http://example.com
https://example.com

Any combination of the issues (numbers 1 – 6) may be lurking on your site. In addition, one section of your site may be suffering from one combination of these issues while other sections may be suffering from another combination of issues. I usually find that the simplest way to fix a combination of issues is to first test for one issue and fix it and then move on to the next issue.
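As a rough sketch of that "one issue at a time" approach, the Python snippet below fixes the protocol first and the host name second. It assumes, purely for illustration, that http://www.example.com is the canonical version of the site; the function names and constants are hypothetical and not part of any particular web server or CMS.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only: http and the www- host are canonical.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"
KNOWN_HOSTS = {"example.com", "www.example.com"}


def fix_protocol(parts):
    """Issue 1: force every URL onto the canonical protocol."""
    return parts._replace(scheme=CANONICAL_SCHEME)


def fix_host(parts):
    """Issue 2: map the known host name variants onto the canonical host."""
    if parts.netloc.lower() in KNOWN_HOSTS:
        return parts._replace(netloc=CANONICAL_HOST)
    return parts


def canonicalize(url):
    """Apply each fix in turn, one issue at a time."""
    parts = urlsplit(url)
    for fix in (fix_protocol, fix_host):
        parts = fix(parts)
    return urlunsplit(parts)
```

With these assumptions, all four of the URLs listed above come out of canonicalize() as the same single URL. On a live site the result would be used as the target of a 301 redirect whenever it differs from the requested URL.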

All six of the potential problem areas I discussed arose from efforts to make the internet easier and more forgiving to use, and to reduce the amount of work a web visitor or web site administrator has to do. The underlying design of web servers predates search engines like Google or Live, and predates duplicate content being a problem. Since web servers weren’t really designed with search engines in mind, keep the above list in mind as you comb your site for canonical URL issues.

Perhaps the easiest way to avoid un-canonical URLs is to build your site (or section of your site) from the ground up with these potential problem areas in mind. Granted, that may be easier said than done.

I also suggest that any new pages/files/URLs you create on your site have a file extension appropriate to the scripting language on your server (.php, .asp, .aspx, .cfm, etc.) as opposed to .html or .htm (which are normally assumed to be file extensions of a “static” page). The reason is that if you need to redirect, for instance, from example.com/mypage.html to www.example.com/my-new-page.html and your web server limits or doesn’t support the use of tools like URL rewriting, you may have to take an SEO hit after renaming the file. This is because an html file normally cannot run scripts. (Redirecting to another page is a script function.) So essentially, creating new files on your website with “dynamic file extensions” allows much more flexibility in the future.

What’s even better is to build your site or section using a CMS that was designed to be SEO friendly.

Remember to watch out for misspellings in your URLs. That includes the path name (the part of the URL starting with the first forward slash up to, but not including, the question mark or fragment) and the query string. Also keep in mind that everything in the URL except the fragment can affect canonical URL/duplicate content issues. Finally, a phone call or email to your hosting company may resolve some canonical URL issues that you can’t seem to resolve yourself.

Also, be on the lookout for the new anything-goes top level domains (the “police” in traffic.police would be an anything-goes top level domain, for example, while edu, com and org are traditional top level domains) which could offer a few more URL canonicalization challenges in the near future.

I hope this series was helpful, time saving and useful. Best wishes to you and yours on all your URL canonicalization efforts!

September 19 2008

Fixing un-canonical URLs. Oh joy! Part 3

by Jordan Sandford

Welcome back to my series on fixing canonical URL issues. In my last post, Fixing un-canonical URLs. Oh joy! Part 2, I discussed how host names can present un-canonical URLs to the search engines and how to fix those problems. As a review, here again are the areas of canonical URL issues:

1. Protocols (http and https)
2. Domain and subdomain names (sometimes referred to as host names)
3. URL paths
4. File names
5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
6. Query strings
7. Combinations of any or all of the above

Let’s continue expanding on other potential culprits of un-canonical URLs.

Your web server is using a setting called ‘default index file’ or ‘default content file.’ This setting was created to allow your web site visitors to see a custom listing or ‘index’ of a directory’s contents by just knowing the name of the directory, without having to know the name of the file that shows this index. This setting is also part of the setup tasks for a new web hosting account and provides security in case you don’t want to show all visitors a list of all files in that directory. This default index file is usually used as an introductory page to the contents in that section of your web site. (Default index files, depending on the kind of server your site is on, have names such as index.html, index.htm, home.html, default.htm, default.html, default.asp, default.aspx, index.php, index.cfm and so on.) With this setting in place, when a visitor goes to a URL that ends in the name of a directory on your web server (including the ‘root’ directory), a trailing forward slash and no file name (something like http://www.example.com/), the web server does not redirect anywhere, but shows the visitor the first file it can find in the list of default index files. If the visitor types the same URL, but ends it with the name of the default index file (something like http://www.example.com/index.html), the web server will also not redirect, but will show exactly the same content as without the file name.

Before trying to canonicalize a default index file to a forward slash (“/blog/index.php” -> “/blog/” for example), make sure that the content you expect to show does in fact show when a visitor leaves off the file name in the URL. Also make sure that the web server only responds with a 200 response code. After doing so, you can use one of several methods to 301 redirect “/blog/index.php” to “/blog/” or whatever your situation is. One method is using URL rewriting rules and regular expressions. This method generally provides a faster reaction time by your web server (read: less wait time for your visitors) and is probably a cleaner way compared to the other method, which is incorporating specific logic into your ‘include files.’ The logic of both methods is pretty simple: if the requested URL ends in a forward slash plus one of the default index file names, send a 301 response code and the location of the redirection to the browser. The redirection location will be the originally requested URL with the default index file name removed, ending in a forward slash, plus any query string and/or URL fragment.
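The redirect logic described above could be sketched in Python roughly as follows. This is only an illustration of the decision, not a rewrite rule for any specific web server; the function name and the list of index file names are assumptions, so adjust the list to match what your own server actually treats as default index files.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list for illustration; use the names your own server is configured with.
DEFAULT_INDEX_FILES = {
    "index.html", "index.htm", "home.html", "default.htm", "default.html",
    "default.asp", "default.aspx", "index.php", "index.cfm",
}


def index_redirect_target(url):
    """Return the URL to 301 redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    # Split the path into the directory part and the last segment (the file name).
    directory, _, file_name = parts.path.rpartition("/")
    if file_name not in DEFAULT_INDEX_FILES:
        return None
    # Drop the index file name, keep the trailing slash, query string and fragment.
    return urlunsplit(parts._replace(path=directory + "/"))
```

For example, index_redirect_target("http://www.example.com/blog/index.php") gives back the “/blog/” version of the URL, while a request that already ends in “/blog/” returns None and is served normally. The calling code would send the 301 response code and a Location header whenever the result is not None.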

When you created a new directory under blog called Colors, you forgot your convention for naming directories was all lower-case. After creating the directory, you tested it to make sure your visitors would have no problem getting to the pages in that directory. You went to www.example.com/blog/colors and everything looked great. You didn’t realize you made a mistake until you noticed in your traffic logs that many people were looking at a slightly different URL: www.example.com/blog/Colors. Most of the time, this is caused by having your site on a Windows-based server and not having anything in place (such as a CMS that is aware of this issue) that can handle this problem. Windows servers are case-insensitive; Linux and Unix servers are case-sensitive. If your site was running on a Linux server and a visitor browsed to www.example.com/blog/colors, they would probably get a Page Not Found error because the ‘colors’ directory doesn’t exist. Windows’ case insensitivity makes it easier for visitors to get to pages in your site if you’ve mixed upper and lower case letters in either your directory or file names (or both).

You can resolve canonical URL issues related to case insensitivity with a variety of methods. First, you can try the tools offered by Google’s Webmaster Tools to find these issues if you’re not sure if or where they might lurk in your site. You can use a simple rewrite rule to 301 redirect any case-variation of a particular URL to the canonical URL. You can use a CMS or blogging software that will automatically change your new page, category or tag name to all lower-case before that page, category or tag goes live on your site.
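Here is a minimal Python sketch of the redirect idea, assuming your convention is that every URL path is all lower-case (as in the Colors example). The function name is hypothetical, and the sketch only decides where to redirect; your rewrite rule or include file would send the actual 301.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_redirect_target(url):
    """Return the all-lower-case version of the URL to 301 redirect to,
    or None if the path is already lower-case."""
    parts = urlsplit(url)
    lowered_path = parts.path.lower()
    if lowered_path == parts.path:
        return None
    # Only the path is changed; the query string is left exactly as requested.
    return urlunsplit(parts._replace(path=lowered_path))
```

Only use something like this on a case-insensitive server, or after renaming the directories and files themselves to lower-case. On a case-sensitive server, redirecting to a lower-case path that doesn’t exist would just trade a duplicate URL for a Page Not Found error.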

Also, it is important to know that the paths you enter in your robots.txt file are case-sensitive. If you mean to block access to www.example.com/dontgohere.php by adding “/DontGoHere.php” to your robots.txt file, www.example.com/dontgohere.php will not be blocked.
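You can see this behavior for yourself with the robots.txt parser that ships with Python’s standard library. The snippet below feeds it the rule from the example above and checks both spellings of the URL. Keep in mind it shows how Python’s parser matches paths; it is meant as a quick illustration, not a guarantee of how every crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from the example above, with mixed-case spelling.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /DontGoHere.php",
])

# The URL spelled exactly like the rule is blocked...
blocked_exact = not rules.can_fetch("*", "http://www.example.com/DontGoHere.php")

# ...but the all-lower-case URL is not, because the path match is case-sensitive.
blocked_lower = not rules.can_fetch("*", "http://www.example.com/dontgohere.php")
```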

My next post in this series will be about query strings, the stuff at the end of the URL after the question mark. Please stay tuned!

July 25 2008

Fixing Un-Canonical URLs. Oh joy! Part 1

by Jordan Sandford

In my previous post, Filenames, host names and canonicalization, oh my!, I talked about how duplicate content issues affect your search engine rankings, and specifically how un-canonical URLs create these issues. I mentioned different types of URL canonicalization issues you are likely to deal with in your SEO work. Since that post, I created a new, more complete list of canonical URL issues, and will go into a bit more detail in this and future posts describing how the issues actually arise and how to fix them.

Let’s start with a detailed guide on how to pronounce canonical: ca – NON – uh – cul. Also, canonicalization is pronounced: ca – non – uh – cul – i – ZAY – shun. That was easy!

Without further ado, here is the new list of areas of canonical URL issues:

1. Protocols (http and https)
2. Domain and subdomain names (sometimes referred to as host names)
3. URL paths
4. File names
5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
6. Query strings
7. Combinations of any or all of the above

Let’s break these down using an example URL:

https://example.org/blog/Colors/default.asp?action=go&sessid=6468439#section3

1. https is the protocol
As far as search engines go, there is only http and https.

2. org is the top level domain

3. example.org is the domain name
No subdomain is used. (Technically, www is a subdomain just like any other subdomain.)

4. /blog/Colors/default.asp is the URL path (everything after the top level domain, including the file name, and before the query string, if any)

5. default.asp is the file name
Many times, the URL path may not contain the file name at all and will just end with the forward slash (e.g. /blog/Colors/).

6. ?action=go&sessid=6468439 is the query string (everything after the question mark and before the # sign, if any)
The query string is made up of parameters and their values. “action” is a parameter and “go” is its value.

7. #section3 is the URL fragment (also known as a named anchor or bookmark)
A URL fragment tells the web browser where inside a specific web page to go to or to scroll to. It does not tell the web browser (nor the search engine) which specific page to go to, so therefore does not contribute to any URL canonicalization issues.
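If you’d like to pull a URL apart programmatically, the same breakdown can be sketched with Python’s standard urllib.parse module. The variable names are just for illustration, and the top level domain is found here by simply taking the text after the last dot in the host name, which is a simplification that works for this example.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://example.org/blog/Colors/default.asp?action=go&sessid=6468439#section3"
parts = urlsplit(url)

protocol = parts.scheme                          # "https"
domain_name = parts.hostname                     # "example.org"
top_level_domain = domain_name.rsplit(".", 1)[-1]  # "org" (simplified approach)
url_path = parts.path                            # "/blog/Colors/default.asp"
file_name = url_path.rsplit("/", 1)[-1]          # "default.asp" (empty if the path ends in "/")
query_string = parts.query                       # "action=go&sessid=6468439"
parameters = parse_qs(query_string)              # each parameter mapped to a list of its values
fragment = parts.fragment                        # "section3"
```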

To illustrate each of these, here is perhaps the worst-case scenario as it relates to canonical URLs (I will continue to add to this scenario from post to post in this series):
You have a site that has forms visitors fill out. You have an SSL certificate and people can go to https://www.example.com to see your site. Now, you want to give your visitors peace of mind by letting them know that your site is secured (by making sure that browsers see the lock icon that indicates a secure site). When the https version of your site is set up, the web server may pull files from the non-secure section, but send them back to the browser over a secure connection. (This ‘usability feature’ provides the convenience of only having to maintain one set of files instead of two sets.) Now, when visitors go to https://www.example.com/blog, they’ll see the same content as http://www.example.com/blog. This is your first duplicate content issue.

There are different ways to handle protocols when dealing with canonical URLs. You can block all access to the secure version of your site using a robots.txt file. However, if your web server or web hosting account is set up to serve your secure site from the same files as the non-secure site, search engines will see the regular robots.txt file when going to https://www.example.com/robots.txt. The most flexible way to circumvent this is to create a version of the robots.txt file that’s used only for https and name it appropriately (robots-ssl.txt, for example). Then use rewrite rules to internally redirect all requests for https://www.example.com/robots.txt to robots-ssl.txt. The search engines will still think they’re looking at https://www.example.com/robots.txt, the industry-standard location for the file. If your site is running on an Apache web server, see apache.org’s URL Rewriting Guides and yourhtmlsource.com’s URL Rewriting Guide.
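The decision behind that internal redirect is small enough to sketch in a few lines of Python. This is not Apache rewrite syntax, just the logic a rewrite rule would express; the function name is hypothetical, and robots-ssl.txt is the https-only file described above.

```python
def file_to_serve(scheme, requested_path):
    """Pick the file the server should send back for a request.

    The visitor (or search engine) is not redirected; the requested URL stays the same.
    """
    if scheme.lower() == "https" and requested_path == "/robots.txt":
        # Secure requests for robots.txt quietly get the https-only version.
        return "/robots-ssl.txt"
    # Everything else is served as requested.
    return requested_path
```

The robots-ssl.txt file itself would then contain the rules that block the secure version of the site, while the regular robots.txt keeps serving the non-secure site as before.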

Another way to handle this will only work if all the files on your web server can run server-side scripts. Usually, this is the case with .asp, .aspx, .cfm or .php files. If you have .htm or .html files, you may be able to ask your web hosting company to allow server-side scripts to be run in these files. Once all files on your web server can run server-side scripts, make a script that checks if the file was accessed via https, and if so, adds a robots meta tag that disallows the page from being indexed. This script needs to go in every one of your files, either by ‘including’ it or by pasting the script in at or near the beginning of the page. The file-include method can drastically reduce the time it takes to administer this script since you only have to make changes to the script in one place; all files that include the changed file are essentially updated automatically.
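As a minimal sketch of such a script, here is the check written in Python. The function name is hypothetical, and how you find out which protocol was used depends on your server and scripting language, so the protocol is simply passed in here. The idea carries over to .asp, .php or .cfm include files.

```python
def robots_meta_tag(scheme):
    """Return the tag to print inside the page's <head> section.

    Pages requested over https get a noindex tag; http pages get nothing extra.
    """
    if scheme.lower() == "https":
        return '<meta name="robots" content="noindex">'
    return ""
```

Keeping this in one shared include file means a later change (for instance, if you decide the https version should be the indexed one) only has to be made in one place.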

My next post will discuss canonical domain names, including subdomains, and URL paths. In following posts in this series, I will provide methods that maintain the usability features of the web (for both the webmaster and the visitor) yet prevent the duplicate content/canonical URL issues. Stay tuned!

© 2023 MoreVisibility. All rights reserved.