In my previous post, Filenames, host names and canonicalization, oh my!, I talked about how duplicate content issues affect your search engine rankings, and specifically how non-canonical URLs create them. I mentioned the different types of URL canonicalization issues you are likely to deal with in your SEO work. Since that post, I have put together a new, more complete list of canonical URL issues, and in this and future posts I will go into a bit more detail on how these issues actually arise and how to fix them.
Let’s start with a detailed guide on how to pronounce canonical: ca – NON – uh – cul. Also, canonicalization is pronounced: ca – non – uh – cul – i – ZAY – shun. That was easy!
Without further ado, here is the new list of areas of canonical URL issues:
1. Protocols (http and https)
2. Domain and subdomain names (sometimes referred to as host names)
3. URL paths
4. File names
5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
6. Query strings
7. Combinations of any or all of the above
Let’s break these down using an example URL: https://example.org/blog/Colors/default.asp?action=go&sessid=6468439#section3
1. https is the protocol
As far as search engines go, there are only two protocols: http and https.
2. org is the top level domain
3. example.org is the domain name
No subdomain is used. (Technically, www is a subdomain just like any other subdomain.)
4. /blog/Colors/default.asp is the URL path (everything after the top level domain, including the file name, and before the query string, if any)
5. default.asp is the file name
Many times, the URL path may not contain the file name at all and will just end with the forward slash (e.g. /blog/Colors/).
6. ?action=go&sessid=6468439 is the query string (everything after the question mark and before the # sign, if any)
The query string is made up of parameters and their values: “action” is a parameter and “go” is its value.
7. #section3 is the URL fragment (also known as a named anchor or bookmark)
A URL fragment tells the web browser where inside a specific web page to go to or to scroll to. It does not tell the web browser (nor the search engine) which specific page to go to, so therefore does not contribute to any URL canonicalization issues.
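To make the duplicate content risk from query strings concrete: two URLs that differ only in parameter order point to the same page but look like different URLs to a search engine. Purely as an illustration (a minimal sketch, not production code, and not tied to any particular SEO tool), here is a hypothetical Python function that normalizes a URL by sorting its query parameters and dropping the fragment, which, as noted above, does not affect canonicalization:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def canonicalize_query(url):
    """Return the URL with query parameters sorted alphabetically
    and the fragment removed, so equivalent URLs compare equal."""
    parts = urlsplit(url)
    # parse_qsl splits "b=2&a=1" into [("b", "2"), ("a", "1")]
    params = sorted(parse_qsl(parts.query))
    # Rebuild the URL with an empty fragment component
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), ''))
```

With this, ?sessid=6468439&action=go and ?action=go&sessid=6468439 reduce to the same string, making it easier to spot duplicates in, say, a crawl of your own site.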
To illustrate each of these, here is perhaps the worst-case scenario as it relates to canonical URLs (I will continue to add to this scenario from post to post in this series):
You have a site with forms that visitors fill out. You have an SSL certificate, and people can go to https://www.example.com to see your site. Now, you want to give your visitors peace of mind by letting them know that your site is secure (by making sure that browsers show the lock icon that indicates a secure site). When the https version of your site is set up, the web server may pull files from the non-secure section but send them back to the browser over a secure connection. (This ‘usability feature’ provides the convenience of only having to maintain one set of files instead of two.) Now, when visitors go to https://www.example.com/blog, they’ll see the same content as at http://www.example.com/blog. This is your first duplicate content issue.
There are different ways to handle protocols when dealing with canonical URLs. You can block all access to the secure version of your site using a robots.txt file. However, if your web server or web hosting account is set up to serve your secure site from the same files as the non-secure site, search engines will see the regular robots.txt file when going to https://www.example.com/robots.txt. The most flexible way around this is to create a version of the robots.txt file that’s used only for https and name it appropriately. Then use rewrite rules to internally redirect all requests for https://www.example.com/robots.txt to robots-ssl.txt. Search engines will still think they’re looking at https://www.example.com/robots.txt, the industry-standard location. If your site is running on an Apache web server, see apache.org’s URL Rewriting Guides and yourhtmlsource.com’s URL Rewriting Guide.
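If your site runs on Apache with mod_rewrite enabled, the internal redirect described above might be sketched like this in an .htaccess file at the document root (the file name robots-ssl.txt is just the example name used here; adapt it to your own setup):

```apache
RewriteEngine On
# When the request arrives over SSL...
RewriteCond %{HTTPS} on
# ...internally serve robots-ssl.txt in place of /robots.txt.
# [L] stops further rewriting; no external redirect is sent,
# so the search engine still sees the /robots.txt URL.
RewriteRule ^robots\.txt$ /robots-ssl.txt [L]
```

Because this is an internal rewrite rather than a 301/302 redirect, the crawler never learns that a different file was served.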
Another way to handle this will only work if all the files on your web server can run server-side scripts. Usually, this is the case with .asp, .aspx, .cfm or .php files. If you have .htm or .html files, you may be able to ask your web hosting company to allow server-side scripts to be run in these files. Once all files on your web server can run server-side scripts, make a script that checks whether the file was accessed via https, and if so, adds a robots meta tag that disallows the page from being indexed. This script needs to go in every one of your files, either by ‘including’ it or pasting it in at or near the beginning of each page. The file-include method can drastically reduce the time it takes to administer this script, since you only have to make changes in one place; all files that include the changed file are essentially updated automatically.
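The per-page check itself would live in whatever server-side language your pages use (ASP, ColdFusion, PHP). Purely as an illustration of the logic, here is a hypothetical Python sketch; the is_https flag stands in for whatever your environment exposes (e.g. the HTTPS server variable in ASP or PHP):

```python
def robots_meta_for_request(is_https):
    """Return a robots meta tag that blocks indexing when the page
    was requested over https; otherwise return an empty string.

    is_https is assumed to come from the web server's request
    environment; the tag string itself is the standard robots
    meta tag syntax.
    """
    if is_https:
        return '<meta name="robots" content="noindex, nofollow">'
    return ''
```

In the shared include file, you would emit this string inside each page’s head section; every page that includes the file picks up any future change to the script automatically.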
My next post will discuss canonical domain names, including subdomains, and URL paths. In following posts in this series, I will provide methods that maintain the usability features of the web (for both the webmaster and the visitor) yet prevent the duplicate content/canonical URL issues. Stay tuned!
Part III – How to hunt down and patch up IE6 bugs
You are coding out a web page: you fit the styles, arrange your divs, align your margins, and check it in your browser, just to be sure that everything is correct.
There’s a huge hole in your layout, and you have no idea why! What do you do?
Well, you could read Part 1 of this series and feel some solidarity with others (including me) who share your dilemma. Or you could read Part 2 and employ some of the IE hacks I suggest, add an IE-specific stylesheet, or try some different CSS tricks. But when none of that works, and you just cannot find the source of the ghost, you will need some serious tools.
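For reference, one common way to wire up that IE-specific stylesheet is Microsoft’s conditional comment syntax, which every other browser treats as a plain HTML comment (the file name ie6.css is just a placeholder here):

```html
<!-- Loaded only by IE 6 and below; other browsers skip it as a comment -->
<!--[if lte IE 6]>
<link rel="stylesheet" type="text/css" href="ie6.css" />
<![endif]-->
```

Putting this after your main stylesheet link lets the IE6 rules override just the broken bits without touching your standards-compliant layout.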
As I have mentioned before, we test on many browsers at MoreVisibility, so our machines host a range of browsers and web developer tools. Web developer tools are available for most browsers and are indispensable when troubleshooting layout. There is absolutely no way I would be sane, or even still working today, without my trusty Firefox and IE add-ons.
Firebug: http://getfirebug.com/
This fantastic little tool was introduced to me by a colleague, who convinced me to ditch my old trial-and-error ways. With one click inside Firefox, a dialog box pops up within the window, showing the HTML and CSS side by side. You can mouse over an element to bring up its code, and what’s more, CSS styles are shown in hierarchical order, giving you a view of which styles are at the top of the inheritance pile. A great asset when you can’t figure out where that underline is coming from!
Mozilla Web Developer Toolbar: http://chrispederick.com/work/web-developer/
This toolbar isn’t as universally useful as Firebug, but it does pack a host of valuable tools. You can check the actual (not stated) size of an element, use the ruler, and generate a great image report with a list of images, their sizes and URLs.
Internet Explorer Add-on
There is only one proprietary application for IE in this category, and it works for all versions of IE: http://www.microsoft.com/downloads/
It combines some of the tools available in both Firefox add-ons. You can view the HTML, but the CSS panel is often code-bloated or unspecific. There is a ruler, an outline feature, and some image options.
The IE Web Developer toolbar is essential for troubleshooting the browser, but even with its aid, you may still need to use your wiles and common sense. Happy Hunting!