Articles in the SEO & Technology Category

Fixing un-canonical URLs. Oh joy! Part 2

August 19th, 2008 by Jordan Sandford

Welcome back to my series on fixing canonical URL issues. Here again are the areas of canonical URL issues:

1. Protocols (http and https)
2. Domain and subdomain names (sometimes referred to as host names)
3. URL paths
4. File names
5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
6. Query strings
7. Combinations of any or all of the above

In my last post, I discussed how protocols (http and https) can present un-canonical URLs to the search engines and how they can create duplicate content. Let’s pick up where we left off.

You have two domain names, example.com and example.biz. You want traffic to example.biz to see content at example.com. Your hosting company set up your web hosting account on their servers so that visitors to www.example.com and visitors to example.com are served the same files (that way you don’t have to maintain two versions of, say, the Home Page). This is the default way most hosting companies create new accounts.

To fix canonical URL issues related to different top-level domains (e.g. edu, com, org, us; look out for “anything goes” top-level domains as well), different domains and/or different (or no) subdomains, you have two options. You can set up your server to show content from the non-canonical domain(s) to the visitor while keeping that content out of the search engines’ indexes (using a robots.txt file or the robots meta tag), or you can properly redirect both visitors and search engines to the canonical domain. For the redirect option, first choose which subdomain/domain/top-level domain combination you want to be canonical. Then set up the web servers or hosting accounts that host the non-canonical domains to ‘301 redirect’ to the canonical domain, using the same kind of rewrite rules or the ‘include’ method I previously discussed. (In a future post, I will discuss URL rewriting on Apache servers and compare it to URL rewriting on Windows servers.) Be aware that if you bought multiple domain names from a registrar, only your canonical domain may actually be hosted, while the other domains may be using the registrar’s ‘forwarding’ service to redirect to your canonical domain. If you use a forwarding service or even a ‘301 redirect’ feature, the provider may not implement the 301 redirect consistently or properly. I am speaking from first-hand experience with well-known hosting companies.
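The redirect decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the rewrite rules or the ‘include’ method themselves. The choice of www.example.com as the canonical host and the function name canonical_redirect are assumptions made for this example, and the sketch ignores port numbers.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: www.example.com was chosen as the canonical host.
CANONICAL_HOST = "www.example.com"


def canonical_redirect(url):
    """Return (301, canonical_url) if the host is not canonical, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == CANONICAL_HOST:
        return None  # already canonical, serve the page normally
    # Keep the protocol, path and query string; swap only the host name.
    target = urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, "")
    )
    return 301, target


if __name__ == "__main__":
    print(canonical_redirect("http://example.biz/blog/?p=1"))
    print(canonical_redirect("http://www.example.com/blog/"))
```

Whatever server or language you actually use, the point is the same: every non-canonical host answers with a 301 status and the matching URL on the canonical host.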

You were categorizing your pages and realized that you accidentally placed a page in both the /blog/Colors directory and the /blog/shapes directory. This could happen from physically copying a file to another directory, or perhaps you are using a blogging application (or any web application for that matter) and categorized a post in two categories. In the latter case, if the blogging application does not handle cross-posts in an SEO-friendly way, you might have duplicate content issues.

As far as the URL path goes, it would be a good idea to know which URLs have duplicate content of other URLs on your site. If you don’t know, try the tools offered by Google’s Webmaster Tools. Different web applications and different types of web applications (such as blog software from vendor A vs. vendor B or CMS software from vendor C) handle canonical URL paths differently. Check what web application is powering the pages at those URLs. Also see if there are add-ons or plugins for your software that can handle duplicate content issues created from assigning multiple tags or categories to content. They could add a robots meta tag to one of the duplicate pages.

Another way to handle this is to 301 redirect one of the duplicate pages to the other. In a pinch, and depending on the URL structure of your site, you may be able to use the robots.txt file to exclude a certain section or an exact URL of your site from the indexes, thereby removing the duplicate content while making sure that no other page in that section is removed from the index. Be careful which URL you redirect from or block using the robots.txt file, because one of those URLs might be better optimized than the other.
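To see how a narrow robots.txt rule behaves, here is a small sketch using the robots.txt parser from Python’s standard library. The file name my-post.html and the helper is_crawlable are made up for this example. Note that this parser matches rules by URL prefix, and that it only reports whether a well-behaved crawler may fetch a URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only the duplicate copy of one post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/shapes/my-post.html
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt lets any crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    for path in ("/blog/shapes/my-post.html",
                 "/blog/shapes/other-post.html",
                 "/blog/Colors/my-post.html"):
        print(path, is_crawlable("http://www.example.com" + path))
```

Only the duplicate under /blog/shapes is blocked; the copy under /blog/Colors and the rest of the /blog/shapes section stay open to crawlers.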

Please remember to read my next post in this series about file names, the mysterious forward slash at the end of URLs and case sensitivity in URLs. Check back soon!

Posted in SEO & Technology | No Comments » |

Some Top Tools for SEO Newbies

August 14th, 2008 by Darren Franks

What does one do when trying to learn the complex, ever-evolving world of SEO? Is it as hard as explaining what an SEO engineer is to your friends and family? Definitely not, as there are some awesome free tools both to guide you through the world of SEO and to help you make the most search-engine-friendly site possible.

The vast array of “stuff” you run into when looking for the best SEO learning experience is amazing to me.

While searching through the milieu of learning tools, I stumbled across some keepers. These may seem basic and rudimentary to most of the seasoned pros out there, but they serve as a great foundation for those newer to the field.

I avoided the obvious suggestion of just reading “Search Engine Optimization For Dummies” as I’ve never read it, nor do I plan to, but in no particular order here are some top tools for SEO newbies:

Lynda.com (visual online training)

For a mere $25 a month, you get access to a whole suite of techie articles, but for the purposes of this blog post, the SEO class by Richard John Jenkins is really informative. The entire class lasts about 9 hours and you can learn at your own pace. The class itself is a few years old (he talks about Yahoo Overture as something that has yet to be phased out), but that aside, it’s a great learning tool. Mr. Jenkins talks in layman’s terms, so it’s easy to follow for all the newbs.

Webconfs.com

A very cool website that combines tutorials and free SEO tools. There’s even an SEO quiz! Tools include Search Engine Spider Simulator, Backlink Checker and even a Screen Resolution Simulator!

Sempoinstitute.com

Everyone knows about this I’m sure. Created to provide education to the industry and to promote growth in the field, it’s a great site for both beginners and veterans.

I could go on and on about the plethora of sites out there, but these are a few you should find helpful!


Fixing Un-Canonical URLs. Oh joy! Part 1

July 25th, 2008 by Jordan Sandford

In my previous post, Filenames, host names and canonicalization, oh my!, I talked about how duplicate content issues affect your search engine rankings, and specifically how un-canonical URLs create these issues. I mentioned different types of URL canonicalization issues you are likely to deal with in your SEO work. Since that post, I created a new, more complete list of canonical URL issues, and I will go into a bit more detail in this and future posts describing how the issues actually arise and how to fix them.

Let’s start with a detailed guide on how to pronounce canonical: ca • NON • uh • cul. Also, canonicalization is pronounced: ca • non • uh • cul • i • ZAY • shun. That was easy!

Without further ado, here is the new list of areas of canonical URL issues:

1. Protocols (http and https)
2. Domain and subdomain names (sometimes referred to as host names)
3. URL paths
4. File names
5. Case sensitivity (when myPage.html is handled differently than MYPage.HTML)
6. Query strings
7. Combinations of any or all of the above

Let’s break these down using an example URL:

https://example.org/blog/Colors/default.asp?action=go&sessid=6468439#section3

1. https is the protocol
As far as search engines go, there is only http and https.

2. org is the top level domain

3. example.org is the domain name
No subdomain is used. (Technically, www is a subdomain just like any other subdomain.)

4. /blog/Colors/default.asp is the URL path (everything after the top level domain, including the file name, and before the query string, if any)

5. default.asp is the file name
Many times, the URL path may not contain the file name at all and will just end with the forward slash (e.g. /blog/Colors/).

6. ?action=go&sessid=6468439 is the query string (everything after the question mark and before the # sign, if any)
The query string is made up of parameters and their values. “action” is a parameter and “go” is its value.

7. #section3 is the URL fragment (also known as a named anchor or bookmark)
A URL fragment tells the web browser where inside a specific web page to go or scroll to. It does not tell the web browser (or the search engine) which specific page to go to, and therefore does not contribute to any URL canonicalization issues.
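The breakdown above can be reproduced with Python’s standard urllib.parse module. This is just a sketch for checking your own URLs. The function name break_down and the labels in the returned dictionary are my own, and taking the text after the last dot as the top-level domain is a simplification.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit


def break_down(url):
    """Split a URL into the parts discussed in the list above."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "protocol": parts.scheme,
        # Simplification: treat the text after the last dot as the TLD.
        "top_level_domain": host.rsplit(".", 1)[-1],
        "host_name": host,
        "url_path": parts.path,
        # Empty when the path ends with a forward slash.
        "file_name": posixpath.basename(parts.path),
        "query_string": parts.query,
        "parameters": parse_qs(parts.query),
        "fragment": parts.fragment,
    }


if __name__ == "__main__":
    example = ("https://example.org/blog/Colors/default.asp"
               "?action=go&sessid=6468439#section3")
    for name, value in break_down(example).items():
        print(f"{name}: {value}")
```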

To illustrate each of these, here is perhaps the worst-case scenario as it relates to canonical URLs (I will continue to add to this scenario from post to post in this series):
You have a site that has forms visitors fill out. You have an SSL certificate and people can go to https://www.example.com to see your site. Now, you want to give your visitors peace of mind by letting them know that your site is secured (by making sure that browsers show the lock icon that indicates a secure site). When the https version of your site is set up, the web server may pull files from the non-secure section, but send them back to the browser over a secure connection. (This ‘usability feature’ provides the convenience of only having to maintain one set of files instead of two sets.) Now, when visitors go to https://www.example.com/blog, they’ll see the same content as http://www.example.com/blog. This is your first duplicate content issue.

There are different ways to handle protocols when dealing with canonical URLs. You can block all search engine access to the secure version of your site using a robots.txt file. However, if your web server or web hosting account is set up to serve your secure site from the same files as the non-secure site, search engines will see the regular robots.txt file when going to https://www.example.com/robots.txt. The most flexible way to work around this is to create a version of the robots.txt file that’s used only for https and name it appropriately (robots-ssl.txt, for example). Then use rewrite rules to internally redirect all requests for https://www.example.com/robots.txt to robots-ssl.txt. The search engines will still think they are looking at https://www.example.com/robots.txt, which is the industry-standard location. If your site is running on an Apache web server, see apache.org’s URL Rewriting Guides and yourhtmlsource.com’s URL Rewriting Guide.
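The internal redirect itself belongs in your web server’s rewrite rules, but the mapping it performs is simple enough to sketch in Python. The function names here are hypothetical; only the robots-ssl.txt file name comes from the text above.

```python
def robots_file_for(scheme):
    """Pick the robots file to serve based on the request protocol."""
    return "robots-ssl.txt" if scheme.lower() == "https" else "robots.txt"


def resolve(scheme, path):
    """Map a requested path to the file actually served (internal rewrite).

    The visitor or search engine still sees /robots.txt in the URL;
    only the file read from disk changes.
    """
    if path == "/robots.txt":
        return "/" + robots_file_for(scheme)
    return path


if __name__ == "__main__":
    print(resolve("https", "/robots.txt"))
    print(resolve("http", "/robots.txt"))
    print(resolve("https", "/blog/"))
```

Every other path passes through unchanged, so the rest of the secure site keeps using the same files as the non-secure site.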

Another way to handle this will only work if all the files on your web server can run server-side scripts. Usually, this is the case with .asp, .aspx, .cfm or .php files. If you have .htm or .html files, you may be able to ask your web hosting company to allow server-side scripts to be run in these files. Once all files on your web server can run server-side scripts, make a script that checks if the file was accessed via https and, if so, adds a robots meta tag that disallows the page from being indexed. This script needs to go in every one of your files, either by ‘including’ it or by pasting it in at or near the beginning of the page. The file-include method can drastically reduce the time it takes to administer this script, since you only have to make changes to the script in one place; all files that include the changed file are essentially updated automatically.
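Here is a rough sketch of what such a script does, written in Python purely for illustration. On your own server it would be written in whatever language your pages use (ASP, ColdFusion, PHP and so on). The function names and the exact tag contents are assumptions made for this example.

```python
from html import escape

# Tag asking search engines not to index the page.
NOINDEX_TAG = '<meta name="robots" content="noindex">'


def robots_meta_for(scheme):
    """Return the noindex tag for https requests, nothing for http."""
    return NOINDEX_TAG if scheme.lower() == "https" else ""


def render_head(title, scheme):
    """Build a page's <head>, playing the role of the shared include file."""
    lines = ["<head>", f"<title>{escape(title)}</title>"]
    meta = robots_meta_for(scheme)
    if meta:
        lines.append(meta)
    lines.append("</head>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_head("Contact us", "https"))
    print(render_head("Contact us", "http"))
```

Because every page builds its head section through the one shared function, changing the tag in that one place changes it everywhere, which is the same benefit the file-include method gives you.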

My next post will discuss canonical domain names, including subdomains, and URL paths. In following posts in this series, I will provide methods that maintain the usability features of the web (for both the webmaster and the visitor) yet prevent the duplicate content/canonical URL issues. Stay tuned!

