Article Archive by Lee Zoumas


January 21 2010

Page Name Canonicalization and the htaccess File

by Lee Zoumas

Sometimes the same page on a website can get indexed multiple times, which can create a duplicate content issue and a potential penalty. The best example of this is a website’s default or home page…

http://www.domain.com/
http://www.domain.com/index.htm

Although both of these URLs resolve to the same page, the search engines could index both of them, or possibly one and not the other. However, situations like this are not isolated to the home page. Most websites will have default filenames in URLs within subdirectories, like this…

http://www.domain.com/about/
http://www.domain.com/about/index.htm

… which causes the same issues as the home page. Additionally, a site’s internal linking structure may sometimes link to a page with or without the index.htm filename present. To prevent the default page from being accessed by its filename, we can add the following rule to our .htaccess file:

RewriteRule (.*)(index|home|default)\.(html|asp|aspx|htm|php)$ /$1 [NC,R=301,L]

With this rule in place, when any page is requested with the default filename (index.htm) in the URL, the user will get 301 redirected to the default page without the filename. This will ensure that the default filename is never in the URL and that only one version of that page will get indexed by search engines.
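One caveat: if the server’s DirectoryIndex setting serves index.htm internally for directory requests, a rule like this can loop on some configurations (the redirect to /about/ is internally rewritten back to /about/index.htm, which triggers the redirect again). A common defensive sketch (the THE_REQUEST condition and the RewriteEngine line are additions to the rule above, not part of the original) only redirects when the filename appears in the literal request line typed by the client:

RewriteEngine On
# Only act when the default filename appears in the client's actual
# request line, not in an internal DirectoryIndex subrequest.
RewriteCond %{THE_REQUEST} \ /.*(index|home|default)\.(html|asp|aspx|htm|php)[\ ?] [NC]
RewriteRule (.*)(index|home|default)\.(html|asp|aspx|htm|php)$ /$1 [NC,R=301,L]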

December 17 2009

Domain Name Canonicalization and the .htaccess File

by Lee Zoumas

The .htaccess file is the main configuration file for URL Rewriting software, such as Apache’s mod_rewrite and Helicon’s ISAPI_Rewrite. An .htaccess file can be used to perform many different SEO-related tasks. Whether or not your web host allows the use of the .htaccess file can mean all the difference in the world when planning an SEO strategy for your website. In all of our client projects, we use the .htaccess file to perform some SEO-critical functions. One of the most important functions that the .htaccess file can perform is domain name canonicalization.
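As an aside, .htaccess directives only take effect if the server permits them; on Apache this is governed by the AllowOverride directive in the main server configuration, which is the setting a shared host controls. A minimal sketch (the directory path is a placeholder):

# In httpd.conf, not in .htaccess itself; "/var/www/example" is a placeholder.
<Directory "/var/www/example">
    # FileInfo permits mod_rewrite directives in .htaccess files;
    # "All" would permit every overridable directive.
    AllowOverride FileInfo
</Directory>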

Domain Name Canonicalization

If a domain name is not canonicalized, the same site will be served when different variations of the domain are requested. For example, consider the following URLs:

http://www.domain.com
http://domain.com

While both of the examples above resolve to the same site, they are in fact quite different: search engines may regard them as different URLs altogether. As a result, some pages may get indexed under the www version, while others may get indexed under the non-www version. One way to ensure that search engines will only index one version is to add the following rule to your .htaccess file:

RewriteCond   %{HTTP_HOST}   ^domain\.com$ [NC]
RewriteRule   (.*)   http://www.domain.com/$1 [QSA,R=301,L]

With that rule in place, when the non-www version of the site is requested, the user will be redirected to the canonicalized www version. It should be noted that this rule works not just for the homepage, but for all pages within that domain. For example:

domain.com/page1.htm

… will redirect to this…

http://www.domain.com/page1.htm

As you can see, that rule is pretty powerful. In a future post, I will demonstrate how the .htaccess file can be used for page level redirects.
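One refinement worth noting: the rule above hard-codes the domain name. A host-agnostic sketch (assuming the only goal is forcing the www prefix on whatever hostname the site answers to) avoids editing the file per site:

RewriteEngine On
# Match any hostname that does not already begin with "www."
RewriteCond %{HTTP_HOST} !^www\. [NC]
# %{HTTP_HOST} re-inserts the requested hostname into the redirect target.
RewriteRule (.*) http://www.%{HTTP_HOST}/$1 [QSA,R=301,L]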

November 12 2009

3 Things Needed to Keep Your Pages from Being Indexed by Search Engines

by Lee Zoumas

Sometimes you may have certain pages on your website that you do not want indexed by search engines. Just recently, we developed an online order form for a client that should not show up in search engines. To ensure that a page does not get indexed or crawled by search engines, it is important to do 3 things:

1. Add a rule to your website’s robots.txt file. Assuming the page you don’t want indexed is order.htm:

 User-agent: *
 Disallow: /order.htm

2. Add a “noindex, nofollow” robots meta tag in the head section of the page that you don’t want indexed or crawled:

 <meta name="robots" content="noindex, nofollow" />

3. For each link leading to the page that you don’t want indexed or crawled, add a “rel” attribute with a value of “nofollow” to the anchor tag:

 <a href="/order.htm" rel="nofollow">Order</a>

That’s pretty much it. It should be noted that, until recently, it was thought that you only had to add a rule to your robots.txt file to prevent pages from being indexed and crawled. However, Matt Cutts from Google gave some insight stating otherwise in a video posted on his blog (http://www.mattcutts.com/blog/robots-txt-remove-url/).
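For completeness, the same noindex, nofollow signal can also be sent at the HTTP level from the .htaccess file, which is useful for non-HTML files such as PDFs. This is a sketch assuming mod_headers is available (the X-Robots-Tag header is not one of the three steps above):

# Requires mod_headers; sends the noindex signal as a response header
# for order.htm without editing the page's HTML.
<Files "order.htm">
    Header set X-Robots-Tag "noindex, nofollow"
</Files>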

© 2020 MoreVisibility. All rights reserved.