Behind the scenes, there are a lot of components that keep a website up and running. It’s possible, and maybe even practical (depending on your business situation), to launch and run your website without many features enabled. Doing so will give you a functioning website, but it won’t give you an optimal one. Here are just a few examples of features you should be using on your site for SEO:
XML Sitemap: Take a look at any well-optimized site, and you won’t have to look far to find a sitemap. It’s one of the most basic features you can add to your site for optimization. The sitemap lets search engine bots crawl your site more easily by providing a list of available URLs. Making your site easier to index helps it rank better. If you run a large website that updates frequently, make sure your CMS is set to update the sitemap automatically when new content is posted.
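For reference, here is what a minimal XML sitemap looks like (the URL, date, and frequency are just placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/some-page.htm</loc>
    <lastmod>2009-06-01</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```

Save it as sitemap.xml at the root of your domain so crawlers can find it.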
Robots.txt: You’ll want a robots.txt file for much the same reason as the XML sitemap — to help crawlers do their job more efficiently. Google places a limit on the number of pages it will crawl on your site. This can be problematic if you are running a large site — Google may never look at some of your pages. Furthermore, you cannot tell Google which pages to index. You can, however, tell it which pages to ignore by using a robots.txt file. This increases the chances that Google will index only the important pages of your site.
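As an example, a robots.txt file (placed at the root of your domain) that keeps all crawlers out of a couple of hypothetical low-value directories would look like this:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /search-results/
```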
Google Analytics: Although not necessary for the operation of your site, a Google Analytics account is essential from a marketing perspective. If you want to measure your growth, your impact on the Internet, or the success of a campaign, analytics is the best way to do it. Furthermore, the sooner you set up an account, the sooner you can begin collecting data to refer back to when running future campaigns.
Sometimes you may have certain pages on your website that you do not want indexed by search engines. Just recently, we developed an online order form for a client that should not show up in search engines. To ensure that a page does not get indexed or crawled by search engines, it is important to do three things:
1. Add a rule in your website’s robots.txt file. Assuming the page you don’t want indexed is order.htm:
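Assuming order.htm sits at the root of the site, the rule would look like this:

```
User-agent: *
Disallow: /order.htm
```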
2. Add a “noindex, nofollow” robots meta tag in the head section of the page that you don’t want indexed or crawled:
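The meta tag goes between the head tags of the page:

```html
<head>
  <meta name="robots" content="noindex, nofollow" />
</head>
```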
3. For each link leading to the page that you don’t want indexed or crawled add the following “rel” attribute to the anchor tag with a value of “nofollow”:
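For example, a link to the order form (the link text here is just an example) would become:

```html
<a href="/order.htm" rel="nofollow">Place your order</a>
```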
That’s pretty much it. It should be noted that, until recently, it was thought you only had to add a rule to your robots.txt file to prevent pages from being indexed and crawled. However, a robots.txt rule only prevents crawling, not necessarily indexing — Google can still list a blocked URL if other pages link to it. Matt Cutts from Google gave some insight on this in a video posted on his blog (http://www.mattcutts.com/blog/robots-txt-remove-url/).
While writing blog posts and documentation, I have often used example.com to stand in for any domain name. An Internet standard published in 1999 (RFC 2606) reserves example.com (as well as example.org and example.net) for documentation purposes. So if you were to click on a link to http://www.example.com in one of my posts, you wouldn’t see an actual website, just a placeholder page. Click on this link to see for yourself.
I’d like to demonstrate a fun little trick you can use to amaze your friends.
The page you see when you go to http://www.example.com looks completely indexable by the search engines. There’s not a lot of content, but you would think the engines would have indexed it exactly as your browser shows it to you. It turns out, however, that there is a robots.txt file blocking all spiders from all content inside www.example.com. (If you ever forget how to create a basic robots.txt file, you can use this one as a guide.) Alright, now for the punch line. Let’s see what the search engines really have indexed for http://www.example.com. Go to www.google.com and type “site:example.com” (without the quotes). What do you see? If you see only one result, click on the link that says “repeat the search with the omitted results included.”
I see 10,400 results now. There are pages like example.com/blah/ and www.example.com/concepts. Unfortunately, the Google search results page does not link to a cached version of any of these results, so we can’t see what exactly Google has indexed from these pages, but we can go to the pages ourselves. Well, I tried that, and every page I go to comes back with a “Not Found” error. It’s logical to conclude that those pages never existed, but also notice that some of the results were crawled by Google within the past few hours. Impossible, no?
You can try this search on other search engines too.
My feeling on this strange phenomenon is that it could be either Google’s own testing, or other people testing or somehow tricking Google into adding these pages to its index. It may also be limited to certain data centers.
Whatever is causing this, I’m sure Google knows about it, but doesn’t feel the need to do anything about it. This phenomenon may also get you thinking about how search engines are supposed to work.