Duplicate content is a hot topic and has been for quite a while. It is also one of the most misunderstood issues in search engine optimization. Many webmasters and even some search marketers spend an extraordinary amount of time and resources trying to avoid the dreaded “Duplicate Content Penalty”, when in fact a penalty derived from duplicate content is fairly rare and reserved specifically for sites which have been observed trying to manipulate search engine rankings directly; i.e. search engine spammers.
The more common issue associated with duplicate content found by search engines is the “Duplicate Content Filter”. When a search engine finds two or more pages with identical or nearly identical content, it applies a filter which allows only one instance of the content to be returned in search results. This is in no way a penalty and does not affect the site as a whole, just the specific page as it relates to the specific search query. The goal of the search engines is to provide their users with “unique” content, and this filter helps to ensure each page returned in the search results is unique.
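To make the idea of a filter concrete, here is a minimal sketch in Python of how near-identical pages could be collapsed so that only one is returned. It is an illustration only and not how any search engine actually does it: the word-shingle comparison, the Jaccard similarity measure, the 0.8 threshold, and the function names are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_duplicates(pages, threshold=0.8):
    """Keep the first page of each group of near-identical pages.

    pages is a list of (url, text) pairs, assumed to be in ranked order.
    The threshold is an arbitrary value picked for this sketch.
    """
    kept = []  # (url, shingle set) for every page that survives the filter
    for url, text in pages:
        sig = shingles(text)
        if all(jaccard(sig, kept_sig) < threshold for _, kept_sig in kept):
            kept.append((url, sig))
    return [url for url, _ in kept]
```

Note that the dropped page is not punished in any way in this sketch. It is simply left out of this one result list, which mirrors the difference between a filter and a penalty.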
In the past couple of weeks Google has published an article with some very specific information on how it sees and handles duplicate content as well as some bullet points on issues to watch for concerning duplicate content. Additionally, another new US Patent relating to identifying and handling duplicate content has been granted to Google.
“During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in “regular” and “printer” versions and neither set is blocked in robots.txt or via a noindex meta tag, we’ll choose one version to list.”
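If you would rather choose which version is listed yourself, one option mentioned in the quote is to block the printer versions in robots.txt. The short Python sketch below uses the standard library's robots.txt parser to check that such a rule does what you expect. The example.com URLs, the /print/ directory, and the is_crawlable helper are hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a /print/ directory for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/articles/duplicate-content",
        "https://example.com/print/duplicate-content",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url, "Googlebot"))
```

The other option named in the quote, the noindex meta tag, is placed in the head of the page itself, typically as a meta element with name="robots" and content="noindex", and is not covered by this check.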
In the first week of this year, Google was awarded a U.S. patent related specifically to the identification of duplicate content, titled “Methods and apparatus for estimating similarity”. Two other Google patents related to duplicate content are “Detecting duplicate and near-duplicate files” and “Detecting query-specific duplicate documents”. While each of these documents is extremely technical in nature, some good information can be gleaned by taking the time to read and understand them.
The primary point to keep in mind when talking about or evaluating your site for duplicate content is “intent”. Perceived intent will be the determining factor in the engine’s choice between “Filter” and “Penalty”. While both have similar short-term effects on a site, a filter can easily be overcome by addressing the issues causing the duplicate content. A penalty, however, requires that the problems be addressed and that the engine’s re-inclusion process be completed, which can be a lengthy and frustrating endeavor.