Setting up a web crawler in the Google Search Appliance is a piece of cake: you enter a starting URL and some boundaries and let it rip. The GSA will spider its way around until it has found every reachable page in the site. For a well-structured site, this usually produces very good results, but not all sites are created equal. While the GSA does have features for detecting crawl loops and excessively redundant pages, it will often find significantly more pages than you expect. This can reduce the quality of your search results: the extra pages dilute the relevancy of higher-quality pages, making the desired results harder to find.
In the worst case, the GSA index will reach the licensed limit, leaving only part of your site indexed and searchable. At that point, the GSA starts evicting pages to make room for presumably better ones, but in practice the eviction algorithm is not perfect, and the pages it removes can be essentially random. Either way, neither eviction nor truncation is a happy thing, and you will want to take action to fix the problem.