How Google Works
Posted June 18th, 2012 by Nate
What really goes on behind the scenes when you Google? There are three ways to answer that question. The simplest, which is suitable for very small children or the average CEO, is that Google looks through every website in the world and finds the ones that Google thinks you’re looking for. This is very wrong, of course, but just try explaining why the sky is blue to the average CEO.
A more reasonable answer is that Google takes a quick snapshot of every website in the world, paying special attention to the important words and phrases. When you type in one of those special words or phrases, Google looks through the snapshots and gives you the closest matches, whose rank is calculated based on a Secret Google Algorithm known only to a select group of Ascended PageRank disciples at the Google Temple in Mountain View, California.
Okay, minus the pseudo-religious / Illuminatus undertones, this is actually a good enough explanation to satisfy many professionals whose jobs depend on understanding how Google works. But it’s still alarmingly oversimplified. The third answer takes a little longer.
Step One: Release the Robot Spiders.
Google deploys Googlebot, which is Google’s name for its automated web crawler or “spider,” tasked with scouring the web for pages and links. The spiderweb analogy is fairly appropriate, as the spider crawls from site to site, building a web of links from one page to another, and collecting keyword info and other relevant data about each page it visits.
The first dash of Google’s Secret Sauce goes into the Googlebot’s programming. An algorithm determines which sites the spider visits, how often, and how “deep” the spider dives into the site (i.e., how many pages the bot will look at from the site before moving on). The more pages your site has, the deeper the Googlebot needs to burrow, and the less often the spider will bother to try. Big sites may get a spider visit once a month; smaller sites may be infested once every day, or even more often.
Another key element is change. If your site doesn’t appear to have been updated in some time, the Googlebot will lose interest and visit far less often. Sites that change constantly get frequent “fresh crawls,” but these tend to be much more superficial than deep crawls, so if you have a newspaper site or stock ticker, don’t expect every page on your site to get a deep crawl every hour.
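To make the idea of crawl “depth” concrete, here is a minimal Python sketch of a spider. It is not Googlebot’s actual logic, which Google doesn’t publish. It assumes a made-up, in-memory “web” (the TOY_WEB dictionary and its example URLs are invented for illustration) and it measures depth as the number of link hops from the starting page, which is only one simple way to interpret it.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/team"],
    "a.example/team": ["a.example/team/nate"],
    "a.example/team/nate": [],
    "b.example/": ["a.example/"],
}


def crawl(web, seed, max_depth):
    """Breadth-first crawl from `seed`, following links at most `max_depth` hops.

    Returns the web of links the spider built: {url: [urls it links to]}.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    link_graph = {}
    while queue:
        url, depth = queue.popleft()
        links = web.get(url, [])
        link_graph[url] = list(links)
        if depth == max_depth:
            continue  # record this page, but don't follow its links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return link_graph


if __name__ == "__main__":
    for depth in (1, 3):
        pages = crawl(TOY_WEB, "a.example/", max_depth=depth)
        print(depth, sorted(pages))
```

With max_depth=1 the spider records only three pages and never reaches the team pages; with max_depth=3 it reaches all five. A real crawler would also have to decide how often to come back, which this sketch leaves out.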
Indexing the Internet.
When Google’s flock of flying monkeys return to the castle, they bring vast amounts of data to be crunched. This is when “indexing” occurs, when the data gets dumped to Google’s massive banks of servers. The focus then turns toward getting that data into proper shape for storage in the database. Crucially, the keywords’ frequency and placement on the page are also noted. For example, think about searching for “hammers”. A page that uses the word “hammer” once, in the second-to-last paragraph, is probably not as useful as a page that uses the word fifteen times, beginning with the title at the top of the page. In fact, the title gets an extra glance from Google, as does any other “metadata” that describes the content on the page. This comes in handy when figuring out how to rank pages, but we’ll get to that in a moment.
To speed things up, Google tosses out common and generally useless short words (don’t bother searching for “the” and “or” unless it’s a necessary part of a title or phrase), along with many numbers and punctuation marks. It also turns capital letters into lower-case ones, and comes up with likely misspellings of words to help those who are pr0wn to typos.
A weakness in this method (depending on how you look at it) is that not all content can be crawled and indexed so easily. For years, Adobe PDFs were unreadable by Googlebot, so many important web documents were simply not being indexed. Through a combination of metatags and OCR, Google Search now includes PDFs (not all, but many). A similar problem persists with a number of other “opaque” elements, from entire websites powered by Flash to many types of embedded multimedia (or “rich media”) content. What Googlebot can’t read, your site doesn’t get any credit for.
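The indexing steps above (lower-casing, tossing out common words, counting how often a keyword appears, and giving the title an extra glance) can be sketched in a few lines of Python. This is a toy illustration, not Google’s index: the stop-word list, the TITLE_WEIGHT value of 3, and the example pages are all assumptions made up for the sketch, and it skips the misspelling handling entirely.

```python
import re
from collections import Counter, defaultdict

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}
TITLE_WEIGHT = 3  # hypothetical boost for a keyword that appears in the title


def tokenize(text):
    """Lower-case the text, drop numbers and punctuation, remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(pages):
    """pages: {url: (title, body)} -> index: {keyword: {url: score}}."""
    index = defaultdict(dict)
    for url, (title, body) in pages.items():
        scores = Counter(tokenize(body))      # keyword frequency in the body
        for term in tokenize(title):
            scores[term] += TITLE_WEIGHT      # the title gets an extra glance
        for term, score in scores.items():
            index[term][url] = score
    return index


def search(index, query):
    """Return URLs that contain any query keyword, best score first."""
    totals = Counter()
    for term in tokenize(query):
        for url, score in index.get(term, {}).items():
            totals[url] += score
    return [url for url, _ in totals.most_common()]


if __name__ == "__main__":
    pages = {
        "tools.example/hammers": (
            "Hammer Buying Guide",
            "A hammer is a tool. Pick the hammer that fits.",
        ),
        "blog.example/shed": (
            "My Garden Shed",
            "I keep a rake, a hose and one hammer in the shed.",
        ),
    }
    index = build_index(pages)
    print(search(index, "the Hammer"))
```

Here the buying guide scores 5 for “hammer” (two uses in the body plus the title boost) and the shed post scores 1, so the guide is listed first. Note that this sketch treats “hammer” and “hammers” as different words; matching the two would need an extra step.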
Serving the Search Meal.
When you type in your search terms, Google compares them to the data in the index, goes back to grab a copy of the whole page that the index refers to, and spits out a SERP (Search Engine Results Page) consisting of links to, and brief descriptions of, the relevant pages. The lookup is fast thanks to parallel processing, which throws any given search at thousands of computers at once, allowing them to split the task between them and come up with answers much more quickly.
The order in which the results are listed is the main ingredient in Google’s Secret Sauce: a ranking algorithm, of which the link-based PageRank score is the best-known part, that Google says consists of over 200 individual factors. Between what Google chooses to tell us and what has been reverse-engineered by SEO pros and curious hackers (not always an either / or population), we’ve learned some of the important factors. But many others are still a mystery, and Google infamously changes the recipe every so often to preserve the ‘organic’ nature of the results.
- The number of links is crucial, as is whether they are inbound or outbound links. Think about it this way: a link is a reference to somebody else’s info, an admission that someone else knows more about a subject than you do. So a page which has many outbound links to other pages is not as authoritative as a page which has plenty of inbound links but is stingy with its outbound links.
- Obviously, this is not the whole story; after all, you probably wouldn’t think much of an academic thesis paper if it didn’t back up its assertions with proper references. “Dangling links,” or links that go to a page without any outbound links, are often actually worse for your PageRank.
- Google tries its best to determine whether the links are relevant, useful, and valid, rather than simply unnecessary links or intentional spam links meant to game the system. It also takes into account where on the page they are located, how they “interact” with other links on the page, and also how they interact with keywords.
- Keywords (single words as well as phrases) are probably equally important. As with links, their order and placement make a big difference to where a page ranks. Few people know the exact way that Google ranks keywords, but matching the order of the words in the search and keeping keywords close to one another seems to be a safe bet.
- Keyword placement in the content area carries a different weight than placement in structural locations. Structural locations include the crucial URL, title, header, subheadings, and footer areas, as well as links TO the page.
- Date is quite important, too. Pages with a long history are much more likely to appear at the beginning of the SERP. Amazon will always beat that blog site that you registered yesterday, but due to popularity, relevance, and relocation on new servers, pages like Tim Berners-Lee’s 1990 NeXT site might be less likely to appear than a viral website from 2011.
- Change is just as important at this stage of the game as it was with Googlebot. And we don’t mean ANY little change; simply throwing in a few synonyms or an extra word here and there won’t fool Google. They’re looking for substantive changes in the content.
- In fact, spammy activities of all kinds risk Google’s wrath. The quickest way to get bumped way down the list, or thrown off the rankings entirely, is to engage in activities that Google considers suspicious or downright manipulative. Google can’t examine every site for such nefarious acts, but its attention is drawn by sites that move up the rankings too quickly, or pick up too many links in too short a time. “Keyword stuffing” is another no-no; Google will penalize any page that is filled with nothing but repetitions of keywords on Google’s list.
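The link-counting idea from the list above can be sketched with the textbook version of the PageRank calculation, as Google’s founders originally published it. This is a small Python illustration only: it covers just the link part of the story, not the 200-odd other factors, and it is certainly not what Google runs today. The damping value of 0.85 is the one usually quoted for the published formula; the page names and the choice to share out the score of pages with no outbound links evenly are assumptions made for the sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated passes over the link graph.

    links: {page: [pages it links to]}
    Returns {page: score}, with all scores adding up to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Score held by pages with no outbound links is shared out evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page splits its current score among the pages it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = {
        "hub": ["a", "b"],
        "a": ["popular"],
        "b": ["popular"],
        "popular": [],
    }
    scores = pagerank(toy_links)
    for page in sorted(scores, key=scores.get, reverse=True):
        print(page, round(scores[page], 3))
```

In this toy graph the page called “popular” comes out on top because two other pages link to it, while “hub”, which nobody links to, ends up at the bottom no matter how many links it hands out.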
In short, those who want a better ranking should concentrate first on providing quality content before trying to optimize the content to get better placement on the SERP. Just as there are Black- and White-Hat Hackers, there are Black- and White-Hat SEO people; you can usually tell the difference by who complains most when Google changes the algorithm.
So there is actually a fourth answer to the question “How does Google Search work?” That answer is really only known by a handful of people who possess both intimate personal knowledge of Google’s secret sauce and the brains to understand all of the complex relationships between the elements. But mere mortals simply need to know why the sky is blue, not how to calculate the exact scattering of each wavelength of sunlight as it passes through the Earth’s atmosphere.