Who ate Google?
When we think of the web today, it is synonymous with search engines, and three top the list: Google, Yahoo!, and Bing. But in most people’s minds, there is really only one that is the Windex of the search world, and that’s Google.
I remember back in high school how we used search engines like DogPile, which aggregated the results of multiple search engines. The rationale was that searching the web was a much different landscape back then, and no one provider seemed to do it well. You had Webcrawler, Yahoo!, and others, but their results varied, and one search would yield a cornucopia of “stuff”: maybe stuff we intended to find, and maybe stuff so far off the beaten path that you wondered to yourself, “How in the world is that relevant?” Then along came Google and the game changed.
I remember many years ago a good friend of mine saying, “Hey, have you tried Google?” My first thought was, “No, it’s just another search engine.” But lo and behold, that turned out to be the understatement of the century, as Google has grown to be the household name when it comes to searching. It has also become one of the de facto information sources for companies doing research, especially when it comes to programming errors, server errors, errors that appear in logs, and so on. Most techies are now accustomed to doing a Google search on an issue to see if the community has experienced the same problem. Google has become so ingrained in the world of troubleshooting that LifeHacker’s recent survey on Best Computer Diagnostics lists Google as the number one tool, beating out several top-rated contenders. Clearly Google has changed the way that we look at the web.
Today Google announced that they are evolving once again. This comes on the heels of Apple announcing the iPhone 4 and iOS 4 with the tagline “This changes everything. Again.” Google announced that they have added Caffeine to their indexing services, which takes them closer and closer to real-time indexing of web content. How does this change things?
I remember the days of managing tech support for a top web hosting company, when almost daily a customer would ask, “I have submitted my site to Google to be indexed, so why is it not appearing in the search results?” Many today would say, “Well, maybe you should pay for it through AdWords like everyone else”, but fundamentally this shouldn’t be necessary. Web searches should let a user find not just the glitzy, gimmicky results that companies and merchants want you to see, but also stuff that is related to your topic and provided by lesser-known sources. That’s the beauty of the web: content is no longer pushed out to users only by the people who have the most resources, the best logistics, or the most publishing power. Anyone can publish something to the web, and that information can be consumed by an indexing service and served up later when someone enters a relevant search string. So, back to the customer’s question above: what happened?
Back then, Google’s indexing services took several days, if not weeks, to index your site. Users also had to make sure that their sites were optimized for search engines (SEO) to give the indexer the best chance of picking up the right content and matching it to relevant search strings. But even if a site was optimized, the indexing process was still slow. I recall many instances where we had to look through access logs and the like to show a customer that “Yes, your site was indexed; look, the Google crawler was accessing your content.” This was very frustrating for site owners who did not have the capital to spend hundreds or thousands of dollars on professional SEO services and/or AdWords/AdSense to improve their rankings.
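For the curious, that kind of access-log check can be sketched in a few lines of Python. This is only a rough illustration, not what we actually ran back then: it assumes the common Apache/Nginx “combined” log format, the sample log lines are made up, and `googlebot_hits` is a hypothetical helper name. Keep in mind that a user-agent string can be spoofed, so a match is a hint rather than proof.

```python
import re

# Made-up sample lines in the assumed "combined" access log format.
SAMPLE_LOG = """\
66.249.66.1 - - [09/Jun/2010:10:15:32 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.7 - - [09/Jun/2010:10:16:01 -0700] "GET /about.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1)"
66.249.66.1 - - [09/Jun/2010:10:17:45 -0700] "GET /products.html HTTP/1.1" 200 6144 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"""

# One named group per field we care about: client IP, timestamp,
# requested path, HTTP status, and the user-agent string.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def googlebot_hits(log_text):
    """Return (time, path, status) for each request whose user agent mentions Googlebot."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits.append(
                (match.group("time"), match.group("path"), int(match.group("status")))
            )
    return hits


if __name__ == "__main__":
    for time, path, status in googlebot_hits(SAMPLE_LOG):
        print(f"{time}  {status}  {path}")
```

Seeing the crawler fetch a page with a 200 status only shows that the page was visited. It says nothing about when, or whether, the page will show up in search results, which was exactly the frustration.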

How does Caffeine change all of this? For the technically minded, this is best explained on the Google Blog. For everyone else, what Google has done is change the chore of indexing the web from processing large chunks of information, which was tedious and time consuming, to processing much smaller and more frequent chunks, so that content that would otherwise take much longer to be indexed gets refreshed sooner. Just how much information are we talking about? Generally speaking, a webpage by itself (omitting all the funky add-ons such as Flash, movies, and publication-grade images) is only several KB. From what Google is saying, Caffeine has the ability to index hundreds of thousands of gigabytes of data a day! Taking 100,000 GB as the low end and assuming that each webpage is about 5 KB in size (with 1 GB counted as 1,048,576 KB), this is about 20,971,520,000 individual webpages a day! Phew! Needless to say, that’s a lot.
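The back-of-the-envelope figure above is easy to reproduce. Here is a quick sketch of the arithmetic; the 20,971,520,000 number works out if you take 100,000 GB as the daily volume and count 1 GB as 1024 × 1024 KB. Both of those, along with the 5 KB average page size, are assumptions for illustration and not numbers published by Google, and `pages_per_day` is just a name made up for this example.

```python
# Binary units: 1 GB = 1024 MB = 1024 * 1024 KB.
KB_PER_GB = 1024 * 1024


def pages_per_day(gigabytes_per_day, avg_page_kb):
    """Estimate how many pages fit in a daily indexing volume (integer division)."""
    return gigabytes_per_day * KB_PER_GB // avg_page_kb


if __name__ == "__main__":
    # Assumed inputs: 100,000 GB per day, 5 KB per page.
    estimate = pages_per_day(100_000, 5)
    print(f"{estimate:,} pages per day")  # 20,971,520,000 pages per day
```

With decimal units (1 GB = 1,000,000 KB) the same inputs give an even 20,000,000,000 pages, so either way we are talking about roughly twenty billion pages a day.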
Given the size of the databases needed to store all this information and the size of their server farms, it does raise the question: where’s the upper limit here? How much of the web can Google store? At some point, the web will effectively live in the Google cloud. How are other search providers going to compete with this level of near real-time indexing? And reading between the lines, is this a way for Google to improve their ability to aggregate news and pull in other news sources, so that breaking news propagates in something closer to real time?
Just when we thought the ceiling had been hit, Google took a ten-ton hammer and smashed it to bits. Now that the bar has been raised, I wonder: who has the ability to beat Google, or is this just the beginning of an empire destined to rule the web for the next several decades?