Google Search, SearchGPT, and SerpAPI
Everyone always forgets about Perplexity - The risk of the (wo)man in the middle.
The letters on the screen read “ModuleNotFoundError: No module named 'pylibcudf'”. I check the clock at the top of the screen. It’s 11 PM. Still, nothing works. I have no clue what pylibcudf does. Amidst the random assortment of letters lies the potential for meaning; chaos holds within it the seed of creation, waiting only for the right mind to find it. But today is not that day. Muscle memory. CTRL-T, “G”. The prompt prefills with “google.com”. Backspace. CTRL-V.
No, Google. I really meant “pylibcudf”. Pages solving for “ModuleNotFoundError” populate the screen. Stack Overflow. Pythonforum. Stack Overflow. GitHub. Reddit. I put “pylibcudf” into quotation marks.
Seriously, Google, I meant “pylibcudf”. Where can I learn about this package? I click on the second link. Stack Overflow. I read. Copy some code. It’s for cuda11. I paste it into Jupyter and adjust it to 12. I run into a version incompatibility. Uninstall all of cuda11 and ensure there is a clean cuda12x installation. It works.
My query has become one of Google’s 3.5 billion daily search queries.
Search
Over the last decades (that’s centuries in Internet terms), Google Search has been an integral part of my life. At some point, Google had become so good that I can’t even remember the last time I had to click through to page 2 of the search results. But Search has become increasingly unreliable. SEO vendors have become more sophisticated and capable of raising their rankings with junk content. At the same time, content on the Internet has exploded to billions of web pages.
What is Internet Search if you start from first principles?
At its core, the goal of Internet Search is to identify and organize information from credible sources and present it in a way that best matches the user's intent (more on that a bit later). From first principles, Internet Search rests on three broad activities: Information Retrieval, Indexing, and Ranking.
Information retrieval starts with systematically scanning websites ("Crawling"), where bots collect data on each page's content, structure, and metadata. Site operators have noted that AI companies do the same thing, only more aggressively. Since the world produces new and interesting events every second of every day, the whole space has to be re-crawled regularly to keep up. Because AI companies already crawl websites, if at low frequency, to gather data, offering indexing and ranking as value-adding services to users is, from a business strategy perspective, just a few steps of vertical integration away.
Google search traffic for this Substack. Shows how terrible I am at SEO.
"Indexing" organizes the crawled data for quick retrieval. Each page is assigned keywords (“tokens”) that relate to its content, and these are linked back to the page's URL. These keywords can be found in the meta-tags of a website, but I would assume that most search companies have their own method. In recent years, keyword-based ranking functions like BM25 have been complemented by Embeddings, which match on meaning instead of exact words, giving birth to Semantic Search.
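To make the "tokens linked back to a URL" idea concrete, here is a minimal sketch of an inverted index. It is not how Google or anyone else actually does it; the pages, the URLs, and the helper names (`tokenize`, `build_index`, `lookup`) are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Start with the first token's pages, then intersect with the rest.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical pages, invented for this example.
pages = {
    "https://example.com/a": "Fixing ModuleNotFoundError for pylibcudf on CUDA 12",
    "https://example.com/b": "ModuleNotFoundError when importing numpy",
    "https://example.com/c": "Installing CUDA 12 on Linux",
}
index = build_index(pages)
print(lookup(index, "pylibcudf"))
```

The point of the structure is that a query never has to scan the pages themselves. It only looks up each token and intersects the resulting URL sets.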
After indexing, Search Engines assess keyword relevance, page quality, freshness of content, and page authority to define a rank. It appears that Google now optimizes for ad revenue to the detriment of its ranking quality. There is a high likelihood that we have to thank Shiv Venkataraman for this.
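Of the signals listed above, only keyword relevance is easy to sketch in a few lines. The example below scores pages with the standard BM25 formula (using the common non-negative variant of the IDF term). Page quality, freshness, and authority are deliberately left out, and the function name `bm25_rank`, the defaults `k1=1.5` and `b=0.75`, and the sample pages are my own assumptions, not anything a real search engine has confirmed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(docs, query, k1=1.5, b=0.75):
    """Return (url, score) pairs sorted by descending BM25 score.

    Assumes `docs` is a non-empty dict of url -> text.
    """
    tokenized = {url: tokenize(text) for url, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for url, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency, dampened and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[url] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical pages, invented for this example.
docs = {
    "https://example.com/a": "Fixing ModuleNotFoundError for pylibcudf on CUDA 12",
    "https://example.com/b": "ModuleNotFoundError when importing numpy",
    "https://example.com/c": "Installing CUDA 12 on Linux",
}
ranked = bm25_rank(docs, "pylibcudf ModuleNotFoundError")
for url, score in ranked:
    print(f"{score:.3f}  {url}")
```

The rare term "pylibcudf" carries more weight than the common "ModuleNotFoundError", which is exactly what I wanted from Google at 11 PM.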
Unreliable LLMs - Answer Engines vs Search Engines
The Internet has changed. Already, roughly 30% of the Internet is run by bots, and that share is growing.
In addition, the way we access the web has changed. We have now entirely different “front-pages” to the Internet for Entertainment, Education, Social Gatherings, and Shopping.
It is interesting to note that all of the top websites in the world provide user-generated content, and in many cases, they are deeply intertwined.
What that means is that a substantial subset of key Internet data is now built on decades-old social data of uncertain quality. It is the same data that was used to train generations of LLMs (GPT-2, GPT-4).
Sure, over time these models get better, but in the meantime, this leads to hilarious (?) responses when searching for questions like “How many rocks should I eat each day?”.
Therefore, from a “Search as a Product” perspective, I would conclude that Google is vulnerable. Well, as vulnerable as a company that dominates 90% of Search worldwide can be.
SearchGPT
In breaking news from yesterday, OpenAI launched SearchGPT, which many people deem a Google killer. But I think these people are wrong, and not only because Google has a 90% market share. ChatGPT is a product on top of the GPT family of LLMs, and that product has now gained the capability to search the web for up-to-date answers. For everyone who is into autonomous agents, that should not come as a surprise.
If you read my Substack, you are way ahead of the game (Tavily, YouTube, LanceDB, and Tool use). The last one in this list, in particular, explains an easy way to run a local Answer engine on your computer. And this is what they are. While colloquially we refer to solutions like AI Overview, SearchGPT, or Perplexity as Search engines, they are actually not. They don’t rank results and present a sorted list to the user. They receive a query, assume its intent, extract key information relevant to the query, and then summarize their “findings” for the user based on their training. In order to do that well, they have to understand the intent of the query. Intent is the silent force driving a wide variety of recommendation, attention, and search algorithms in 2024. Finding intent is hard because it’s like meeting a guide that walks with us through the digital fog toward some faintly remembered purpose, even if we can’t quite articulate it.
The famous “Attention Is All You Need” paper explains that the attention mechanism allows the model to weigh different parts of the input data differently, depending on their relevance to the task at hand. By identifying the intent behind specific inputs, the model can dynamically adjust its attention distribution, enhancing its ability to process information effectively. And that’s why assuming intent is so important. But in many cases, users don’t know their actual intent. So having a UI like ChatGPT, where a query can be refined over several turns, is really important.
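For readers who want to see what "weighing different parts of the input" means mechanically, here is a minimal sketch of the paper's scaled dot-product attention, softmax(QKᵀ/√d_k)·V, for a single query. It is a toy: real models use learned projections, many heads, and matrix libraries, and the vectors below are invented for illustration.

```python
import math


def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights and the weighted sum of the values.
    """
    d_k = len(query)
    # Similarity of the query to each key, scaled by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    # Each output component is a weighted average over all values.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(dim)
    ]
    return weights, output


# Made-up vectors: the query points in the same direction as the first key.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attention(query, keys, values)
print(weights, output)
```

The first key matches the query, so it receives the larger weight and its value dominates the output. Nothing in this arithmetic knows about "intent" as such; whatever the model has learned about it lives in the vectors.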
Google Research explored the nature of intent for the use case “programmers search for code”. And, as we can see in the table below, about 60% of search queries were related to receiving and reviewing code samples.
source - How Developers Search for Code: A Case Study
All of this can already be done in a much more effective way through Anthropic’s Claude, Cursor, or Microsoft’s Copilot. And I think that brings me to the future of Search.
In closing
I think we are combining two broken systems, compromised search engines and unreliable, poorly guard-railed LLMs, and I am uncertain whether this is the right way forward. We need different paradigms from a UI perspective to work with the retrieved results. And I don’t mean only better summaries; we need better sources. We can observe that user-generated content is in many cases intellectual junk food. Fun, easy to consume, easy to forget. Tailor-made to cover the most recent controversy of the day. I believe that API-driven search engines like SerpAPI and DuckDuckGo will take a much more important role in providing agents with neutral datasets that don’t cater to the sensitivities of a Californian 13-year-old. The risk is that we introduce a non-neutral actor with agency between the search result and the user. In the past, I could scroll down to find the source I was looking for. As a human, I could discern what I wanted to explore further. This is now all opaque and hidden.
What if OpenAI prioritizes News Corp’s posts above others? What if the developers of the agent prioritize a certain gender or race that has nothing to do with the quality of the post? We should build tools that make us smarter and more robust in our decision-making, not tools that influence us and make the right information harder to find.
Anyway, that’s all I got for you today. Rant over.