Statistical Search

In addition to powering our semantic APIs, Sturdy Statistics’ probability model also powers our statistical search engine. When you submit a search query, your index model maps the query to its thematic contents. Because our models return structured Bayesian likelihoods, we are able to use a statistically meaningful scoring metric called the Hellinger distance to score each search candidate. Unlike cosine distance, whose values are not well defined and can be used only for ranking, the Hellinger distance score defines the percentage of a document that ties directly to your theme.

This well-defined score enables not only search ranking but also semantic search filtering, with the ability to define a hand-selected hard cutoff.
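The Hellinger distance itself is a standard measure between probability distributions. As a minimal sketch of the idea (not our internal implementation), the score between a query’s topic mixture and a candidate paragraph’s topic mixture can be computed like this, with made-up mixtures for illustration:

import numpy as np

def hellinger(p, q):
    ## Hellinger distance between two discrete probability distributions.
    ## Both inputs must be non-negative and sum to 1. The result is bounded
    ## in [0, 1]: 0 means identical distributions, 1 means disjoint support,
    ## so scores are directly comparable across candidates.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

## Hypothetical topic mixtures: a query mapped onto three themes versus a
## paragraph's inferred mixture over the same themes.
query_mix = [0.70, 0.20, 0.10]
paragraph_mix = [0.55, 0.30, 0.15]
print(hellinger(query_mix, paragraph_mix))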

Focused Example: Google’s Discussions about ‘FX’

We are using two new capabilities of the index.query API: filtering and search. Our query API supports arbitrary SQL conditions in the filter. We leverage DuckDB under the hood and support all of DuckDB’s SQL querying syntax.
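For example, any DuckDB SQL condition over your indexed metadata works as a filter string. The column names below other than ticker are hypothetical and depend on what you indexed:

## The filter used in the example below.
FILTER = "ticker='GOOG'"

## Hypothetical richer filters over your own indexed metadata.
FILTER = "ticker='GOOG' AND quarter IN ('2023Q1', '2024Q2')"
FILTER = "ticker IN ('GOOG', 'MSFT') AND published_at >= '2023-01-01'"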

Search Parameters

In addition to accepting a search term, our query API accepts a semantic_search_cutoff and a semantic_search_weight. The semantic_search_cutoff is a value between 0 and 1 that corresponds to the percentage of the paragraph that focuses on the search term. Our value of .1 below means that at least 10% of the paragraph must focus on our search term. This enables flexible semantic filtering capabilities.

The semantic_search_weight dictates the weight placed on our thematic search score versus our TF-IDF weighted exact match score. Each use case is different, and our API provides the flexibility to tune your indices according to your use case while providing sensible defaults out of the box.
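As a rough mental model only (this is an assumption about how the parameters interact, not the service’s actual scoring code), the two parameters act roughly like this:

def blended_score(semantic_score, exact_score,
                  semantic_search_cutoff=0.1, semantic_search_weight=0.3):
    ## Illustrative sketch only -- not Sturdy Statistics' actual scoring logic.
    ## semantic_score: fraction of the paragraph devoted to the query's theme.
    ## exact_score: TF-IDF weighted exact-match score.
    if semantic_score < semantic_search_cutoff:
        semantic_score = 0.0  ## the cutoff acts as a hard semantic filter
    ## The weight blends the two scores; a weight of 0 ranks by exact match only.
    return (semantic_search_weight * semantic_score
            + (1 - semantic_search_weight) * exact_score)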

Semantic Search Results

In the examples below, you’ll notice that our index surfaced paragraphs that matched not only on FX, but also on foreign exchange, pressures, and slowdown.

SEARCH_QUERY = "fx"
FILTER = "ticker='GOOG'"

df = index.query(SEARCH_QUERY, filters=FILTER, 
                 semantic_search_cutoff=.1, semantic_search_weight=.3, 
                 max_excerpts_per_doc=20, limit=200)
displayText(df.iloc[[0, -1]], highlight=["fx", "foreign", "exchange", "stabilization", "pressures", "slowdown", "pullback"])

Result 1/27

GOOG 2023Q1
  1. Google Advertising Revenue Growth: 24%
  2. Macroeconomic Headwinds: 21%
  3. Alphabet Earnings Calls: 18%
  4. Generative AI in Search: 18%
  5. Collaboration and Gratitude: 13%

Sundar Pichai: Thank you, Jim, and good afternoon, everyone. It’s clear that after a period of significant acceleration in digital spending during the pandemic, the macroeconomic climate has become more challenging. We continue to have an extraordinary business and provide immensely valuable services for people and our partners. For example, during the World Cup final on December 18, Google Search saw its highest query per second volume of all time. And beyond our advertising business, we have strong momentum in Cloud, YouTube subscriptions and hardware.

Result 27/27

GOOG 2024Q2
  1. Foreign Exchange Impact: 44%
  2. Connected TV Advertising: 21%
  3. Alphabet Earnings Calls: 15%
  4. YouTube Shorts Growth: 13%

Philipp Schindler: Yes. Look, this is a great question, first of all. I mean, let’s start with the fact that YouTube performance was very strong in this quarter. And on Shorts specifically in the U.S., I mentioned how the monetization rate of Shorts relative to in-stream viewing has more than doubled in the last 12 months. I think that’s what you were referring to. And yes, we’re obviously very happy with this development.

Exact Match misses 75% of the Results

The exact match search returns only 7 results. We are missing 20 of the 27 matching exchanges because of the restrictiveness of exact matching rules.

## Setting semantic_search_weight to 0 forces an exact-match-only search.
df = index.query(SEARCH_QUERY, filters=FILTER, 
                 semantic_search_cutoff=.1, semantic_search_weight=0, 
                 max_excerpts_per_doc=20, limit=200)
displayText(df.iloc[[0, -1]], highlight=["fx", "foreign", "exchange", "stabilization", "pressures", "slowdown", "pullback"])

Result 1/7

GOOG 2023Q1
  1. Google Advertising Revenue Growth: 53%
  2. Advertising Revenue Trends: 40%
  3. Cloud Performance Metrics: 2%

I’ll highlight 2 other factors that affected our Ads business in Q4. Ruth will provide more detail. In Search and Other, revenues grew moderately year-over-year, excluding the impact of FX, reflecting an increase in retail and travel, offset partially by a decline in finance. At the same time, we saw further pullback in spend by some advertisers in Search in Q4 versus Q3. In YouTube and Network, the year-over-year revenue declines were due to a broadening of pullbacks in advertiser spend in the fourth quarter.

Result 7/7

GOOG 2025Q1
  1. Google Advertising Revenue Growth: 58%
  2. Business Growth Strategies: 33%
  3. Alphabet Earnings Calls: 3%

Anat Ashkenazi: And on the question regarding my comment on lapping the strength in financial services, this is primarily related to the structural changes with regards to insurance, it is more specifically within financial services, it was the insurance segment and we saw that continue, but it was a one-time kind of a step up and then we saw it throughout the year. I am not going to give any specific numbers as to what we expect to see in 2025, but I am pleased with the fact that we are seeing and continue to see strength across really all verticals including retail and exiting the year in a position of strength. If anything, I would highlight as you think about the year, the comments I have made about the impact of FX, as well as the fact that we have one less day of revenue in Q1.

Jumping Back to the High Level

The data from our FX search query is very useful, but that’s a lot to read and digest. Let’s try to summarize that data into a high-level overview. Because our topicSearch API supports the exact same parameters as our query API, we can instantly switch between high-level insights and granular data.

df = index.topicSearch(SEARCH_QUERY, FILTER, semantic_search_cutoff=.1)
df["search_query"] = SEARCH_QUERY
fig = px.sunburst(df, path=["search_query", "short_title"], values="prevalence", hover_data=["topic_id"])
procFig(fig, height=500).show()