Structured Semantic Analysis

Our model granularly annotates every word, sentence, paragraph, and document with topic information. Because this semantic data is inherently structured, we can store it in a tabular format alongside any relevant metadata, which means we can perform complex semantic analyses directly in SQL.
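
Because the data lives in an ordinary table, you can inspect it like any other SQL source. A minimal sketch (it selects only fields that appear in the queries below):

# Peek at one row of the paragraph-level table: metadata columns such as
# ticker and pub_quarter sit alongside the sparse topic-count arrays.
index.queryMeta("""
SELECT ticker, pub_quarter, c_mean_avg_inds, c_mean_avg_vals
FROM paragraph
LIMIT 1
""")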

AI Model Scaling over Time

The SQL statement below is a standard GROUP BY. The only new element is the line (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT. Sturdy Statistics stores thematic content arrays in a sparse format: a list of indices and a list of values. This format provides significant storage and performance optimizations, and we provide a defined set of sparse functions to work with this data.
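
Conceptually, extracting a single topic's value from this sparse representation works like the plain-Python sketch below (an illustration with made-up numbers, not the library's implementation):

# A sparse array stores only the indices that hold non-zero values; extracting
# an entry is a lookup that falls back to 0.0 when the index is absent.
def sparse_extract(i, inds, vals):
    return dict(zip(inds, vals)).get(i, 0.0)

sparse_extract(54, [2, 54, 97], [0.8, 3.1, 1.2])   # -> 3.1
sparse_extract(10, [2, 54, 97], [0.8, 3.1, 1.2])   # -> 0.0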

Below we pass sparse_list_extract the fields c_mean_avg_inds and c_mean_avg_vals. The original c_mean_avg array counts the number of words in each paragraph that have been assigned to each topic. The mean_avg suffix denotes that this value has been averaged over several hundred MCMC samples. This sampling has numerous benefits and is also why our counts are not integers (a very common question we receive).
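
To illustrate with hypothetical numbers: averaging integer word counts across samples yields fractional values.

samples = [2, 3, 3, 2, 3]               # topic word counts across MCMC samples (made-up values)
mean_avg = sum(samples) / len(samples)  # 2.6 -- fractional, even though each sample is an integer count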

TOPIC_ID = 53
# Count, per quarter, the paragraphs in which more than two (fractional) words
# were assigned to the topic; TOPIC_ID + 1 accounts for the sparse arrays'
# 1-based indexing.
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter
ORDER BY pub_quarter 
""")


fig = px.bar(
    df, x="pub_quarter", y="mentions", 
    title=f"Mentions of 'AI Model Scaling'",
)
procFig(fig, title_x=.5).show()

Broken Down by Company

Because this semantic data is stored directly in a SQL table, we can enrich our semantic analysis with metadata. Below, we break down how much each company is discussing the topic AI Model Scaling and when they are talking about it.

TOPIC_ID = 53
# Same aggregation as above, now also grouped by ticker.
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    ticker,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter, ticker
ORDER BY pub_quarter 
""")


fig = px.bar(
    df, x="pub_quarter", y="mentions", color="ticker",
    title=f"Mentions of 'AI Model Scaling'",
)
procFig(fig, title_x=.5).show()
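
The long-format result can also be reshaped for inspection. A small follow-up sketch (assuming df has the pub_quarter, ticker, and mentions columns returned above) pivots it into a quarter-by-ticker table of mention counts:

# Pivot the long-format result into a quarter x ticker table of mention counts.
pivot = df.pivot_table(index="pub_quarter", columns="ticker",
                       values="mentions", aggfunc="sum", fill_value=0)
print(pivot)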

With a Search Filter

In addition to storing the thematic content directly in the SQL tables, we integrate our semantic search engine within SQL. Below we pass the semantic search query "infrastructure" as a filter for our analysis.

TOPIC_ID = 53
# Same aggregation, now restricted to paragraphs matching the semantic
# search query passed below.
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    ticker,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter, ticker
ORDER BY pub_quarter 
""",
search_query="infrastructure")


fig = px.bar(
    df, x="pub_quarter", y="mentions", color="ticker",
    title=f"Mentions of 'AI Model Scaling'",
)
procFig(fig, title_x=.5).show()

Verification & Insights

As always, any high-level insight can be tied back to the underlying data that comprises it. Below, we pull up all examples of AI Model Scaling that focus on the search term Infrastructure during Meta’s 2025Q1 earnings call. We assert that 4 examples are returned (the value our bar chart provides) and display the first- and last-ranked examples.

Note that the last example does not explicitly mention Infrastructure but instead matches on terms such as CapEx and data centers.

# Retrieve the paragraphs matching the topic, the search query, and the metadata filters.
df = index.query(topic_id=TOPIC_ID, search_query='infrastructure', 
                 filters="ticker='META' AND pub_quarter='2025Q1'")
# Show the first- and last-ranked examples, highlighting relevant terms.
displayText(df.iloc[[0, -1]], ["capex", "data", "center", "train", "infrastructure"])
assert len(df) == 4

Result 1/4

META 2025Q1
  1. Business Growth Strategies: 44%
  2. Capital Expenditure Trends: 21%
  3. AI Model Scaling: 13%
  4. Open Source AI Infrastructure: 10%
  5. Growth Initiatives: 5%

Douglas Anmuth: Thanks for taking the questions. One for Mark, one for Susan. Mark, just following up on open source as DeepSeek and other models potentially leverage Llama or others to train faster and cheaper. How does this impact in your view? And what could have been for the trajectory of investment required over a multiyear period? And then, Susan, just as we think about the 60 billion to 65 billion CapEx this year, does the composition change much from last year when you talked about servers as the largest part followed by data centers and networking equipment. And how should we think about that mix between like training and inference just following up on Jan’s post this week? Thanks.

Result 4/4

META 2025Q1
  1. Zuckerberg on Business Strategies: 34%
  2. AI Model Scaling: 31%
  3. Capital Expenditure Trends: 6%
  4. AI Capital Investment Strategy: 6%
  5. AWS Capital Investments: 5%

But overall, I would reiterate what Mark said. We are committed to building leading foundation models and applications. We expect that we’re going to make big investments to support our training and inference objectives, and we don’t know exactly where we are in the cycle of that yet.
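
For an additional sanity check, the SQL aggregation used above can be re-run restricted to this slice. The sketch below assumes queryMeta accepts a plain WHERE clause alongside the search_query parameter:

check = index.queryMeta(f"""
SELECT 
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
WHERE ticker='META' AND pub_quarter='2025Q1'
""",
search_query="infrastructure")
# The aggregated count should agree with the number of examples retrieved above.
assert check["mentions"].iloc[0] == len(df)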

Zoom Out

At any point, we can also zoom back out to the high-level topic view. Instead of focusing on the AI Model Scaling topic, we can see everything Meta discussed about infrastructure during its 2025Q1 earnings call.

# Retrieve all topics matching the search query for Meta's 2025Q1 call.
topic_df = index.topicSearch("infrastructure", filters="ticker='META' and pub_quarter='2025Q1'")
topic_df["title"] = "Meta 2025Q1 Discussions"
fig = px.sunburst(
    topic_df, 
    path=["title", "short_title"], 
    values="prevalence", 
    hover_data=["topic_id", "mentions"]
)
procFig(fig, height=500).show()