Structured Semantic Analysis

Our model annotates every word, sentence, paragraph, and document with granular topic information. Because this semantic data is structured, we can store it in a tabular format alongside any relevant metadata and perform complex semantic analyses directly in SQL.

AI Model Scaling over Time

The SQL statement below is a standard GROUP BY. The only new element is the expression (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT. Sturdy Statistics stores thematic content arrays in a sparse format: a list of indices paired with a list of values. This format provides significant storage and performance benefits, and we expose a defined set of sparse functions for working with this data.
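To make the sparse format concrete, here is a minimal pure-Python sketch of what sparse_list_extract computes. The function body and the 1-based indexing are assumptions inferred from the SQL above, not the actual Sturdy Statistics implementation.

```python
def sparse_list_extract(i, inds, vals):
    """Return the value stored at (assumed 1-based) position i,
    or 0.0 if position i is absent from the sparse arrays."""
    for ind, val in zip(inds, vals):
        if ind == i:
            return val
    return 0.0

# A paragraph whose non-zero topic counts are {3: 4.12, 54: 2.87},
# stored sparsely as parallel index/value lists:
c_mean_avg_inds = [3, 54]
c_mean_avg_vals = [4.12, 2.87]

TOPIC_ID = 53
count = sparse_list_extract(TOPIC_ID + 1, c_mean_avg_inds, c_mean_avg_vals)
mention = int(count > 2.00)  # 1: this paragraph counts as a "mention"
```

Summing this 0/1 mention flag within each group is exactly what the query's sum((... > 2.00)::INT) accomplishes.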

Here we pass it the fields c_mean_avg_inds and c_mean_avg_vals. The underlying c_mean_avg array counts the number of words in each paragraph that have been assigned to each topic. The mean_avg prefix indicates that this count has been averaged over several hundred MCMC samples. This sampling has numerous benefits and is also why our counts are not integers (a very common question we receive).
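The non-integer counts are easy to reproduce with a toy example: each MCMC sample assigns an integer number of a paragraph's words to a topic, and averaging those integers over samples yields a fractional value. The numbers below are illustrative, not real model output.

```python
# Word counts one paragraph assigns to a single topic across six
# (illustrative) MCMC samples; the real model uses several hundred.
sample_counts = [3, 4, 4, 3, 5, 4]

# Averaging integer counts over samples produces a fractional value.
c_mean_avg = sum(sample_counts) / len(sample_counts)
print(c_mean_avg)  # ~3.83
```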

TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter
ORDER BY pub_quarter 
""")


fig = px.bar(
    df, x="pub_quarter", y="mentions",
    title="Mentions of 'AI Model Scaling'",
)
fig.update_layout(title_x=0.5).show()

Broken Down by Company

Because this semantic data is stored directly in a SQL table, we can enrich our semantic analysis with metadata. Below, we break down how much each company is discussing the topic AI Model Scaling and when they are talking about it.

TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    ticker,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter, ticker
ORDER BY pub_quarter 
""")


fig = px.bar(
    df, x="pub_quarter", y="mentions", color="ticker",
    title="Mentions of 'AI Model Scaling'",
)
fig.update_layout(title_x=0.5).show()

With a Search Filter

In addition to storing thematic content directly in SQL tables, we integrate our semantic search engine with SQL. Below, we pass the semantic search query "infrastructure" as a filter for our analysis.

TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    ticker,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter, ticker
ORDER BY pub_quarter 
""",
search_query="infrastructure")


fig = px.bar(
    df, x="pub_quarter", y="mentions", color="ticker",
    title="Mentions of 'AI Model Scaling'",
)
fig.update_layout(title_x=0.5).show()

Verification & Insights

As always, any high-level insight can be tied back to the underlying data that comprises it. Below, we pull up all examples of AI Model Scaling that match the search term "infrastructure" during Meta's 2025Q1 earnings call. We assert that exactly 4 examples are returned (the value our bar chart shows) and display the first- and last-ranked examples.

Note that the last example does not explicitly mention Infrastructure but instead matches on terms such as CapEx and data centers.

df = index.query(topic_id=TOPIC_ID, search_query='infrastructure',
                 filters="ticker='META' AND pub_quarter='2025Q1'")
assert len(df) == 4
words_to_highlight = topicWords.loc[topicWords.topic_id == TOPIC_ID].topic_words.explode().tolist()
display_text(df.iloc[[0, -1]], words_to_highlight)

Result 1/4

META 2025Q1
  1. Zuckerberg on Business Strategies: 53%
  2. AI Model Scaling: 24%
  3. Open Source AI Infrastructure: 11%
  4. Growth Initiatives: 5%
  5. Capital Expenditure Trends: 2%

Mark Zuckerberg: I can start on the DeepSeek question. I think there’s a number of novel things that they did that I think we’re still digesting. And there are a number of things that they have advances that we will hope to implement in our systems. And that’s part of the nature of how this works, whether it’s a Chinese competitor or not. I kind of expect that every new company that has an advance – that has a launch is going to have some new advances that the rest of the field learns from. And that’s sort of how the technology industry goes. I don’t know – it’s probably too early to really have a strong opinion on what this means for the trajectory around infrastructure and CapEx and things like that. There are a bunch of trends that are happening here all at once. There’s already sort of a debate around how much of the compute infrastructure that we’re using is going to go towards pretraining versus as you get more of these reasoning time models or reasoning models where you get more of the intelligence by putting more of the compute into inference, whether just will mix shift how we use our compute infrastructure towards that. That was already something that I think a lot of the other labs and ourselves were starting to think more about and already seemed pretty likely even before this, that – like of all the compute that we’re using, that the largest pieces aren’t necessarily going to go towards pre-training.

Result 4/4

META 2025Q1
  1. AI Capital Investment Strategy: 59%
  2. Capital Expenditure Trends: 25%
  3. AI Infrastructure Investment: 8%
  4. AI Supercomputing via Hopper: 3%

Susan Li: I’m happy to add a little more color about our 2025 CapEx plans to your second question. So we certainly expect that 2025 CapEx is going to grow across all three of those components you described. Servers will be the biggest growth driver that remains the largest portion of our overall CapEx budget. We expect both growth in AI capacity as we support our gen AI efforts and continue to invest meaningfully in core AI, but we are also expecting growth in non-AI capacity as we invest in the core business, including to support a higher base of engagement and to refresh our existing servers. On the data center side, we’re anticipating higher data center spend in 2025 to be driven by build-outs of our large training clusters and our higher power density data centers that are entering the core construction phase. We’re expecting to use that capacity primarily for core AI and non-AI use cases. On the networking side, we expect networking spend to grow in ’25 as we build higher-capacity networks to accommodate the growth in non-AI and core AI-related traffic along with our large Gen AI training clusters. We’re also investing in fiber to handle future cross-region training traffic. And then in terms of the breakdown for core versus Gen AI use cases, we’re expecting total infrastructure spend within each of Gen AI, non-AI and core AI to increase in ’25 with the majority of our CapEx directed to our core business with some caveat that, that is – that’s not easy to measure perfectly as the data centers we’re building can support AI or non-AI workloads and the GPU-based servers, we procure for gen AI can be repurposed for core AI use cases and so on and so forth.

Zoom Out

At any point, we can also zoom back out to the high-level topic view. Instead of focusing on the AI Model Scaling topic, we can zoom out and see everything Meta discussed about infrastructure during its 2025Q1 earnings call.

topic_df = index.topicSearch("infrastructure", filters="ticker='META' and pub_quarter='2025Q1'")
topic_df["title"] = "Meta 2025Q1 Discussions"
fig = px.sunburst(
    topic_df, 
    path=["title", "short_title"], 
    values="prevalence", 
    hover_data=["topic_id", "mentions"]
)
fig.update_layout(height=500).show()

What’s Next?

Part IV: Statistically Tuned Search
Part V: Custom Index Creation