Quickstart

Intro | Earnings Calls

Author: Kian Ghodoussi
Published: March 4, 2025

Introduction

This notebook gets you up and running quickly with the Sturdy Statistics API. We will create a series of visualizations to explore the past two years of earnings calls from Google, Microsoft, Amazon, NVIDIA, and Meta. An earnings call is a quarterly event during which a public company discusses the results of the past quarter and takes questions from its investors. These calls offer a uniquely candid glimpse into both the company’s outlook and that of the tech industry as a whole.

Reproduce the Results

We will be using an index from our pretrained gallery for this analysis. You can sign up on our website to generate a free API key (no payment info required) to run this notebook yourself and to upload your own data for analysis.
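
If you want to run the notebook yourself, make your API key available to the SDK before constructing an Index. The sketch below is illustrative only: it assumes the SDK reads the key from an environment variable, and the variable name shown is an assumption, so check the SDK documentation for the exact configuration it expects.

import os

# Assumption: the sturdystats SDK picks up your key from an environment
# variable. The variable name below is illustrative; see the docs for the
# exact name or constructor argument the SDK expects.
os.environ["STURDY_STATS_API_KEY"] = "your-api-key-here"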

Prerequisites

pip install sturdy-stats-sdk pandas numpy plotly

Code
# pip install sturdy-stats-sdk pandas numpy plotly duckdb
from IPython.display import display, Markdown, Latex
import pandas as pd
import numpy as np
import plotly.express as px
from sturdystats import Index, Job

from pprint import pprint

index_id = "index_c6394fde5e0a46d1a40fb6ddd549072e"  # pretrained earnings-call index from the gallery
index = Index(id=index_id)
Code
## Basic Utilities
px.defaults.template = "simple_white"                               # Change the template
px.defaults.color_discrete_sequence = px.colors.qualitative.Dark24  # Change the color sequence

def procFig(fig, **kwargs):
    """Apply a transparent background, tight margins, and static axes to a plotly figure."""
    fig.update_layout(
        plot_bgcolor="rgba(0, 0, 0, 0)", paper_bgcolor="rgba(0, 0, 0, 0)",
        margin=dict(l=0, r=0, b=0, t=30, pad=0),
        **kwargs
    )
    fig.layout.xaxis.fixedrange = True
    fig.layout.yaxis.fixedrange = True
    return fig

def displayText(df, highlight):
    """Render each row's excerpt as markdown, listing its top paragraph-level
    topic tags and bolding any of the highlight terms."""
    def processText(row):
        # Top 5 topic tags with the share of the paragraph each one comprises
        t = "\n".join([f'1. {r["short_title"]}: {int(r["prevalence"]*100)}%' for r in row["paragraph_topics"][:5]])
        x = row["text"].replace("*", "").replace("$", "")
        res = []
        for word in x.split(" "):
            for term in highlight:
                if term.lower() in word.lower() and "**" not in word:
                    word = "**" + word + "**"
            res.append(word)
        return f"<em>\n\n#### Result {row.name+1}/{df.index.max()+1}\n\n##### {row['ticker']} {row['pub_quarter']}\n\n" + t + "\n\n" + " ".join(res) + "</em>"

    res = df.apply(processText, axis=1).tolist()
    display(Markdown("\n\n...\n\n".join(res)))

Structured Exploration

The Index Object

The core building block in the Sturdy Statistics NLP toolkit is the Index. Each Index is a set of documents and metadata that has been structured, or “indexed,” by our hierarchical Bayesian probability mixture model. Below we connect to an Index that has already been trained by our earnings transcripts integration.

index = Index(id="index_c6394fde5e0a46d1a40fb6ddd549072e") 
Found an existing index with id="index_c6394fde5e0a46d1a40fb6ddd549072e".

Topic Search

The first API we will explore is the topicSearch API. This API provides a direct interface to the high-level themes that our index extracts. You can call it with no arguments to get a list of topics ordered by how often they occur in the dataset (prevalence). The resulting data is a structured rollup of all the data in the corpus: it aggregates the topic annotations across each word, paragraph, and document and generates high-level semantic statistics.


Mentions refers to the number of paragraphs in which the topic occurs. Prevalence refers to the total percentage of all data that a topic comprises.

topic_df = index.topicSearch()
topic_df.head()[["topic_id", "short_title", "topic_group_short_title", "mentions", "prevalence"]]
   topic_id  short_title                        topic_group_short_title     mentions  prevalence
0       159  Accelerated Computing Systems      Technological Developments     359.0    0.042775
1       139  Consumer Behavior Insights         Growth Strategies              585.0    0.033129
2       108  Cloud Performance Metrics          Investment and Financials      157.0    0.026985
3       115  Zuckerberg on Business Strategies  Corporate Strategy             420.0    0.026971
4       127  Comprehensive Security Solutions   Investment and Financials      146.0    0.023265
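
Before moving on, it can help to see the two measures side by side. The snippet below is an illustrative aside (not part of the original notebook): it sorts the same dataframe by each measure, since a topic that appears in many paragraphs (high mentions) need not be the topic that accounts for the largest share of the corpus (high prevalence).

cols = ["short_title", "mentions", "prevalence"]
# Mentions counts paragraphs containing the topic; prevalence measures the
# topic's share of all data in the corpus, so the two orderings can differ.
print(topic_df.sort_values("mentions", ascending=False)[cols].head())
print(topic_df.sort_values("prevalence", ascending=False)[cols].head())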

Hierarchical Visualization

We visualize this thematic data in the sunburst visualization below. The inner circle of the sunburst is the title of the plot, the middle layer is the topic groups, and the leaf nodes are the topics that belong to the corresponding topic group. The size of each node is proportional to how often it shows up in the dataset.

topic_df["title"] = "Tech <br> Earnings Calls"
fig = px.sunburst(
    topic_df, 
    path=["title", "topic_group_short_title", "short_title"], 
    values="prevalence", 
    hover_data=["topic_id", "mentions"]
)
fig = procFig(fig, height=500)
fig.show()

Granular Retrieval

The topic search API (along with our other semantic APIs) produces high-level insights. In order to both dive deeper into and verify these insights, we provide a mechanism to retrieve the underlying data with our query API. This API shares a unified filtering engine with our higher-level semantic APIs: any semantic rollup or insight aggregation can be instantly “unrolled”.

Example: AI Model Scaling

Let’s take the topic AI Model Scaling. We can uncover the topic metadata below and see that it was mentioned 81 times in the corpus.

TOPIC_ID = 53
row = topic_df.loc[topic_df.topic_id == TOPIC_ID]
row[["topic_id", "short_title", "mentions"]]
    topic_id  short_title       mentions
81        53  AI Model Scaling      81.0

Query: AI Model Scaling

We can call the index.query API, passing in our topic_id. We can see that 81 mentions are returned, lining up exactly with our aggregate APIs. Below we display the first and last result of our search and highlight a few terms to make the excerpts easier to read.

You will notice that accompanying each excerpt is a set of tags. These are the same tags that are returned by our topicSearch API. Here each tag corresponds to the percentage of the paragraph that it comprises.

df = index.query(topic_id=TOPIC_ID, max_excerpts_per_doc=200, limit=200) ## 200 is the single request limit
displayText(df.iloc[[0, -1]], highlight=["ai", "generation", "train", "data", "center", "scale"])
assert len(df) == row.iloc[0].mentions 

Result 1/81

MSFT 2022Q4
  1. AI and Cloud Optimization: 47%
  2. Microsoft Investment in AI Partnerships: 27%
  3. AI-Driven Developer Productivity: 15%
  4. AI Model Scaling: 6%

Satya Nadella: Thanks for the question. First, yes, the OpenAI partnership is a very critical partnership for us. Perhaps, it’s sort of important to call out that we built the supercomputing capability inside of Azure, which is highly differentiated, the way computing the network, in particular, come together in order to support these large-scale training of these platform models or foundation models has been very critical. That’s what’s driven, in fact, the progress OpenAI has been making. And of course, we then productized it as part of Azure OpenAI services. And that’s what you’re seeing both being used by our own first-party applications, whether it’s the GitHub Copilot or Design even inside match. And then, of course, the third parties like Mattel. And so, we’re very excited about that. We have a lot sort of more sort of talk about when it comes to GitHub universe. I think you’ll see more advances on the GitHub Copilot, which is off to a fantastic start. But overall, this is an area of huge investment. The AI comment clearly has arrived. And it’s going to be part of every product, whether it’s, in fact, you mentioned Power Platform, because that’s another area we are innovating in terms of corporate all of these AI models.

Result 81/81

META 2025Q1
  1. AI Capital Investment Strategy: 58%
  2. Accelerated Computing Systems: 24%
  3. AWS Generative AI Innovations: 6%
  4. Open Source AI Infrastructure: 4%
  5. Advertising Automation: 2%

Susan Li: Brian, I’m happy to take your second question about custom silicon. So first of all, we expect that we are continuing to purchase third-party silicon from leading providers in the industry. And we are certainly committed to those long-standing partnerships, but we’re also very invested in developing our own custom silicon for unique workloads, where off-the-shelf silicon isn’t necessarily optimal and specifically, because we’re able to optimize the full stack to achieve greater compute efficiency and performance per cost and power because our workloads might require a different mix of memory versus network, bandwidth versus compute and so we can optimize that really to the specific needs of our different types of workloads. Right now, the in-house MTIA program is focused on supporting our core ranking and recommendation inference workloads. We started adopting MTIA in the first half of 2024 for core ranking and recommendations inference. We’ll continue ramping adoption for those workloads over the course of 2025 as we use it for both incremental capacity and to replace some GPU-based servers when they reach the end of their useful lives. Next year, we’re hoping to expand MTIA to support some of our core AI training workloads and over time, some of our Gen AI use cases.

Semantic Analysis in SQL

Our model granularly annotates every word, sentence, paragraph and document with topic information. The structured nature of our semantic data allows us to store this data in a structured tabular format alongside any relevant metadata. This means we can perform complex semantic analyses directly in SQL.

AI Model Scaling over Time

The SQL statement below is a standard group by. The only new content is the line: (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT. Sturdy Statistics stores thematic content arrays in a sparse format of a list of indices and a list of values. This format provides significant storage and performance optimizations. We use a defined set of sparse functions to work with this data.

Below we give it the fields c_mean_avg_inds and c_mean_avg_vals. The original c_mean_avg array is a count of the number of words in each paragraph that have been assigned to a topic. The mean_avg denotes that this value has been accumulated over several hundred MCMC samples. This sampling has numerous benefits and is also why our counts are not integers (a very common question we receive).
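
To make the sparse format concrete, here is a small, purely illustrative Python sketch of what an extraction like the one above does conceptually. The real sparse_list_extract is a SQL function provided by the engine; the exact semantics assumed here (1-indexed topic ids, 0 returned for absent topics) are for the sake of the example only.

# Illustrative only: a toy version of sparse extraction. A paragraph's
# per-topic word counts are stored as parallel lists of indices and values;
# extracting topic k returns its value, or 0.0 if the topic is absent.
def sparse_extract(k, inds, vals):
    for i, v in zip(inds, vals):
        if i == k:
            return v
    return 0.0

# e.g. a paragraph whose words mostly landed in topics 54 and 112
# (counts are fractional because they are averaged over many MCMC samples)
inds, vals = [54, 112, 140], [3.2, 1.1, 0.4]
print(sparse_extract(54, inds, vals))  # 3.2 -> "> 2.00" would flag this paragraph as a mention
print(sparse_extract(53, inds, vals))  # 0.0 -> topic 53 absent from this paragraph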

TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter
ORDER BY pub_quarter 
""")


fig = px.bar(
    df, x="pub_quarter", y="mentions", 
    title=f"Mentions of 'AI Model Scaling'",
   # line_shape="hvh",
)
procFig(fig, title_x=.5).show()

Broken Down by Company

Because this semantic data is stored directly in a SQL table, we can enrich our semantic analysis with metadata. Below, we break down how much each company is discussing the topic AI Model Scaling and when they are discussing it.

TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    ticker,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter, ticker
ORDER BY pub_quarter 
""")


fig = px.bar(
    df, x="pub_quarter", y="mentions", color="ticker",
    title=f"Mentions of 'AI Model Scaling'",
)
procFig(fig, title_x=.5).show()

With a Search Filter

In addition to storing the thematic content directly in the SQL tables, we integrate our semantic search engine within SQL. Below we pass the semantic search query “infrastructure” as a filter for our analysis.

TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT 
    pub_quarter,
    ticker,
    sum(
        (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
    ) as mentions
FROM paragraph
GROUP BY pub_quarter, ticker
ORDER BY pub_quarter 
""",
search_query="infrastructure")


fig = px.bar(
    df, x="pub_quarter", y="mentions", color="ticker",
    title=f"Mentions of 'AI Model Scaling'",
)
procFig(fig, title_x=.5).show()

Verification & Insights

As always, any high-level insight can be tied back to the underlying data that comprises it. Below, we pull up all examples of AI Model Scaling that focus on the search term infrastructure during Meta’s 2025Q1 earnings call. We assert that 4 examples are returned (the value our bar chart shows) and display the first and last ranked example.

Note that the last example does not explicitly mention infrastructure but instead matches on terms such as CapEx and data centers.

df = index.query(topic_id=TOPIC_ID, search_query='infrastructure', 
                 filters="ticker='META' AND pub_quarter='2025Q1'")
displayText(df.iloc[[0, -1]], ["capex", "data", "center", "train", "infrastructure"])
assert len(df) == 4

Result 1/4

META 2025Q1
  1. Business Growth Strategies: 44%
  2. Capital Expenditure Trends: 21%
  3. AI Model Scaling: 13%
  4. Open Source AI Infrastructure: 10%
  5. Growth Initiatives: 5%

Douglas Anmuth: Thanks for taking the questions. One for Mark, one for Susan. Mark, just following up on open source as DeepSeek and other models potentially leverage Llama or others to train faster and cheaper. How does this impact in your view? And what could have been for the trajectory of investment required over a multiyear period? And then, Susan, just as we think about the 60 billion to 65 billion CapEx this year, does the composition change much from last year when you talked about servers as the largest part followed by data centers and networking equipment. And how should we think about that mix between like training and inference just following up on Jan’s post this week? Thanks.

Result 4/4

META 2025Q1
  1. Zuckerberg on Business Strategies: 34%
  2. AI Model Scaling: 31%
  3. Capital Expenditure Trends: 6%
  4. AI Capital Investment Strategy: 6%
  5. AWS Capital Investments: 5%

But overall, I would reiterate what Mark said. We are committed to building leading foundation models and applications. We expect that we’re going to make big investments to support our training and inference objectives, and we don’t know exactly where we are in the cycle of that yet.

Zoom Out

At any point, we can also zoom back out to the high-level topic view. Instead of focusing on the AI Model Scaling topic, we can zoom out and see everything Meta discussed about infrastructure during its 2025Q1 earnings call.

topic_df = index.topicSearch("infrastructure", filters="ticker='META' and pub_quarter='2025Q1'")
topic_df["title"] = "Meta 2025Q1 Discussions"
fig = px.sunburst(
    topic_df, 
    path=["title", "short_title"], 
    values="prevalence", 
    hover_data=["topic_id", "mentions"]
)
procFig(fig, height=500).show()

Unlock Your Unstructured Data Today

from sturdystats import Index

# Create an index for your own data, upload documents, and train the model
index = Index("Custom Analysis")
index.upload(df.to_dict("records"))
index.commit()
index.train()

# Ready to Explore
index.topicSearch()

More Examples