This notebook gets you up and running quickly with the Sturdy Statistics API. We will create a series of visualizations to explore the past two years of earnings calls from Google, Microsoft, Amazon, NVIDIA, and Meta. An earnings call is a quarterly event during which a public company discusses the results of the past quarter and takes questions from its investors. These calls offer a uniquely candid glimpse into both the company’s outlook and that of the tech industry as a whole.
Reproduce the Results
We will be using an index from our pretrained gallery for this analysis. You can sign up on our website to generate a free API key (no payment info required) to run this notebook yourself and to upload your own data for analysis.
Basic Utilities

```python
px.defaults.template = "simple_white"  # Change the template
px.defaults.color_discrete_sequence = px.colors.qualitative.Dark24  # Change color sequence

def procFig(fig, **kwargs):
    fig.update_layout(
        plot_bgcolor="rgba(0, 0, 0, 0)",
        paper_bgcolor="rgba(0, 0, 0, 0)",
        margin=dict(l=0, r=0, b=0, t=30, pad=0),
        **kwargs,
    )
    fig.layout.xaxis.fixedrange = True
    fig.layout.yaxis.fixedrange = True
    return fig

def displayText(df, highlight):
    def processText(row):
        t = "\n".join([
            f'1. {r["short_title"]}: {int(r["prevalence"]*100)}%'
            for r in row["paragraph_topics"][:5]
        ])
        x = row["text"].replace("*", "").replace("$", "")
        res = []
        for word in x.split(" "):
            for term in highlight:
                if term.lower() in word.lower() and "**" not in word:
                    word = "**" + word + "**"
            res.append(word)
        return (
            f"<em>\n\n#### Result {row.name+1}/{df.index.max()+1}"
            f"\n\n##### {row['ticker']} {row['pub_quarter']}\n\n"
            + t + "\n\n" + " ".join(res) + "</em>"
        )
    res = df.apply(processText, axis=1).tolist()
    display(Markdown("\n\n...\n\n".join(res)))
```
Structured Exploration
The Index Object
The core building block in the Sturdy Statistics NLP toolkit is the Index. Each Index is a set of documents and metadata that has been structured or “indexed” by our hierarchical Bayesian probability mixture model. Below we connect to an Index that has already been trained by our earnings transcripts integration.
index = Index(id="index_c6394fde5e0a46d1a40fb6ddd549072e")
Found an existing index with id="index_c6394fde5e0a46d1a40fb6ddd549072e".
Topic Search
The first API we will explore is the topicSearch API. It provides a direct interface to the high-level themes our index extracts. Calling it with no arguments returns a list of topics ordered by how often they occur in the dataset (prevalence). The resulting data is a structured rollup of the entire corpus: it aggregates the topic annotations across every word, paragraph, and document and generates high-level semantic statistics.
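To make the idea of a “structured rollup” concrete, here is a minimal sketch with made-up annotations. The real topicSearch response carries richer fields (topic ids, titles, group-level statistics), so treat the shapes and numbers below as illustrative assumptions, not the actual API schema:

```python
from collections import defaultdict

# Hypothetical word-level topic annotations: (topic_group, topic, word_count).
annotations = [
    ("AI", "AI Model Scaling", 40),
    ("AI", "AI Capital Investment Strategy", 25),
    ("Cloud", "AWS Generative AI Innovations", 35),
]

def rollup(rows):
    """Aggregate word counts into per-topic prevalence, grouped by topic group."""
    totals = defaultdict(lambda: defaultdict(int))
    grand = sum(n for _, _, n in rows)
    for group, topic, n in rows:
        totals[group][topic] += n
    return {
        group: {topic: n / grand for topic, n in topics.items()}
        for group, topics in totals.items()
    }

prevalence = rollup(annotations)
print(prevalence["AI"]["AI Model Scaling"])  # 0.4
```

The same accumulation, run over every annotated word in the corpus, is what produces the prevalence ordering described above.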
We visualize this thematic data in the sunburst chart below. The inner circle of the sunburst is the title of the plot, the middle layer contains the topic groups, and the leaf nodes are the topics that belong to each group. The size of each node is proportional to how often it appears in the dataset.
The topic search API (along with our other semantic APIs) produces high-level insights. To both dive deeper into and verify these insights, we provide a mechanism to retrieve the underlying data with our query API. This API shares a unified filtering engine with our higher-level semantic APIs, so any semantic rollup or insight aggregation can be instantly “unrolled”.
Example: AI Model Scaling
Let’s take the topic AI Model Scaling. We can uncover the topic metadata below and see that it was mentioned 81 times in the corpus.
We can call the index.query API, passing in our topic_id. We can see that 81 mentions are returned, lining up exactly with our aggregate APIs. Below we display the first and last result of our search, and highlight a few terms to make the excerpts easier to read.
You will notice that accompanying each excerpt is a set of tags. These are the same tags that are returned by our topicSearch API. Here each tag corresponds to the percentage of the paragraph that the topic comprises.
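As a small illustration of how those tags are rendered, the sketch below formats hypothetical paragraph_topics entries the same way the displayText helper above does. The field names (short_title, prevalence) mirror that helper; the values are made up:

```python
# Hypothetical paragraph_topics entries: each carries the share of the
# paragraph's words assigned to that topic.
paragraph_topics = [
    {"short_title": "AI and Cloud Optimization", "prevalence": 0.47},
    {"short_title": "AI Model Scaling", "prevalence": 0.06},
]

# Render each topic as a "title: NN%" tag, as in displayText above.
tags = [f'{t["short_title"]}: {int(t["prevalence"] * 100)}%' for t in paragraph_topics]
print(tags)  # ['AI and Cloud Optimization: 47%', 'AI Model Scaling: 6%']
```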
```python
df = index.query(topic_id=TOPIC_ID, max_excerpts_per_doc=200, limit=200)  # 200 is the single-request limit
displayText(df.iloc[[0, -1]], highlight=["ai", "generation", "train", "data", "center", "scale"])
assert len(df) == row.iloc[0].mentions
```
Result 1/81
MSFT 2022Q4
AI and Cloud Optimization: 47%
Microsoft Investment in AI Partnerships: 27%
AI-Driven Developer Productivity: 15%
AI Model Scaling: 6%
Satya Nadella: Thanks for the question. First, yes, the OpenAI partnership is a very critical partnership for us. Perhaps, it’s sort of important to call out that we built the supercomputing capability inside of Azure, which is highly differentiated, the way computing the network, in particular, come together in order to support these large-scale training of these platform models or foundation models has been very critical. That’s what’s driven, in fact, the progress OpenAI has been making. And of course, we then productized it as part of Azure OpenAI services. And that’s what you’re seeing both being used by our own first-party applications, whether it’s the GitHub Copilot or Design even inside match. And then, of course, the third parties like Mattel. And so, we’re very excited about that. We have a lot sort of more sort of talk about when it comes to GitHub universe. I think you’ll see more advances on the GitHub Copilot, which is off to a fantastic start. But overall, this is an area of huge investment. The AI comment clearly has arrived. And it’s going to be part of every product, whether it’s, in fact, you mentioned Power Platform, because that’s another area we are innovating in terms of corporate all of these AI models.
…
Result 81/81
META 2025Q1
AI Capital Investment Strategy: 58%
Accelerated Computing Systems: 24%
AWS Generative AI Innovations: 6%
Open Source AI Infrastructure: 4%
Advertising Automation: 2%
Susan Li: Brian, I’m happy to take your second question about custom silicon. So first of all, we expect that we are continuing to purchase third-party silicon from leading providers in the industry. And we are certainly committed to those long-standing partnerships, but we’re also very invested in developing our own custom silicon for unique workloads, where off-the-shelf silicon isn’t necessarily optimal and specifically, because we’re able to optimize the full stack to achieve greater compute efficiency and performance per cost and power because our workloads might require a different mix of memory versus network, bandwidth versus compute and so we can optimize that really to the specific needs of our different types of workloads. Right now, the in-house MTIA program is focused on supporting our core ranking and recommendation inference workloads. We started adopting MTIA in the first half of 2024 for core ranking and recommendations inference. We’ll continue ramping adoption for those workloads over the course of 2025 as we use it for both incremental capacity and to replace some GPU-based servers when they reach the end of their useful lives. Next year, we’re hoping to expand MTIA to support some of our core AI training workloads and over time, some of our Gen AI use cases.
Semantic Analysis in SQL
Our model granularly annotates every word, sentence, paragraph, and document with topic information. The structured nature of our semantic data allows us to store it in a tabular format alongside any relevant metadata. This means we can perform complex semantic analyses directly in SQL.
AI Model Scaling over Time
The SQL statement below is a standard GROUP BY. The only new piece is the line `(sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT`. Sturdy Statistics stores thematic content arrays in a sparse format: a list of indices and a list of values. This format provides significant storage and performance optimizations, and we provide a defined set of sparse functions to work with this data.
Below we pass it the fields c_mean_avg_inds and c_mean_avg_vals. The original c_mean_avg array counts the number of words in each paragraph that have been assigned to each topic. The mean_avg suffix denotes that this value has been averaged over several hundred MCMC samples. This sampling has numerous benefits and is also why our counts are not integers (a very common question we receive).
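To make the predicate concrete, here is a plain-Python sketch of what we assume sparse_list_extract computes: a lookup into the (indices, values) pair, with absent indices treated as zero. The one-based indexing (hence TOPIC_ID+1 in the query) is also an assumption based on the offset used above:

```python
def sparse_list_extract(idx, inds, vals, default=0.0):
    """Assumed semantics: return the value stored at (one-based) position
    `idx` in a sparse (indices, values) pair; missing indices are zeros."""
    for i, v in zip(inds, vals):
        if i == idx:
            return v
    return default

# One paragraph's sparse topic counts: topic 54 received ~3.2 words
# (an MCMC-averaged count, hence non-integer), topic 12 received ~0.9.
inds, vals = [12, 54], [0.9, 3.2]

# The query's predicate: count the paragraph as a "mention" when more
# than two words were assigned to the topic.
mention = int(sparse_list_extract(54, inds, vals) > 2.00)
print(mention)  # 1
```

Summing this 0/1 flag per group is exactly what the GROUP BY below does to produce the mentions column.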
```python
TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT pub_quarter,
       sum(
           (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
       ) AS mentions
FROM paragraph
GROUP BY pub_quarter
ORDER BY pub_quarter
""")
fig = px.bar(
    df,
    x="pub_quarter",
    y="mentions",
    title="Mentions of 'AI Model Scaling'",
    # line_shape="hvh",
)
procFig(fig, title_x=.5).show()
```
Broken Down by Company
Because this semantic data is stored directly in a SQL table, we can enrich our semantic analysis with metadata. Below, we break down how much each company is discussing the topic AI Model Scaling and when they are talking about it.
```python
TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT pub_quarter,
       ticker,
       sum(
           (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
       ) AS mentions
FROM paragraph
GROUP BY pub_quarter, ticker
ORDER BY pub_quarter
""")
fig = px.bar(
    df,
    x="pub_quarter",
    y="mentions",
    color="ticker",
    title="Mentions of 'AI Model Scaling'",
)
procFig(fig, title_x=.5).show()
```
With a Search Filter
In addition to storing the thematic content directly in the SQL tables, we integrate our semantic search engine within SQL. Below we pass the semantic search query "infrastructure" as a filter for our analysis.
```python
TOPIC_ID = 53
df = index.queryMeta(f"""
SELECT pub_quarter,
       ticker,
       sum(
           (sparse_list_extract({TOPIC_ID+1}, c_mean_avg_inds, c_mean_avg_vals) > 2.00)::INT
       ) AS mentions
FROM paragraph
GROUP BY pub_quarter, ticker
ORDER BY pub_quarter
""", search_query="infrastructure")
fig = px.bar(
    df,
    x="pub_quarter",
    y="mentions",
    color="ticker",
    title="Mentions of 'AI Model Scaling'",
)
procFig(fig, title_x=.5).show()
```
Verification & Insights
As always, any high-level insight can be tied back to the underlying data that comprises it. Below, we pull up all examples of AI Model Scaling that match the search term infrastructure during Meta’s 2025Q1 earnings call. We assert that 4 examples are returned (the value our bar chart reports) and display the first and last ranked example.
Note that the last example does not explicitly mention Infrastructure but instead matches on terms such as CapEx and data centers.
Douglas Anmuth: Thanks for taking the questions. One for Mark, one for Susan. Mark, just following up on open source as DeepSeek and other models potentially leverage Llama or others to train faster and cheaper. How does this impact in your view? And what could have been for the trajectory of investment required over a multiyear period? And then, Susan, just as we think about the 60 billion to 65 billion CapEx this year, does the composition change much from last year when you talked about servers as the largest part followed by datacenters and networking equipment. And how should we think about that mix between like training and inference just following up on Jan’s post this week? Thanks.
…
Result 4/4
META 2025Q1
Zuckerberg on Business Strategies: 34%
AI Model Scaling: 31%
Capital Expenditure Trends: 6%
AI Capital Investment Strategy: 6%
AWS Capital Investments: 5%
But overall, I would reiterate what Mark said. We are committed to building leading foundation models and applications. We expect that we’re going to make big investments to support our training and inference objectives, and we don’t know exactly where we are in the cycle of that yet.
Zoom Out
At any point, we can also zoom back out to the high-level topic view. Instead of focusing on the AI Model Scaling topic, we can zoom out and see everything Meta discussed about infrastructure during its 2025Q1 earnings call.