Tutorial

Introduction

A robust ecosystem of tools—like spreadsheets, regression analyses, and SQL—makes structured data easy to work with. But they fall short when it comes to critical unstructured data—support tickets, customer conversations, financial disclosures, product reviews, and more.

The rise of Large Language Models (LLMs) has enabled a wave of prompt-driven tools for unstructured data analysis and retrieval-augmented generation (RAG). But these tools often lack the structure, extensibility, and statistical rigor of traditional tabular methods, making them hard to audit or integrate into established workflows.

Sturdy Statistics bridges that gap, transforming unstructured text into structured, interpretable data using models that are transparent, verifiable, and robust. You don’t need to write prompts, tune embeddings, or trust a black box. Every insight can be inspected, audited, and traced back to specific passages in your data.


Sturdy Statistics’ automatic structure enables

  • Data Scientists to apply traditional statistical models on unstructured data
  • Engineers to build robust NLP workflows
  • Analysts to analyze granular natural language data with SQL

All with confidence in how the outputs were generated and with the ability to easily verify every datapoint.

In the following walkthrough, we introduce Sturdy Statistics’ ability to reveal structured insights from unstructured data, not with RAG or LLM black boxes but with rigorous statistical analysis that leverages traditional tabular data structures. We will analyze the past two years of earnings calls from Google, Microsoft, Amazon, NVIDIA, and Meta.

Resources

To follow along:

For a deeper dive, explore:

Structured Exploration

The core building block in the Sturdy Statistics NLP toolkit is the Index. Each Index is a set of documents and metadata that has been structured, or “indexed,” by our hierarchical Bayesian probability mixture model. Below, we connect to an Index that has already been trained by our earnings transcripts integration.

# Assumes the Sturdy Statistics Python SDK is installed; the exact
# import path may differ by SDK version.
from sturdystats import Index

index = Index(id="index_c6394fde5e0a46d1a40fb6ddd549072e")
# Output: Found an existing index with id="index_c6394fde5e0a46d1a40fb6ddd549072e".

You can see that we have inferred meaningful topics with interpretable names, and that each document can be understood according to its thematic content. Unlike LLM embeddings, each column in this representation corresponds to a meaningful, immediately interpretable concept, and most of the columns are zero. These properties greatly enhance the utility of our representation.
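As a toy illustration (the weights below are made up; the topic names are real ones from this dataset), a document’s representation is a small set of named, mostly-zero columns:

import pandas as pd

# Hypothetical document-topic weights: each column is a named, interpretable
# topic, and most entries are exactly zero (a sparse representation). A dense
# LLM embedding, by contrast, has hundreds of dimensions with no individual
# meaning.
doc_topics = pd.DataFrame(
    [
        {"Accelerated Computing Systems": 0.61, "Cloud Performance Metrics": 0.27},
        {"Consumer Behavior Insights": 0.45, "Zuckerberg on Business Strategies": 0.30},
    ]
).fillna(0.0)
print(doc_topics)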


Our index maps this set of learned topics not only to each document, but also to every single word, sentence, paragraph, and group of documents in your dataset, providing a powerful semantic index. We then expose the data through a set of standardized SQL APIs that integrate with existing structured data analysis toolkits.
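As a rough sketch of what that looks like in practice: the method name queryMeta and the table and column names below are illustrative assumptions, not the documented interface, so consult the SQL API reference for the exact schema.

# Illustrative only: assumes a queryMeta-style method that accepts SQL and a
# paragraph-level table carrying topic annotations.
mentions_df = index.queryMeta("""
    SELECT short_title, COUNT(*) AS paragraph_mentions
    FROM paragraph
    GROUP BY short_title
    ORDER BY paragraph_mentions DESC
    LIMIT 10
""")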

Topic Search

The first API we will explore is the topicSearch API, which provides a direct interface to the high-level themes that our index extracts. You can call it with no arguments to get a list of topics ordered by how often they occur in the dataset (prevalence). The result is a structured rollup of all the data in the corpus: it aggregates the topic annotations across every word, paragraph, and document and generates high-level semantic statistics.

The fields returned by this API include topic_id, a unique topic identifier; short_title, a title assigned to the topic; mentions, the number of paragraphs in the dataset in which that topic is present; and prevalence, the percentage of the entire corpus that has been assigned to that topic.

df = index.topicSearch()
df.head()[["topic_id", "short_title", "topic_group_short_title", "mentions", "prevalence"]]
   topic_id                        short_title     topic_group_short_title  mentions  prevalence
0       159      Accelerated Computing Systems  Technological Developments     359.0    0.042775
1       139         Consumer Behavior Insights           Growth Strategies     585.0    0.033129
2       108          Cloud Performance Metrics   Investment and Financials     157.0    0.026985
3       115  Zuckerberg on Business Strategies          Corporate Strategy     420.0    0.026971
4       127   Comprehensive Security Solutions   Investment and Financials     146.0    0.023265

Sunburst

This structured information already reveals a great deal about the dataset: we know what was talked about and how much it was discussed. Note the field we haven’t yet mentioned: topic_group_short_title. While our topics tend to be extremely granular, our model also learns a structure on top of these topics that we designate a topic_group. A dataset can have anywhere from 100 to 500 topics, but will typically have only 20 to 50 topic groups. This hierarchy is extremely useful for organizing and exploring data in hierarchical formats such as sunbursts.
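Because topicSearch returns both levels of the hierarchy, we can roll topics up to their topic groups with ordinary pandas operations. A quick sketch using the df returned above:

# Aggregate topic-level rows into topic groups: total prevalence per group and
# the number of distinct topics each group contains.
group_df = (
    df.groupby("topic_group_short_title")
      .agg(prevalence=("prevalence", "sum"), n_topics=("topic_id", "nunique"))
      .sort_values("prevalence", ascending=False)
)
print(group_df.head())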

We visualize this thematic data in the Sunburst visualization below. The inner circle of the sunburst is the title of the plot, the middle layer is the topic groups, and the leaf nodes are the topics that belong to the corresponding topic group. The size of each node is proportional to how often it appears in the dataset.

import plotly.express as px

# Re-fetch the topic rollup and add a constant root node for the sunburst.
topic_df = index.topicSearch()
topic_df["title"] = "Tech <br> Earnings Calls"
fig = px.sunburst(
    topic_df,
    path=["title", "topic_group_short_title", "short_title"],
    values="prevalence",
    hover_data=["topic_id", "mentions"],
)
# procFig is a figure-styling helper assumed to be defined in this
# tutorial's setup code.
procFig(fig, height=500).show()