Kanye West News

Author

Kian Ghodoussi

Published

March 12, 2025

In this notebook, we recreate Sturdy Statistics’ DeepDive page using the sturdy-stats-sdk, reproducing the deep dive analysis on every news article that discusses Kanye West.

# pip install sturdy-stats-sdk pandas numpy plotly 
from IPython.display import display, Markdown, Latex
import pandas as pd
import numpy as np
import plotly.express as px
from sturdystats import Index, Job

from pprint import pprint
## Basic Utilities
px.defaults.template = "simple_white"  # Change the template
px.defaults.color_discrete_sequence = px.colors.qualitative.Dark24 # Change color sequence

def procFig(fig, **kwargs):
    fig.update_layout(plot_bgcolor= "rgba(0, 0, 0, 0)", paper_bgcolor= "rgba(0, 0, 0, 0)",
        margin=dict(l=0,r=0,b=0,t=30,pad=0),
        title_x=.5,
        **kwargs
    )
    fig.layout.xaxis.fixedrange = True
    fig.layout.yaxis.fixedrange = True
    return fig

def displayText(df, highlight):
    def processText(row):
        t = "\n".join([ f'1. {r["short_title"]}: {int(r["prevalence"]*100)}%' for r in row["paragraph_topics"][:5] ])
        x = row["text"]
        res = []
        for word in x.split(" "):
            for term in highlight:
                if term in word.lower() and "**" not in word:
                    word = "**"+word+"**"
            res.append(word)
        return f"<em>\n\n#### Result {row.name+1}/{df.index.max()+1}\n\n##### {row['published']}\n\n"+ t +"\n\n" + " ".join(res) + "</em>"

    res = df.apply(processText, axis=1).tolist()       
    display(Markdown(f"\n\n...\n\n".join(res)))

1. [Optional] Train Your Own Model

Sturdy Statistics provides a news integration. Below we query the news_date_split integration for all articles that mention Kanye West.

Training a model on our news integration takes anywhere from 5 to 10 minutes. This step is optional: you can instead proceed with our public Kanye West news index.

index = Index(id="index_9e217662c7184573beef6406a1010a19")

# Uncomment the line below to create and train your own index
# index = Index(name="news_kanye_west") 

if index.get_status()["state"] == "untrained":
    index.ingestIntegration("news_date_split", "kanye west", )
    job = index.train(dict(subdoc_hierarchy=False), fast=True, wait=False)
    print(job.get_status())
    # job.wait() # Sleeps until job finishes
Found an existing index with id="index_9e217662c7184573beef6406a1010a19".

2. Core Visualizations

In this section, we demonstrate how to produce the two core visualizations in their simplest form: the sunburst and the time trend plot.

index = Index(id="index_9e217662c7184573beef6406a1010a19")
Found an existing index with id="index_9e217662c7184573beef6406a1010a19".

Sunburst

Our Bayesian probabilistic model learns a set of high-level topics from your corpus. These topics are completely custom to your data, whether your dataset has hundreds of documents or billions. The model then maps this set of learned topics to every single word, sentence, paragraph, document, and group of documents in your dataset, providing a powerful semantic index.

This indexing lets us store data in a granular, structured, tabular format, which in turn enables rapid analysis of complex questions.

Topic Query

df = index.topicSearch()
df.head(5)[["short_title", "topic_group_short_title", "topic_id", "mentions", "prevalence"]]
   short_title                  topic_group_short_title            topic_id  mentions  prevalence
0  Kanye's Creative Process     Creative Works and Collaborations  463       3671.0    0.031586
1  Expression and Identity      Creative Ventures                  425       2943.0    0.026436
2  Raw Street Communication     Art and Expression                 72        2043.0    0.019585
3  Cultural Influence in Music  Music and Culture                  144       1772.0    0.018413
4  Kanye and Kim's Separation   Celebrity Feuds and Relationships  291       1795.0    0.016874

Visualization

We can see there are two names: short_title and topic_group_short_title. The topic group is a high-level thematic category, while a topic is a much more granular annotation.

A dataset can have hundreds of topics, but usually only 20-50 topic groups. This hierarchy is extremely useful for organizing and exploring data in hierarchical formats such as sunbursts.

The inner circle of the sunburst is the title of the plot. The middle layer is the topic groups. The leaf nodes are the topics that belong to the corresponding topic group. The size of each node is proportional to how often it shows up in the dataset.

df["title"] = "Kanye West <br> News Publications"
fig = px.sunburst(
    df,
    path=["title", "topic_group_short_title", "short_title"], 
    values="prevalence", hover_data={"topic_id": True}
)
fig = procFig(fig, height=500)
fig.show()
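
The sizing rule described above can be checked by hand. The sketch below is plain Python rather than the SDK: it takes the five rows from the topicSearch output (with the title and group columns split as we read them from that output) and sums topic prevalence within each topic group, which is the parent-level aggregation px.sunburst performs when given a path. The helper name group_sizes is hypothetical.

```python
from collections import defaultdict

# The five rows shown in the topicSearch output:
# (short_title, topic_group_short_title, prevalence)
rows = [
    ("Kanye's Creative Process", "Creative Works and Collaborations", 0.031586),
    ("Expression and Identity", "Creative Ventures", 0.026436),
    ("Raw Street Communication", "Art and Expression", 0.019585),
    ("Cultural Influence in Music", "Music and Culture", 0.018413),
    ("Kanye and Kim's Separation", "Celebrity Feuds and Relationships", 0.016874),
]

def group_sizes(rows):
    """Sum topic prevalence within each topic group (hypothetical helper)."""
    totals = defaultdict(float)
    for _topic, group, prevalence in rows:
        totals[group] += prevalence
    return dict(totals)

sizes = group_sizes(rows)
root_size = sum(sizes.values())  # size of the inner "title" node

# Print each group's size and its share of the plotted total, largest first.
for group, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{group}: {size:.4f} ({size / root_size:.0%} of the plotted total)")
```

With only the first five rows, every topic sits in its own group, so each group is exactly as large as its single topic. On the full result, groups with many topics accumulate larger wedges.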

Semantically Structured SQL

Sturdy Statistics embeds all of its semantic information into a tabular format and exposes that format directly through the queryMeta API.

In fact, all of our topic APIs query the same tabular data structures that the queryMeta API exposes.
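
To make the idea of semantically structured tabular data concrete, here is a toy sketch using Python's built-in sqlite3 module. The table name, columns, and rows are all invented for illustration; they are not Sturdy Statistics' actual schema, and the query does not go through queryMeta. It only shows the kind of question that becomes a single SQL statement once topic annotations live in rows.

```python
import sqlite3

# Hypothetical schema: one row per (paragraph, topic) annotation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paragraph_topics ("
    "doc_id TEXT, published TEXT, short_title TEXT, prevalence REAL)"
)
conn.executemany(
    "INSERT INTO paragraph_topics VALUES (?, ?, ?, ?)",
    [
        # Invented example rows, not real data from the index.
        ("a", "2021-12-18", "Ballot Filing Issues", 0.13),
        ("b", "2021-12-17", "Ballot Filing Issues", 0.22),
        ("c", "2022-01-05", "Ballot Filing Issues", 0.40),
        ("c", "2022-01-05", "Campaign Finance Violations", 0.10),
    ],
)

# Count annotated paragraphs per year for one topic.
query = """
    SELECT strftime('%Y', published) AS year, COUNT(*) AS n
    FROM paragraph_topics
    WHERE short_title = ?
    GROUP BY year
    ORDER BY year
"""
counts = conn.execute(query, ("Ballot Filing Issues",)).fetchall()
print(counts)
```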

Even More Granular Extraction

Every high level visualization or rollup can be instantly tied back to the original data, no matter how granular or complex.

Let’s say we want to pull out all news excerpts that match the search presidential campaign, mention the topic Ballot Filing Issues, and were published in 2021. That is a simple API call:

docdf = index.query(SEARCH_QUERY, topic_id=topic_id, semantic_search_cutoff=.2, limit=100, filters="year(published::DATE) = 2021")
print("Search:", SEARCH_QUERY, "Topic:", row.short_title.iloc[0])
displayText(docdf.iloc[[0, -1]], highlight=["duckdb", SEARCH_QUERY, "faster", "complex", "sql", "aggregate", "relational", "api"])
Search: presidential campaign Topic: Ballot Filing Issues

Result 1/5

2021-12-18
  1. Kanye West’s Presidential Bid: 65%
  2. Ballot Filing Issues: 13%
  3. Campaign Finance Violations: 10%
  4. Kanye West Campaign Controversy: 8%

Thanks to the high-level Republicans apparently steering West’s presidential campaign, the intended effect, whether the rapper knew it or not, may have been to serve as a spoiler to drain votes from Democratic nominee Joe Biden to then-President Donald Trump. In 2020 West sued the state of Wisconsin to appear on the 2020 ballot using a Trump-linked law firm – the elections commission later voted to keep him off.

Result 5/5

2021-12-17
  1. Ballot Filing Issues: 22%
  2. Campaign Finance Violations: 22%
  3. Mental Health and Accountability: 17%
  4. Kardashian-West Relationship: 17%
  5. Kanye West Campaign Controversy: 15%

Colorado-based attorney Mario Nicolais, who had scrutinized the West campaign’s ballot petition activity in Wisconsin last August, told The Daily Beast that the GOP’s targeting West—who, according to his then-wife Kim Kardashian, had been contending with mental health struggles—was “about as bottom-of-the-barrel moral turpitude as you can be, in my opinion. Just sleazy, low-rent cashing in.”

NB: the counts from our SQL query match the number of documents returned by the search.

assert df.loc[(df["month"] >= "2021") & (df["month"] < "2022") ].comments.sum() == len(docdf)
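
The check above compares month labels as strings. Assuming the month column holds ISO-style strings such as "2021-12" (the column format is not shown in this notebook), lexicographic comparison against "2021" and "2022" selects exactly the 2021 rows. A small sketch with invented counts:

```python
# Invented (month, count) pairs; the real values come from the SQL query.
monthly = [("2020-11", 3), ("2021-01", 2), ("2021-12", 3), ("2022-01", 4)]

# ISO-style strings sort chronologically, so a plain string range works:
# "2021" <= "2021-01" and "2021-12" < "2022".
in_2021 = [(m, n) for m, n in monthly if "2021" <= m < "2022"]
total_2021 = sum(n for _, n in in_2021)

print(in_2021)
print(total_2021)
```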

Unlock Your Unstructured Data Today

from sturdystats import Index

index = Index("Custom Analysis")
index.upload(df.to_dict("records"))
index.commit()
index.train()

# Ready to Explore 
index.topicSearch()
