Radford Neal’s Publications over Time

Academic · Intermediate · Integration

Author: Kian Ghodoussi

Published: March 15, 2025

Prerequisites

pip install sturdy-stats-sdk pandas numpy plotly

from IPython.display import display, Markdown, Latex
import pandas as pd
import numpy as np
import plotly.express as px
from sturdystats import Index, Job

from pprint import pprint
## Basic Utilities
px.defaults.template = "simple_white"  # Change the template
px.defaults.color_discrete_sequence = px.colors.qualitative.Dark24 # Change color sequence

def procFig(fig, **kwargs):
    """Apply shared styling: transparent background, tight margins, static axes."""
    fig.update_layout(
        plot_bgcolor="rgba(0, 0, 0, 0)",
        paper_bgcolor="rgba(0, 0, 0, 0)",
        margin=dict(l=0, r=0, b=0, t=30, pad=0),
        **kwargs,
    )
    fig.layout.xaxis.fixedrange = True  # disable zoom/pan on both axes
    fig.layout.yaxis.fixedrange = True
    return fig

def displayText(df, highlight):
    """Render search results as Markdown, bolding any word that contains a highlight term."""
    def processText(row):
        # List the top five paragraph topics with their prevalence as a percentage
        t = "\n".join([ f'1. {r["short_title"]}: {int(r["prevalence"]*100)}%' for r in row["paragraph_topics"][:5] ])
        res = []
        for word in row["text"].split(" "):
            for term in highlight:
                if term in word.lower() and "**" not in word:
                    word = "**" + word + "**"
            res.append(word)
        return (f"<em>\n\n#### Result {row.name+1}/{df.index.max()+1}\n\n"
                f"##### {row['ticker']} {row['pub_quarter']}\n\n" + t + "\n\n" + " ".join(res) + "</em>")

    res = df.apply(processText, axis=1).tolist()
    display(Markdown("\n\n...\n\n".join(res)))
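displayText is not invoked below, but a hypothetical call on a hand-built row shows the shape it expects; all values here are illustrative, and ticker and pub_quarter are generic metadata slots rather than fields specific to this dataset:

# Hypothetical usage of displayText on a single hand-built result row.
demo = pd.DataFrame([{
    "text": "Slice sampling adapts the sampler to the local shape of the distribution.",
    "paragraph_topics": [{"short_title": "Adaptive Slice Sampling", "prevalence": 0.62}],
    "ticker": "Neal",
    "pub_quarter": "2003",
}])
displayText(demo, highlight=["slice"])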

[Optional] Train Your Own Index

index = Index(id="index_4fd50a6fa7444cbcb672db636d811f4b")

# Uncomment the line below to create and train your own index
# index = Index(name="Radford_Neal_Publications")

if index.get_status()["state"] == "untrained":
    # https://www.semanticscholar.org/author/Radford-M.-Neal/1764325
    author_id = "1764325"
    index.ingestIntegration("author_cn", author_id)
    job = index.train(dict(burn_in=1200, subdoc_hierarchy=False), fast=True)
    print(job.get_status())
    # job.wait() # Sleeps until the job finishes
Found an existing index with id="index_4fd50a6fa7444cbcb672db636d811f4b".

Explore Topics

Our Bayesian probabilistic model learns a set of high-level topics from your corpus. These topics are completely custom to your data, whether your dataset has hundreds of documents or billions. The model then maps this set of learned topics to every single word, sentence, paragraph, document, and group of documents in your dataset, providing a powerful semantic index.

This indexing enables us to store data in a granular, structured tabular format. This structured format in turn enables rapid analysis of complex questions.
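Because the index is tabular, it can be queried directly. As a first look at that structure, here is a minimal sketch that peeks at the per-document table; it assumes only the doc table and published column that are queried later in this notebook:

# Peek at the per-document table backing the index (sketch: uses only
# the `doc` table and `published` column referenced later on).
index.queryMeta("SELECT published FROM doc ORDER BY published LIMIT 5")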

index = Index(id="index_4fd50a6fa7444cbcb672db636d811f4b")
df = index.topicSearch()
df.head()
Found an existing index with id="index_4fd50a6fa7444cbcb672db636d811f4b".
| | short_title | topic_id | mentions | prevalence | one_sentence_summary | executive_paragraph_summary | topic_group_id | topic_group_short_title | conc | entropy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bayesian Machine Learning Models | 82 | 26.0 | 0.104794 | The theme encompasses various methodologies in... | This theme explores the application of Bayesia... | 2 | Machine Learning | 26.669127 | 7.259441 |
| 1 | Gaussian Process Regression | 26 | 19.0 | 0.095460 | The theme explores the flexibility and applica... | Gaussian Process (GP) regression models are a ... | 1 | Statistical Methods | 35.986237 | 7.213645 |
| 2 | Adaptive Slice Sampling | 83 | 14.0 | 0.071136 | The documents discuss methods for adaptive sli... | The theme revolves around adaptive slice sampl... | 0 | Sampling Techniques | 130.408447 | 6.885675 |
| 3 | Exact Summation Methods | 59 | 11.0 | 0.069014 | The discussed methods focus on achieving high ... | The provided examples illustrate advanced meth... | 1 | Statistical Methods | 17.029112 | 7.432269 |
| 4 | Asymptotic Variance in MCMC | 80 | 8.0 | 0.066469 | This theme explores methods to reduce asymptot... | The examined theme highlights various techniqu... | 1 | Statistical Methods | 20.888794 | 7.365510 |
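The one_sentence_summary and executive_paragraph_summary columns are truncated in the preview above, but the full text is in the dataframe; plain pandas pulls it out:

# Read the full executive summary for a single topic out of the
# topicSearch dataframe (standard pandas; no extra API calls).
summary = df.loc[df.short_title == "Gaussian Process Regression",
                 "executive_paragraph_summary"].iloc[0]
print(summary)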

Treemap Visualization

The following treemap visualizes the topics hierarchically, grouping the topics by their high-level topic group. The size of each topic is proportional to the percentage of the time that topic shows up in Radford Neal’s publications.

fig = px.treemap(df, path=["topic_group_short_title", "short_title"], values="prevalence", hover_data=["topic_id"])
procFig(fig, height=500).show()

Integrating Topics with Metadata

Let’s say we are interested in learning more about the years during which Radford Neal published papers on Adaptive Slice Sampling. The topic information has been converted into a tabular format that we can query directly via SQL. We expose the tables via the queryMeta API. If we choose to, we can do all of our semantic analysis directly in SQL.

row = df.loc[df.short_title == "Adaptive Slice Sampling"]
row
| | short_title | topic_id | mentions | prevalence | one_sentence_summary | executive_paragraph_summary | topic_group_id | topic_group_short_title | conc | entropy |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Adaptive Slice Sampling | 83 | 14.0 | 0.071136 | The documents discuss methods for adaptive sli... | The theme revolves around adaptive slice sampl... | 0 | Sampling Techniques | 130.408447 | 6.885675 |
row = row.iloc[0]

df = index.queryMeta(f"""
SELECT 
    year(published::DATE) as year, 
    count(*) as publications
FROM doc
-- +1 because topic_id is 0-indexed while the sparse count lists appear to be 1-indexed;
-- keep documents where this topic's count exceeds 2
WHERE sparse_list_extract({row.topic_id+1}, sum_topic_counts_inds, sum_topic_counts_vals) > 2.0
GROUP BY year
ORDER BY year
""")
fig = px.bar(df, x="year", y="publications", title=f"'{row.short_title}' Publications over Time",)
procFig(fig, title_x=.5)
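The doc table accepts arbitrary SQL, so you can sanity-check a filter before charting it. A minimal sketch that reuses only the columns already queried above, dropping the threshold to 0.0 to count every document that mentions the topic at all:

# Count all documents with any mention of the topic (threshold 0.0
# instead of the stricter 2.0 used for the chart above).
total = index.queryMeta(f"""
SELECT count(*) AS publications
FROM doc
WHERE sparse_list_extract({row.topic_id+1}, sum_topic_counts_inds, sum_topic_counts_vals) > 0.0
""")
print(total)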

Multi Topic Analysis

While it is possible to reconstruct our APIs from scratch in SQL, the topicSearch API is extremely helpful for simple multi-topic analysis. Just as you can do semantic analysis with SQL, you can also pass SQL filters to our topic APIs.

Below we use a simple for loop to query the topical content of each five-year period of Radford Neal’s career, pulling out how his research focus shifted over time.

SEARCH_QUERY=""
dfs = []
for year in index.queryMeta("SELECT distinct ( (year(published::DATE)//5)*5 ) as year  FROM doc").year.dropna():
    tmp = index.topicSearch(SEARCH_QUERY, f"(year(published::DATE)::INT//5)*5 = {int(year)}").head(30)
    tmp["year"] = int(year)
    dfs.append(tmp)

df = pd.concat(dfs).rename(columns=dict(mentions="publications"))
df.sample(5)[["short_title", "topic_id", "publications", "year"]]
| | short_title | topic_id | publications | year |
|---|---|---|---|---|
| 26 | Factorial Design Theory | 84 | 0.0 | 2005 |
| 11 | Mixture Model Techniques | 76 | 1.0 | 1995 |
| 2 | Adaptive Slice Sampling | 83 | 3.0 | 2000 |
| 20 | Anthropic Reasoning Challenges | 61 | 0.0 | 1990 |
| 1 | Embedded HMM Methods | 63 | 1.0 | 2015 |
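Because the stacked results are plain tabular data, they are easy to reshape. For example, a pandas sketch that pivots the dataframe built above into a year × topic matrix:

# Reshape the stacked per-period results into a year x topic matrix.
pivot = df.pivot_table(index="year", columns="short_title",
                       values="publications", aggfunc="sum").fillna(0)
pivot.head()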

Research Bar Plots

Below we visualize all of Radford Neal’s research topics broken down by time.

import duckdb
fig = px.bar(
    duckdb.sql("SELECT * FROM df ORDER BY year asc, publications desc").to_df(), 
    x="year", 
    y="publications", 
    color="short_title", 
    title=f"Radford Neal Publications over Time",
)
procFig(fig, title_x=.5, height=500)

Normalized

We can visualize the raw counts (mentions), or we can instead visualize the prevalence field, which is the percentage of the total corpus that each topic makes up. Because we passed a filter into each topic search query, the prevalence is normalized within each five-year interval.

import duckdb
fig = px.bar(
    duckdb.sql("SELECT * FROM df ORDER BY year asc, prevalence desc").to_df(), 
    x="year", 
    y="prevalence", 
    color="short_title", 
    title=f"Radford Neal Publications over Time",
)
procFig(fig, title_x=.5, height=500)
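Since each topicSearch call was filtered to a single interval, the prevalence values within each five-year bucket should sum to roughly one (slightly less here, as we kept only the top 30 topics per period). A quick check with duckdb:

# Prevalence is normalized within each five-year bucket, so each
# year's total should be close to (at most) 1.
duckdb.sql("SELECT year, sum(prevalence) AS total_prevalence FROM df GROUP BY year ORDER BY year").to_df()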

Unlock Your Unstructured Data Today

from sturdystats import Index

index = Index("Custom Analysis")     # create (or connect to) your own index
index.upload(df.to_dict("records"))  # upload your records
index.commit()                       # commit the staged documents
index.train()                        # train a custom topic model

# Ready to Explore 
index.topicSearch()

More Examples