Transformer Architecture Structured Citations

Academic
Advanced
Integration
Author

Kian Ghodoussi

Published

March 6, 2025

Prerequisites

pip install sturdy-stats-sdk pandas numpy plotly duckdb

from IPython.display import display, Markdown, Latex
import pandas as pd
import numpy as np
import plotly.express as px
from sturdystats import Index, Job

from pprint import pprint
## Basic Utilities
px.defaults.template = "simple_white"  # Clean white plotly template
px.defaults.color_discrete_sequence = px.colors.qualitative.Dark24  # Qualitative color palette

def procFig(fig, **kwargs):
    """Style a plotly figure with a transparent background and tight margins."""
    fig.update_layout(
        plot_bgcolor="rgba(0, 0, 0, 0)", paper_bgcolor="rgba(0, 0, 0, 0)",
        margin=dict(l=0, r=0, b=0, t=30, pad=0),
        **kwargs
    )
    # Disable zoom/pan so embedded figures stay fixed
    fig.layout.xaxis.fixedrange = True
    fig.layout.yaxis.fixedrange = True
    return fig

[Optional] Train Your Own Index

index = Index(id="index_6095a26fc2be4674a005778dd8bcd5e5")

# Uncomment the line below to create and train your own index
# index = Index(name="transformer_architecture")

if index.get_status()["state"] == "untrained":
    index.ingestIntegration("academic_search", "transformer architecture")
    job = index.train(dict(burn_in=1200, subdoc_hierarchy=False), fast=True)
    print(job.get_status())
    # job.wait() # Sleeps until job finishes
Found an existing index with id="index_6095a26fc2be4674a005778dd8bcd5e5".

Explore Topics

Our Bayesian probabilistic model learns a set of high-level topics from your corpus. These topics are completely custom to your data, whether your dataset has hundreds of documents or billions. The model then maps this set of learned topics to every single word, sentence, paragraph, document, and group of documents in your dataset, providing a powerful semantic index.

This indexing enables us to store the data in a granular, structured tabular format. That structured format, in turn, enables rapid analysis of complex questions.

index = Index(id="index_6095a26fc2be4674a005778dd8bcd5e5")
topic_df = index.topicSearch()
topic_df.head()
Found an existing index with id="index_6095a26fc2be4674a005778dd8bcd5e5".
|   | short_title | topic_id | mentions | prevalence | one_sentence_summary | executive_paragraph_summary | topic_group_id | topic_group_short_title | conc | entropy |
|---|-------------|----------|----------|------------|----------------------|-----------------------------|----------------|-------------------------|------|---------|
| 0 | Transformers in Vision Tasks | 24 | 318.0 | 0.118941 | The integration of transformer architectures w... | Recent advancements in image processing have h... | 0 | Transformers and Architectures | 72.880173 | 6.118217 |
| 1 | Advancements in Image Classification Architect... | 58 | 180.0 | 0.059627 | Recent developments in image classification ha... | The evolving landscape of image classification... | 7 | Trends and Developments | 36.206444 | 6.273045 |
| 2 | Transformer-based Object Detection | 80 | 144.0 | 0.041205 | This theme focuses on leveraging transformer a... | The theme centers on the application of transf... | 6 | Applications and Technologies | 13.590480 | 6.414909 |
| 3 | Transformer Advancements | 15 | 148.0 | 0.040112 | The rapid advancement of Transformer models is... | Recent advancements in Transformer models, ori... | 5 | Transformer Innovations | 43.570381 | 6.714420 |
| 4 | Advancements in NLP Models | 65 | 146.0 | 0.037816 | The development and adaptation of transformer-... | Recent advancements in natural language proces... | 4 | Machine Learning Techniques | 13.264362 | 6.135605 |
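
Each row above is a learned topic. For instance, to inspect the full summaries of the most prevalent topic (a quick pandas sketch over the columns shown above):

# Pull the most prevalent topic and print its summaries
top = topic_df.sort_values("prevalence", ascending=False).iloc[0]
print(top.short_title)
print(top.one_sentence_summary)
print(top.executive_paragraph_summary)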

Treemap Visualization

The following treemap visualizes the topics hierarchically, grouping them by their high-level topic group. The size of each topic is proportional to the percentage of the time that topic shows up within papers about Transformer Architectures.

fig = px.treemap(topic_df, path=["topic_group_short_title", "short_title"], values="prevalence", hover_data=["topic_id"])
procFig(fig, height=500).show()

Integrating Topics with Metadata

Selecting a Topic

Let’s say we are interested in learning more about the years during which papers on Graph Transformers were published. The topic information has been converted into a tabular format that we can query directly via SQL. We expose the tables via the queryMeta API. If we choose to, we can do all of our semantic analysis directly in SQL.
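
As a minimal illustration of the API (a simple sanity check against the same doc table queried later in this notebook):

# Count the documents stored in the index's `doc` table
index.queryMeta("SELECT count(*) AS n_docs FROM doc")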

row = topic_df.loc[topic_df.short_title == "Graph Transformers"]
row
|    | short_title | topic_id | mentions | prevalence | one_sentence_summary | executive_paragraph_summary | topic_group_id | topic_group_short_title | conc | entropy |
|----|-------------|----------|----------|------------|----------------------|-----------------------------|----------------|-------------------------|------|---------|
| 13 | Graph Transformers | 68 | 51.0 | 0.01789 | The theme explores the advancements and challe... | Recent developments in leveraging Transformer ... | 4 | Machine Learning Techniques | 25.215822 | 6.30969 |

Imbuing more Metadata

Because the semantic annotations are stored side by side with the metadata, we can further enrich our visualization and insights. We also have access to the citationCount of each publication.

df = index.queryMeta(f"""
SELECT 
    year(published::DATE) as year, 
    count(*) as publications,
    median(citationCount) as median_citationCount
FROM doc
WHERE sparse_list_extract({row.iloc[0].topic_id+1}, sum_topic_counts_inds, sum_topic_counts_vals) > 2.0 -- sparse topic arrays are 1-indexed, hence the +1
GROUP BY year
ORDER BY year
""")
fig = px.bar(df, x="year", y="publications", 
             color="median_citationCount", color_continuous_scale="blues",
             title=f"'{row.iloc[0].short_title}' Publications over Time", )
procFig(fig, title_x=.5)

Semantic Citation Analysis

So far we have only looked at one topic at a time. However, we can perform much more complex analyses. Our topics are stored in a tabular format alongside the metadata. This unified data storage enriches the metadata with semantic information and enriches the semantic information with structured context. As a result, we can perform complex semantic analyses with simple, structured SQL queries.

Citations by Topic

Below, we run a SQL query to load the number of citations papers receive according to the sets of topics they belong to. The field theta is a sparse array broken up into theta_inds and theta_vals: theta_inds designates the list of topic_ids that appear in a document, and theta_vals designates, for each of those topics, what percentage of the document it comprises.

For each document, we unpack its topics. We then split its citations across topics in proportion to the percentage of the document each topic comprises. Finally, we aggregate by topic id. This returns a list of topics and the median citation count associated with each topic.
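
To make the proportional split concrete, here is a toy, standalone DuckDB sketch with hypothetical numbers: a paper with 100 citations that is 70% topic 3 and 30% topic 7 contributes 70 weighted citations to the first topic and 30 to the second.

import duckdb

# Hypothetical single paper: 100 citations, 70% topic 3, 30% topic 7
duckdb.sql("""
WITH paper AS (
    SELECT [3, 7] AS theta_inds, [0.7, 0.3] AS theta_vals, 100 AS citationCount
)
SELECT
    unnest(theta_inds) - 1             AS topic_id,           -- theta_inds are 1-indexed
    unnest(theta_vals)                 AS topic_pct,
    unnest(theta_vals) * citationCount AS weighted_citations  -- this topic's share
FROM paper
""").show()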

topicCitationsQuery = f"""
WITH t1 AS (
    SELECT 
        unnest(theta_inds)-1 as topic_id, -- 1 indexed
        unnest(theta_vals) as topic_pct,
        unnest(theta_vals)*citationCount as citationCount, -- split citations proportional to topic share
        year(published::DATE) as year
    FROM doc
)
SELECT 
    topic_id,
    count(*) as publications,
    median(citationCount) as median_citationCount
FROM t1
WHERE topic_pct > .2
GROUP BY topic_id
ORDER BY median_citationCount desc
"""

topicCitations = index.queryMeta(topicCitationsQuery)
topicCitations.head(5)
|   | topic_id | publications | median_citationCount |
|---|----------|--------------|----------------------|
| 0 | 70 | 4 | 2192.0 |
| 1 | 59 | 2 | 1820.0 |
| 2 | 54 | 2 | 591.0 |
| 3 | 3 | 13 | 521.0 |
| 4 | 35 | 4 | 393.5 |
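
To see which topics these are, we can attach the human-readable titles from the earlier topic_df with a quick pandas merge:

# Join short titles onto the top-cited topics
topicCitations.head(5).merge(topic_df[["topic_id", "short_title"]], on="topic_id")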

Visualization

We can now return to our earlier treemap visualization and imbue it with new information. Instead of assigning a color to each topic group, we can use a continuous color scale to designate the median number of citations each topic received in our corpus. We convert the citation counts to log scale to suit plotly's built-in linear continuous color scale.

The richer the blue, the more citations a topic tends to receive; the richer the red, the fewer.
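
As a quick check of the clipping range used below: with the base-10 log, 10 citations map to roughly 1 and 1,000 citations map to roughly 3.

import numpy as np

# log10(10 + 1) ≈ 1.04 and log10(1000 + 1) ≈ 3.0004,
# so clipping to [1, 3] spans the typical 10-1000 citation range
print(np.log10(10 + 1), np.log10(1000 + 1))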

import duckdb

def buildDF(topicCitations):
    topic_df = index.topicSearch()
    df = duckdb.sql("""SELECT
        topic_df.*,
        topicCitations.publications,
        log(topicCitations.median_citationCount+1) as "log Median Citation Count"
    FROM topic_df
    INNER JOIN topicCitations 
    ON topic_df.topic_id=topicCitations.topic_id
    """).to_df()
    
    ## For a nicer color scale: most papers fall between 10 and 1000
    ## citations, i.e. log10 values of roughly 1 to 3
    df["log Median Citation Count"] = df["log Median Citation Count"].clip(1, 3)
    return df

df = buildDF(topicCitations)
df["title"] = "Transformer Architectures Topics"
fig = px.treemap(df, path=["title", "topic_group_short_title", "short_title"], 
                 values="publications", hover_data=["topic_id"], 
                 color="log Median Citation Count", color_continuous_scale="rdbu")
fig = procFig(fig, height=500)
fig

Unlock Your Unstructured Data Today

from sturdystats import Index

index = Index("Custom Analysis")
index.upload(df.to_dict("records"))
index.commit()
index.train()

# Ready to Explore 
index.topicSearch()

More Examples