pip install sturdy-stats-sdk pandas numpy plotly duckdb
from IPython.display import display, Markdown, Latex
import pandas as pd
import numpy as np
import plotly.express as px
from sturdystats import Index, Job
from pprint import pprint
## Basic Utilities
= "simple_white" # Change the template
px.defaults.template = px.colors.qualitative.Dark24 # Change color sequence
px.defaults.color_discrete_sequence
def procFig(fig, **kwargs):
= "rgba(0, 0, 0, 0)", paper_bgcolor= "rgba(0, 0, 0, 0)",
fig.update_layout(plot_bgcolor=dict(l=0,r=0,b=0,t=30,pad=0),
margin**kwargs
)= True
fig.layout.xaxis.fixedrange = True
fig.layout.yaxis.fixedrange return fig
index = Index(id="index_6095a26fc2be4674a005778dd8bcd5e5")

# Uncomment the line below to create and train your own index
# index = Index(name="transformer_architecture")
if index.get_status()["state"] == "untrained":
    index.ingestIntegration("academic_search", "transformer architecture")
    job = index.train(dict(burn_in=1200, subdoc_hierarchy=False), fast=True)
    print(job.get_status())
    # job.wait() # Sleeps until job finishes
Found an existing index with id="index_6095a26fc2be4674a005778dd8bcd5e5".
Our Bayesian probabilistic model learns a set of high-level topics from your corpus. These topics are completely custom to your data, whether your dataset has hundreds of documents or billions. The model then maps this set of learned topics to every single word, sentence, paragraph, document, and group of documents in your dataset, providing a powerful semantic index.
This indexing enables us to store data in a granular, structured, tabular format. That structure in turn enables rapid analysis of complex questions.
index = Index(id="index_6095a26fc2be4674a005778dd8bcd5e5")
topic_df = index.topicSearch()
topic_df.head()
Found an existing index with id="index_6095a26fc2be4674a005778dd8bcd5e5".
|   | short_title | topic_id | mentions | prevalence | one_sentence_summary | executive_paragraph_summary | topic_group_id | topic_group_short_title | conc | entropy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Transformers in Vision Tasks | 24 | 318.0 | 0.118941 | The integration of transformer architectures w... | Recent advancements in image processing have h... | 0 | Transformers and Architectures | 72.880173 | 6.118217 |
| 1 | Advancements in Image Classification Architect... | 58 | 180.0 | 0.059627 | Recent developments in image classification ha... | The evolving landscape of image classification... | 7 | Trends and Developments | 36.206444 | 6.273045 |
| 2 | Transformer-based Object Detection | 80 | 144.0 | 0.041205 | This theme focuses on leveraging transformer a... | The theme centers on the application of transf... | 6 | Applications and Technologies | 13.590480 | 6.414909 |
| 3 | Transformer Advancements | 15 | 148.0 | 0.040112 | The rapid advancement of Transformer models is... | Recent advancements in Transformer models, ori... | 5 | Transformer Innovations | 43.570381 | 6.714420 |
| 4 | Advancements in NLP Models | 65 | 146.0 | 0.037816 | The development and adaptation of transformer-... | Recent advancements in natural language proces... | 4 | Machine Learning Techniques | 13.264362 | 6.135605 |
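The summary columns are full prose generated for each topic, truncated in the table above. As a quick illustrative peek, we can render one of them inline with the `display` and `Markdown` helpers imported at the top of the notebook:

```python
# Render the full executive summary of the highest-prevalence topic
display(Markdown(topic_df.iloc[0]["executive_paragraph_summary"]))
```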
The following treemap visualizes the topics hierarchically, grouping them by high-level topic group. The size of each topic is proportional to the percentage of the time that topic shows up within papers about Transformer Architectures.
fig = px.treemap(topic_df, path=["topic_group_short_title", "short_title"], values="prevalence", hover_data=["topic_id"])
procFig(fig, height=500).show()
Let’s say we are interested in learning more about the years during which papers on Graph Transformers were published. The topic information has been converted into a tabular format that we can query directly via SQL. We expose the tables via the queryMeta API. If we choose to, we can do all of our semantic analysis directly in SQL.
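For example, a single queryMeta call is enough to see what the underlying `doc` table looks like. This is a minimal sketch; the exact set of metadata columns depends on the integration, but `published` and `citationCount` are used later in this walkthrough.

```python
# Inspect one row of the doc table to see which metadata and
# semantic-annotation columns are available to query
index.queryMeta("SELECT * FROM doc LIMIT 1")
```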
row = topic_df.loc[topic_df.short_title == "Graph Transformers"]
row
|    | short_title | topic_id | mentions | prevalence | one_sentence_summary | executive_paragraph_summary | topic_group_id | topic_group_short_title | conc | entropy |
|---|---|---|---|---|---|---|---|---|---|---|
| 13 | Graph Transformers | 68 | 51.0 | 0.01789 | The theme explores the advancements and challe... | Recent developments in leveraging Transformer ... | 4 | Machine Learning Techniques | 25.215822 | 6.30969 |
Below we query the `doc` table to count the research publications that have at least 2 words assigned to our topic of interest. Depending on the use case, you may want more or fewer words as a threshold. We then aggregate by publication year, giving us a rapid semantic analysis of the number of publications that mention Graph Transformers each year.
= index.queryMeta(f"""
df SELECT
year(published::DATE) as year,
count(*) as publications
FROM doc
WHERE sparse_list_extract(
{row.iloc[0].topic_id+1}, -- 1 indexing
sum_topic_counts_inds,
sum_topic_counts_vals
) > 2.0
GROUP BY year
ORDER BY year
""")
fig = px.bar(df, x="year", y="publications",
             title=f"'{row.iloc[0].short_title}' Publications over Time")
procFig(fig, title_x=.5)
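The 2-word cutoff above is a judgment call. A small helper that re-runs the same aggregation at any cutoff makes it easy to check how sensitive the trend is to that choice. This is just a sketch reusing the query above; `min_words` is an illustrative parameter name.

```python
def publicationsPerYear(topic_id: int, min_words: float = 2.0):
    # Same aggregation as above, with the word-count threshold exposed
    return index.queryMeta(f"""
    SELECT year(published::DATE) as year, count(*) as publications
    FROM doc
    WHERE sparse_list_extract({topic_id+1}, sum_topic_counts_inds, sum_topic_counts_vals) > {min_words}
    GROUP BY year ORDER BY year
    """)

publicationsPerYear(row.iloc[0].topic_id, min_words=3.0).head()
```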
Because the semantic annotations are stored side by side with the metadata, we can further enrich our visualization and insights. For example, we also have access to the `citationCount` of each publication.
= index.queryMeta(f"""
df SELECT
year(published::DATE) as year,
count(*) as publications,
median(citationCount) as median_citationCount
FROM doc
WHERE sparse_list_extract({row.iloc[0].topic_id+1}, sum_topic_counts_inds, sum_topic_counts_vals) > 2.0
GROUP BY year
ORDER BY year
""")
fig = px.bar(df, x="year", y="publications",
             color="median_citationCount", color_continuous_scale="blues",
             title=f"'{row.iloc[0].short_title}' Publications over Time")
procFig(fig, title_x=.5)
So far we have only looked at one topic at a time. However, we can perform much more complex analyses. Our topics are stored in a tabular format alongside the metadata. This unified data storage enriches the metadata with semantic information and enriches the semantic information with structured context. As a result, we can perform complex semantic analyses with simple, structured SQL queries.
Below, we run a SQL query to load the number of citations papers receive according to the sets of topics they belong to. The field `theta` is a sparse array broken up into `theta_inds` and `theta_vals`: `theta_inds` lists the topic_ids that appear in a document, and `theta_vals` records, for each of those topics, what percentage of the document it comprises.
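Concretely, with made-up numbers purely for illustration, a paper that is 70% about topic 24 and 30% about topic 68 would be stored roughly as follows:

```python
# Hypothetical values for a single document, for illustration only
theta_inds = [25, 69]    # 1-indexed topic ids (topic_id 24 and topic_id 68)
theta_vals = [0.7, 0.3]  # fraction of the document each topic comprises
```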
For each document, we unpack its topics. We then split its citations across topics in proportion to the percentage of the document each topic comprises. Finally, we aggregate by topic id. This returns a list of topics and the median citation count associated with each topic.
= f"""
topicCitationsQuery WITH t1 AS (
SELECT
unnest(theta_inds)-1 as topic_id, -- 1 indexed
unnest(theta_vals) as topic_pct,
citationCount,
unnest(theta_vals)*citationCount as citationCount,
year(published::DATE) as year
FROM doc
)
SELECT
topic_id,
count(*) as publications,
median(citationCount) as median_citationCount
FROM t1
WHERE topic_pct > .2
GROUP BY topic_id
ORDER BY median_citationCount desc
"""
topicCitations = index.queryMeta(topicCitationsQuery)
topicCitations.head(5)
|   | topic_id | publications | median_citationCount |
|---|---|---|---|
| 0 | 70 | 4 | 2192.0 |
| 1 | 59 | 2 | 1820.0 |
| 2 | 54 | 2 | 591.0 |
| 3 | 3 | 13 | 521.0 |
| 4 | 35 | 4 | 393.5 |
We can now return to our earlier treemap visualization and imbue it with new information. Instead of assigning a color to each topic group, we can use a color scale to designate the median number of citations each topic receives in our corpus. We convert the citation counts to log scale to work well with plotly’s built-in linear continuous color scale.
The richer the blue, the more citations that topic tends to get. The richer the red, the fewer it tends to get.
import duckdb

def buildDF(topicCitations):
    topic_df = index.topicSearch()
    df = duckdb.sql("""SELECT
            topic_df.*,
            topicCitations.publications,
            log(topicCitations.median_citationCount+1) as "log Median Citation Count"
        FROM topic_df
        INNER JOIN topicCitations
            ON topic_df.topic_id=topicCitations.topic_id
    """).to_df()
    ## For nicer color scale: most papers are between 10->1000
    df["log Median Citation Count"] = df["log Median Citation Count"].clip(1, 3)
    return df
df = buildDF(topicCitations)
df["title"] = "Transformer Architectures Topics"
fig = px.treemap(df, path=["title", "topic_group_short_title", "short_title"],
                 values="publications", hover_data=["topic_id"],
                 color="log Median Citation Count", color_continuous_scale="rdbu")
fig = procFig(fig, height=500)
fig
We can create a whole family of visualizations by simply adding a search_query to our initial queryMeta statement. This lets us explore specific fields and subfields within an existing model instantly.
= "classification"
SEARCH_QUERY = index.queryMeta(topicCitationsQuery, SEARCH_QUERY, semantic_search_cutoff=.5)
topicCitations
= buildDF(topicCitations)
df "title"] = f"'{SEARCH_QUERY}' Topic Citation Map"
df[= px.treemap(df, path=["title", "topic_group_short_title", "short_title"],
fig ="publications", hover_data=["topic_id"],
values="log Median Citation Count", color_continuous_scale="rdbu")
color= procFig(fig, height=500)
fig fig
from sturdystats import Index

# Build and train a brand-new index from any dataframe of records
index = Index("Custom Analysis")
index.upload(df.to_dict("records"))
index.commit()
index.train()

# Ready to Explore
index.topicSearch()
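Once training finishes, the new index supports the exact same workflow as above. As a sketch, assuming the custom index's topicSearch output has the same columns we used earlier, the treemap recipe can be reused verbatim:

```python
# Reuse the helpers defined above to visualize the custom index's topics
topic_df = index.topicSearch()
fig = px.treemap(topic_df, path=["topic_group_short_title", "short_title"],
                 values="prevalence", hover_data=["topic_id"])
procFig(fig, height=500).show()
```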