pip install sturdy-stats-sdk pandas numpy plotly duckdb
from IPython.display import display, Markdown, Latex
import pandas as pd
import numpy as np
import plotly.express as px
from sturdystats import Index, Job
from pprint import pprint
## Basic Utilities
= "simple_white" # Change the template
px.defaults.template = px.colors.qualitative.Dark24 # Change color sequence
px.defaults.color_discrete_sequence
def procFig(fig, **kwargs):
= "rgba(0, 0, 0, 0)", paper_bgcolor= "rgba(0, 0, 0, 0)",
fig.update_layout(plot_bgcolor=dict(l=0,r=0,b=0,t=30,pad=0),
margin**kwargs
)= True
fig.layout.xaxis.fixedrange = True
fig.layout.yaxis.fixedrange return fig
index = Index(id="index_6095a26fc2be4674a005778dd8bcd5e5")

# Uncomment the line below to create and train your own index
# index = Index(name="transformer_architecture")
if index.get_status()["state"] == "untrained":
    index.ingestIntegration("academic_search", "transformer architecture")
    job = index.train(dict(burn_in=1200, subdoc_hierarchy=False), fast=True)
    print(job.get_status())
    # job.wait() # Sleeps until job finishes
Found an existing index with id="index_6095a26fc2be4674a005778dd8bcd5e5".
Our Bayesian probabilistic model learns a set of high-level topics from your corpus. These topics are completely custom to your data, whether your dataset has hundreds of documents or billions. The model then maps this set of learned topics to every single word, sentence, paragraph, document, and group of documents in your dataset, providing a powerful semantic index.
This indexing enables us to store data in a granular, structured, tabular format. That structure in turn enables rapid analysis of complex questions.
index = Index(id="index_6095a26fc2be4674a005778dd8bcd5e5")
topic_df = index.topicSearch()
topic_df.head()
Found an existing index with id="index_6095a26fc2be4674a005778dd8bcd5e5".
|   | short_title | topic_id | mentions | prevalence | one_sentence_summary | executive_paragraph_summary | topic_group_id | topic_group_short_title | conc | entropy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Transformers in Vision Tasks | 24 | 318.0 | 0.118941 | The integration of transformer architectures w... | Recent advancements in image processing have h... | 0 | Transformers and Architectures | 72.880173 | 6.118217 |
| 1 | Advancements in Image Classification Architect... | 58 | 180.0 | 0.059627 | Recent developments in image classification ha... | The evolving landscape of image classification... | 7 | Trends and Developments | 36.206444 | 6.273045 |
| 2 | Transformer-based Object Detection | 80 | 144.0 | 0.041205 | This theme focuses on leveraging transformer a... | The theme centers on the application of transf... | 6 | Applications and Technologies | 13.590480 | 6.414909 |
| 3 | Transformer Advancements | 15 | 148.0 | 0.040112 | The rapid advancement of Transformer models is... | Recent advancements in Transformer models, ori... | 5 | Transformer Innovations | 43.570381 | 6.714420 |
| 4 | Advancements in NLP Models | 65 | 146.0 | 0.037816 | The development and adaptation of transformer-... | Recent advancements in natural language proces... | 4 | Machine Learning Techniques | 13.264362 | 6.135605 |
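The summary columns are full prose generated for each topic, truncated in the table above. As a quick illustrative peek, we can render one of them inline with the `display` and `Markdown` helpers imported at the top of the notebook:

```python
# Render the full executive summary of the highest-prevalence topic
display(Markdown(topic_df.iloc[0]["executive_paragraph_summary"]))
```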
The following treemap visualizes the topics hierarchically, grouping them by high-level topic group. The size of each topic is proportional to the percentage of the time that topic shows up within papers about Transformer Architectures.
fig = px.treemap(topic_df, path=["topic_group_short_title", "short_title"], values="prevalence", hover_data=["topic_id"])
procFig(fig, height=500).show()
Let’s say we are interested in learning more about the years during which papers on Graph Transformers were published. The topic information has been converted into a tabular format that we can query directly via SQL. We expose the tables via the queryMeta API. If we choose to, we can do all of our semantic analysis directly in SQL.
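For example, a single queryMeta call is enough to see what the underlying `doc` table looks like. This is a minimal sketch; the exact set of metadata columns depends on the integration, but `published` and `citationCount` are used later in this walkthrough.

```python
# Inspect one row of the doc table to see which metadata and
# semantic-annotation columns are available to query
index.queryMeta("SELECT * FROM doc LIMIT 1")
```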
row = topic_df.loc[topic_df.short_title == "Graph Transformers"]
row
|    | short_title | topic_id | mentions | prevalence | one_sentence_summary | executive_paragraph_summary | topic_group_id | topic_group_short_title | conc | entropy |
|---|---|---|---|---|---|---|---|---|---|---|
| 13 | Graph Transformers | 68 | 51.0 | 0.01789 | The theme explores the advancements and challe... | Recent developments in leveraging Transformer ... | 4 | Machine Learning Techniques | 25.215822 | 6.30969 |
Below we query the `doc` table to count the research publications that have at least 2 words assigned to our topic of interest. Depending on the use case, you may want more or fewer words as a threshold. We then aggregate by publication year, giving us a rapid semantic analysis of the number of publications that mention Graph Transformers each year.
= index.queryMeta(f"""
df SELECT
year(published::DATE) as year,
count(*) as publications
FROM doc
WHERE sparse_list_extract(
{row.iloc[0].topic_id+1}, -- 1 indexing
sum_topic_counts_inds,
sum_topic_counts_vals
) > 2.0
GROUP BY year
ORDER BY year
""")
fig = px.bar(df, x="year", y="publications",
             title=f"'{row.iloc[0].short_title}' Publications over Time")
procFig(fig, title_x=.5)
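The 2-word cutoff above is a judgment call. A small helper that re-runs the same aggregation at any cutoff makes it easy to check how sensitive the trend is to that choice. This is just a sketch reusing the query above; `min_words` is an illustrative parameter name.

```python
def publicationsPerYear(topic_id: int, min_words: float = 2.0):
    # Same aggregation as above, with the word-count threshold exposed
    return index.queryMeta(f"""
    SELECT year(published::DATE) as year, count(*) as publications
    FROM doc
    WHERE sparse_list_extract({topic_id+1}, sum_topic_counts_inds, sum_topic_counts_vals) > {min_words}
    GROUP BY year ORDER BY year
    """)

publicationsPerYear(row.iloc[0].topic_id, min_words=3.0).head()
```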
Because the semantic annotations are stored side by side with the metadata, we can further enrich our visualization and insights. For example, we also have access to the `citationCount` of each publication.
= index.queryMeta(f"""
df SELECT
year(published::DATE) as year,
count(*) as publications,
median(citationCount) as median_citationCount
FROM doc
WHERE sparse_list_extract({row.iloc[0].topic_id+1}, sum_topic_counts_inds, sum_topic_counts_vals) > 2.0
GROUP BY year
ORDER BY year
""")
fig = px.bar(df, x="year", y="publications",
             color="median_citationCount", color_continuous_scale="blues",
             title=f"'{row.iloc[0].short_title}' Publications over Time")
procFig(fig, title_x=.5)
So far we have only looked at one topic at a time. However, we can perform much more complex analyses. Our topics are stored in a tabular format alongside the metadata. This unified data storage enriches the metadata with semantic information and enriches the semantic information with structured context. As a result, we can perform complex semantic analyses with simple, structured SQL queries.
Below, we run a SQL query to load the number of citations papers receive according to the sets of topics they belong to. The field `theta` is a sparse array broken up into `theta_inds` and `theta_vals`: `theta_inds` lists the topic_ids that appear in a document, and `theta_vals` records, for each of those topics, what percentage of the document it comprises.
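Concretely, with made-up numbers purely for illustration, a paper that is 70% about topic 24 and 30% about topic 68 would be stored roughly as follows:

```python
# Hypothetical values for a single document, for illustration only
theta_inds = [25, 69]    # 1-indexed topic ids (topic_id 24 and topic_id 68)
theta_vals = [0.7, 0.3]  # fraction of the document each topic comprises
```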
For each document, we unpack its topics. We then split its citations across topics in proportion to the percentage of the document each topic comprises. Finally, we aggregate by topic id. This returns a list of topics and the median citation count associated with each topic.
= f"""
topicCitationsQuery WITH t1 AS (
SELECT
unnest(theta_inds)-1 as topic_id, -- 1 indexed
unnest(theta_vals) as topic_pct,
citationCount,
unnest(theta_vals)*citationCount as citationCount,
year(published::DATE) as year
FROM doc
)
SELECT
topic_id,
count(*) as publications,
median(citationCount) as median_citationCount
FROM t1
WHERE topic_pct > .2
GROUP BY topic_id
ORDER BY median_citationCount desc
"""
topicCitations = index.queryMeta(topicCitationsQuery)
topicCitations.head(5)
|   | topic_id | publications | median_citationCount |
|---|---|---|---|
| 0 | 70 | 4 | 2192.0 |
| 1 | 59 | 2 | 1820.0 |
| 2 | 54 | 2 | 591.0 |
| 3 | 3 | 13 | 521.0 |
| 4 | 35 | 4 | 393.5 |
We can now return to our earlier treemap visualization and imbue it with new information. Instead of assigning a color to each topic group, we can use a color scale to designate the median number of citations each topic receives in our corpus. We convert the citation counts to log scale to work well with plotly’s built-in linear continuous color scale.
The richer the blue, the more citations that topic tends to get. The richer the red, the fewer it tends to get.
import duckdb

def buildDF(topicCitations):
    topic_df = index.topicSearch()
    df = duckdb.sql("""SELECT
            topic_df.*,
            topicCitations.publications,
            log(topicCitations.median_citationCount+1) as "log Median Citation Count"
        FROM topic_df
        INNER JOIN topicCitations
            ON topic_df.topic_id=topicCitations.topic_id
    """).to_df()
    ## For nicer color scale: most papers are between 10->1000
    df["log Median Citation Count"] = df["log Median Citation Count"].clip(1, 3)
    return df
df = buildDF(topicCitations)
df["title"] = "Transformer Architectures Topics"
fig = px.treemap(df, path=["title", "topic_group_short_title", "short_title"],
                 values="publications", hover_data=["topic_id"],
                 color="log Median Citation Count", color_continuous_scale="rdbu")
fig = procFig(fig, height=500)
fig
We can create a whole family of visualizations by simply adding a search_query to our initial queryMeta statement. This lets us explore specific fields and subfields within an existing model instantly.
= "classification"
SEARCH_QUERY = index.queryMeta(topicCitationsQuery, SEARCH_QUERY, semantic_search_cutoff=.5)
topicCitations
= buildDF(topicCitations)
df "title"] = f"'{SEARCH_QUERY}' Topic Citation Map"
df[= px.treemap(df, path=["title", "topic_group_short_title", "short_title"],
fig ="publications", hover_data=["topic_id"],
values="log Median Citation Count", color_continuous_scale="rdbu")
color= procFig(fig, height=500)
fig fig
from sturdystats import Index

# Build and train a brand-new index from any dataframe of records
index = Index("Custom Analysis")
index.upload(df.to_dict("records"))
index.commit()
index.train()

# Ready to Explore
index.topicSearch()
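Once training finishes, the new index supports the exact same workflow as above. As a sketch, assuming the custom index's topicSearch output has the same columns we used earlier, the treemap recipe can be reused verbatim:

```python
# Reuse the helpers defined above to visualize the custom index's topics
topic_df = index.topicSearch()
fig = px.treemap(topic_df, path=["topic_group_short_title", "short_title"],
                 values="prevalence", hover_data=["topic_id"])
procFig(fig, height=500).show()
```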