SDK Index Reference

The Index class is the core data structure for storing, searching, and analyzing text data in SturdyStats. This class provides methods for creating and managing indices, uploading documents, training models, and querying data.

Installation

pip install sturdystats

Basic Usage

from sturdystats.index import Index

# Create or access an existing index
index = Index(name="my_index", API_key="your_api_key")

# Upload documents
docs = [
    {
        "doc": "Operator: Welcome, everyone. Thank you for standing by for the Alphabet Fourth Quarter 2023 Earnings Conference Call ...",
        "date": "2024-01-30",
        "quarter": "2024Q1",
        "author": "GOOG",
    },
    ...
]
index.upload(docs)

# Train the model
index.train(fast=True)

# Query documents
results = index.query(search_query="advertising revenue")

Class Constructor

Index()

def __init__(
    API_key: Optional[str] = os.environ.get("STURDY_STATS_API_KEY"),
    name: Optional[str] = None,
    id: Optional[str] = None,
    _base_url: Optional[str] = None,
    verbose: bool = True
)

Parameters:

  • API_key: Your SturdyStats API key. If not provided, it will look for the STURDY_STATS_API_KEY environment variable.
  • name: The name of the index to create or access.
  • id: The ID of an existing index to access.
  • _base_url: Optional override for the API base URL. Used for self-hosting.
  • verbose: Whether to print status messages (default: True).

Notes:

  • You must provide either name or id, but not both.
  • If the index does not exist, it will be created.
  • If the index exists, it will be accessed with its current state.
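
For example, the two access patterns look like this (the index name and ID below are placeholders):

from sturdystats.index import Index

# Create (or reattach to) an index by name; the API key is read from
# the STURDY_STATS_API_KEY environment variable when not passed explicitly
index = Index(name="earnings_calls")

# Attach to an existing index by its ID instead
existing = Index(id="your_index_id", API_key="your_api_key")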

Upload

Documents

upload()

Uploads documents to the index and optionally commits them for permanent storage.

def upload(
    records: Iterable[Dict],
    batch_size: int = 1000,
    commit: bool = True
) -> list[dict]

Parameters:

  • records: An iterable of document dictionaries. Each dictionary must have a doc field containing the document text.
  • batch_size: Maximum number of documents to upload in a single batch (default: 1000).
  • commit: Whether to commit the changes after uploading (default: True).

Returns:

  • A list of dictionaries with the results of the upload operation.

Notes:

  • Each document can contain a doc_id field with a unique identifier. If not provided, the system will generate one by hashing the document content.
  • Documents are upserted based on doc_id. If two documents have the same ID, the newer one replaces the older one.
  • The maximum batch size is 1000 documents.
  • This is a locking operation: you cannot call upload, train, or commit while an upload is in progress.
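
A typical staged upload might look like the sketch below; the doc_id values are illustrative:

records = [
    {"doc_id": "goog-2023-q4", "doc": "Operator: Welcome, everyone ...", "author": "GOOG"},
    {"doc_id": "goog-2024-q1", "doc": "Operator: Good afternoon ...", "author": "GOOG"},
]

# Stage both documents in one batch, then commit explicitly
index.upload(records, commit=False)
index.commit()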

Integrations

ingestIntegration()

Ingests data from various external sources.

def ingestIntegration(
    engine: Literal["academic_search", "hackernews_comments", "hackernews_story", 
                   "earnings_calls", "author_cn", "news_date_split", "google", 
                   "google_news", "reddit", "cn_all"],
    query: str,
    start_date: str | None = None, 
    end_date: str | None = None,
    args: dict = dict(),
    commit: bool = True,
    wait: bool = True
) -> Job | dict

Parameters:

  • engine: Source engine to use for ingestion.
  • query: Search query for the engine.
  • start_date: Optional start date for filtering results.
  • end_date: Optional end date for filtering results.
  • args: Additional arguments for the engine.
  • commit: Whether to commit changes after ingestion.
  • wait: Whether to wait for ingestion to complete before returning.

Returns:

  • If wait=True, returns a dictionary with the ingestion results.
  • If wait=False, returns a Job object to track progress.
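
As a sketch, ingesting earnings-call transcripts for a single ticker might look like this (the query string and date range are illustrative):

result = index.ingestIntegration(
    engine="earnings_calls",
    query="GOOG",
    start_date="2023-01-01",
    end_date="2024-01-01",
)  # wait=True by default, so this blocks until ingestion finishes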

Predict

predict()

Runs predictions on documents without saving them to the index.

def predict(
    records: Iterable[Dict],
    batch_size: int = 1000
) -> list[dict]

Parameters:

  • records: An iterable of document dictionaries to predict on.
  • batch_size: Maximum number of documents to process in a single batch.

Returns:

  • A list of dictionaries with predictions for each document.

Notes:

  • This function does not mutate the index in any way.
  • The index must be in the “ready” state (trained) to use this function.
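
Because predict() never mutates the index, it is safe to call on ad-hoc documents once training has finished; a minimal sketch:

# Score unseen documents against the trained model; nothing is stored
preds = index.predict([{"doc": "Cloud revenue grew 25% year over year."}])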

Commit

commit()

Commits changes from the staging index to the permanent index.

def commit(
    wait: bool = True
) -> Job | dict

Parameters:

  • wait: Whether to wait for the commit to complete before returning.

Returns:

  • If wait=True, returns a dictionary with the commit results.
  • If wait=False, returns a Job object that can be used to check the status of the commit job.

Unstage

unstage()

Discards changes in the staging index.

def unstage(
    wait: bool = True
) -> Job | dict

Parameters:

  • wait: Whether to wait for the unstage operation to complete before returning.

Returns:

  • If wait=True, returns a dictionary with the operation results.
  • If wait=False, returns a Job object that can be used to check the status.
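
Together, commit() and unstage() support a stage-review-apply workflow; a minimal sketch reusing the records from the upload example:

index.upload(records, commit=False)  # stage changes without committing
# ... verify the staged documents look right ...
index.commit()     # apply the staged changes permanently
# index.unstage()  # or discard them instead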

Train

train()

Trains an AI model on all documents in the index.

def train(
    params: Dict = dict(),
    fast: bool = False,
    force: bool = False,
    wait: bool = True
) -> Job | dict

Parameters:

  • params: Dictionary of training parameters.
  • fast: Whether to use faster training settings (default: False).
  • force: Whether to force training even if the index is already trained (default: False).
  • wait: Whether to wait for training to complete before returning (default: True).

Returns:

  • If wait=True, returns a dictionary with the training results.
  • If wait=False, returns a Job object that can be used to check the status of the training job.

Notes:

  • After training, documents are queryable, and the model automatically processes subsequently uploaded documents.
  • The AI model identifies thematic information in documents, enabling semantic search and quantitative analysis.
  • The model can be supervised using metadata from the index.
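
For example, a quick first pass and a later full retrain might look like this (the params dictionary is omitted because its keys are not covered in this reference):

index.train(fast=True)                     # quick initial training pass
job = index.train(force=True, wait=False)  # full retrain, returned as a Job to track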

Query

Documents

query()

Searches for documents in the index.

def query(
    search_query: Optional[str] = None,
    topic_id: Optional[int] = None,
    topic_group_id: Optional[int] = None,
    filters: str = "",
    offset: int = 0,
    limit: int = 20,
    sort_by: str = "relevance",
    ascending: bool = False,
    context: int = 0,
    max_excerpts_per_doc: int = 5,
    semantic_search_weight: float = .3,
    semantic_search_cutoff: float = .1,
    override_args: dict = dict(),
    return_df: bool = True
) -> pd.DataFrame

Parameters:

  • search_query: Text query for semantic search.
  • topic_id: ID of a specific topic to filter by.
  • topic_group_id: ID of a topic group to filter by.
  • filters: Filter query string.
  • offset: Number of results to skip (for pagination).
  • limit: Maximum number of results to return.
  • sort_by: Field to sort results by (default: “relevance”).
  • ascending: Whether to sort in ascending order.
  • context: Number of sentences of context to include around matching text.
  • max_excerpts_per_doc: Maximum number of excerpts to return per document.
  • semantic_search_weight: Weight to give semantic search vs. keyword search (0-1).
  • semantic_search_cutoff: Minimum relevance score for semantic search results.
  • override_args: Additional arguments to pass to the API.
  • return_df: Whether to return results as a pandas DataFrame (default: True).

Returns:

  • A pandas DataFrame with the query results, or a list of dictionaries if return_df=False.
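
A sketch of a paginated search using only the parameters documented above:

df = index.query(
    search_query="advertising revenue",
    limit=20,
    offset=0,                    # advance by `limit` to page through results
    semantic_search_weight=0.5,  # lean slightly more on semantic matching
)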

Metadata

queryMeta()

Runs SQL queries on document metadata.

def queryMeta(
    query: str, 
    search_query: str = "",
    semantic_search_weight: float = .3,
    semantic_search_cutoff: float = .1,
    override_args: dict = dict(),
    return_df: bool = True,
    paginate: bool = False
) -> pd.DataFrame

Parameters:

  • query: SQL-like query to run on metadata.
  • search_query: Optional text query for filtering.
  • semantic_search_weight: Weight for semantic search.
  • semantic_search_cutoff: Minimum relevance score.
  • override_args: Additional arguments for the API.
  • return_df: Whether to return results as a pandas DataFrame.
  • paginate: Whether to automatically paginate results.

Returns:

  • A pandas DataFrame with query results, or a list if return_df=False.
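
As a sketch, assuming the metadata fields from the upload example above and that documents are exposed to SQL under a table named doc (the table name is an assumption, not a documented fact):

# NOTE: "doc" as a table name is illustrative; check the API docs for the real schema
df = index.queryMeta("SELECT author, COUNT(*) AS n_calls FROM doc GROUP BY author")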

Topic Analysis

topicSearch()

Searches for topics in the index.

def topicSearch(
    query: str = "",
    filters: str = "",
    limit: int = 100,
    semantic_search_weight: float = .3,
    semantic_search_cutoff: float = .1,
    override_args: dict = dict(),
    return_df: bool = True
) -> pd.DataFrame

Parameters:

  • query: Text query for searching topics.
  • filters: Filter query string.
  • limit: Maximum number of topics to return.
  • semantic_search_weight: Weight for semantic search.
  • semantic_search_cutoff: Minimum relevance score.
  • override_args: Additional arguments for the API.
  • return_df: Whether to return results as a pandas DataFrame.

Returns:

  • A pandas DataFrame with matching topics, or a list if return_df=False.
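
For example, to surface the topics most associated with a theme:

# Returns one row per matching topic
topics = index.topicSearch(query="artificial intelligence", limit=25)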

topicDiff()

Compares topics between two filtered subsets of the data.

def topicDiff(
    filter1: str = "",
    filter2: str = "",
    search_query1: str = "",
    search_query2: str = "",
    limit: int = 50,
    cutoff: float = 1.0,
    min_confidence: float = 95,
    semantic_search_weight: float = .3,
    semantic_search_cutoff: float = .1,
    override_args: dict = dict(),
    return_df: bool = True
) -> pd.DataFrame

Parameters:

  • filter1: Filter for the first subset.
  • filter2: Filter for the second subset.
  • search_query1: Search query for the first subset.
  • search_query2: Search query for the second subset.
  • limit: Maximum number of topics to return.
  • cutoff: Minimum difference to include a topic.
  • min_confidence: Minimum confidence level, as a percentage, required to report a difference (default: 95).
  • semantic_search_weight: Weight for semantic search.
  • semantic_search_cutoff: Minimum relevance score.
  • override_args: Additional arguments for the API.
  • return_df: Whether to return results as a pandas DataFrame.

Returns:

  • A pandas DataFrame with topic differences, or a list if return_df=False.
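
A sketch comparing two metadata slices; the filter strings below assume the quarter field from the upload example, and their syntax is an assumption, not a documented fact:

diff = index.topicDiff(
    filter1="quarter = '2023Q4'",  # illustrative filter syntax
    filter2="quarter = '2024Q1'",
    min_confidence=95,
)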

Document Operations

deleteDocs()

Deletes documents from the index by their IDs.

def deleteDocs(
    doc_ids: list[str],
    override_args: dict = dict()
) -> dict

Parameters:

  • doc_ids: List of document IDs to delete.
  • override_args: Additional arguments to pass to the API.

Returns:

  • A dictionary with the results of the delete operation.

getDocs()

Retrieves specific documents by their IDs.

def getDocs(
    doc_ids: list[str],
    search_query: Optional[str] = None,
    topic_id: Optional[int] = None,
    topic_group_id: Optional[int] = None,
    max_excerpts_per_doc: int = 5,
    context: int = 0,
    override_args: dict = dict(),
    return_df: bool = True
) -> pd.DataFrame

Parameters:

  • doc_ids: List of document IDs to retrieve.
  • search_query: Optional text query to highlight matching parts.
  • topic_id: Optional topic ID to filter excerpts by.
  • topic_group_id: Optional topic group ID to filter excerpts by.
  • max_excerpts_per_doc: Maximum number of excerpts to return per document.
  • context: Number of sentences of context to include around matching text.
  • override_args: Additional arguments to pass to the API.
  • return_df: Whether to return results as a pandas DataFrame.

Returns:

  • A pandas DataFrame with the documents, or a list of dictionaries if return_df=False.
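
For instance, fetching documents with highlighted excerpts and then removing one (the IDs are illustrative):

docs = index.getDocs(["goog-2023-q4", "goog-2024-q1"], search_query="revenue", context=1)
index.deleteDocs(["goog-2024-q1"])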

getDocsBinary()

Retrieves documents in spaCy DocBin format.

def getDocsBinary(
    doc_ids: list[str]
) -> DocBin

Parameters:

  • doc_ids: List of document IDs to retrieve.

Returns:

  • A spaCy DocBin object containing the documents.
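
The returned DocBin can be unpacked with a spaCy vocabulary; a minimal sketch, assuming a blank English pipeline suffices for deserialization:

import spacy

doc_bin = index.getDocsBinary(["goog-2023-q4"])  # illustrative doc_id
nlp = spacy.blank("en")                          # provides a vocab for deserialization
docs = list(doc_bin.get_docs(nlp.vocab))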

Metadata Operations

getPandata()

Retrieves index-wide metadata, transferred in MessagePack format and returned as a decoded dictionary.

def getPandata() -> dict

Returns:

  • A dictionary with index metadata.

Index Management

get_status()

Gets the current status of the index.

def get_status() -> dict

Returns:

  • A dictionary with the index status and metadata.

annotate()

Triggers annotation of the index data.

def annotate()

clone()

Creates a copy of the index.

def clone(new_name: str) -> dict

Parameters:

  • new_name: Name for the cloned index.

Returns:

  • A dictionary with the result of the clone operation.

delete()

Deletes the index.

def delete(force: bool) -> dict

Parameters:

  • force: Must be True to confirm deletion.

Returns:

  • A dictionary with the result of the delete operation.
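
A common pattern is to clone an index before a destructive experiment, then delete the copy afterwards (the name below is illustrative):

index.clone("my_index_backup")
backup = Index(name="my_index_backup")
# ... run destructive experiments on the copy ...
backup.delete(force=True)  # force=True is required to confirm deletion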

listIndices()

Lists all available indices.

def listIndices(
    name_filter: Optional[str] = None,
    state_filter: Optional[str] = None,
    return_df: bool = True
) -> pd.DataFrame

Parameters:

  • name_filter: Filter indices by name.
  • state_filter: Filter indices by state.
  • return_df: Whether to return results as a pandas DataFrame.

Returns:

  • A pandas DataFrame with index information, or a list if return_df=False.
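
For example, to narrow the listing by name (whether name_filter matches by substring or exact name is an assumption here; consult the API docs):

indices = index.listIndices(name_filter="earnings")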

Job Management

listJobs()

Lists jobs associated with the index.

def listJobs(
    status: str = "RUNNING",
    job_name: Optional[str] = None,
    only_current_index: bool = True,
    return_df: bool = True
) -> pd.DataFrame

Parameters:

  • status: Filter jobs by status (“RUNNING”, “FAILED”, “SUCCEEDED”, “PENDING”, “CANCELLED”).
  • job_name: Filter jobs by name.
  • only_current_index: Whether to only show jobs for the current index.
  • return_df: Whether to return results as a pandas DataFrame.

Returns:

  • A pandas DataFrame with job information, or a list if return_df=False.
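
For example, to check whether anything is still running against the current index:

running = index.listJobs(status="RUNNING")  # one row per active job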