SDK Index Reference
The Index class is the core data structure for storing, searching, and analyzing text data in SturdyStats. This class provides methods for creating and managing indices, uploading documents, training models, and querying data.
Installation
pip install sturdystats
Basic Usage
from sturdystats.index import Index

# Create or access an existing index
index = Index(name="my_index", API_key="your_api_key")

# Upload documents
docs = [{
    "doc": "Operator: Welcome, everyone. Thank you for standing by for the Alphabet Fourth Quarter 2023 Earnings Conference Call ...",
    "date": "2024-01-30",
    "quarter": "2024Q1",
    "author": "GOOG"
},
...
]
index.upload(docs)

# Train the model
index.train(fast=True)

# Query documents
results = index.query(search_query="document")
Class Constructor
Index()
def __init__(
    API_key: Optional[str] = os.environ.get("STURDY_STATS_API_KEY"),
    name: Optional[str] = None,
    id: Optional[str] = None,
    _base_url: Optional[str] = None,
    verbose: bool = True
)
Parameters:
- API_key: Your SturdyStats API key. If not provided, it will look for the STURDY_STATS_API_KEY environment variable.
- name: The name of the index to create or access.
- id: The ID of an existing index to access.
- _base_url: Optional override for the API base URL. Used for self-hosting.
- verbose: Whether to print status messages (default: True).
Notes:
- You must provide either name or id, but not both.
- If the index does not exist, it will be created.
- If the index exists, it will be accessed with its current state.
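The either-name-or-id contract above can be enforced up front before constructing an Index. A minimal sketch; the `index_args` helper and the names "my_index"/"index_abc123" are illustrative, not part of the SDK, and the API call only runs when credentials are configured:

```python
import os
from typing import Optional

def index_args(name: Optional[str] = None, id: Optional[str] = None) -> dict:
    # Enforce the documented contract: exactly one of name or id.
    if (name is None) == (id is None):
        raise ValueError("provide exactly one of name or id")
    return {"name": name} if name is not None else {"id": id}

args = index_args(name="my_index")  # or index_args(id="index_abc123")

# Only talk to the API when a key is actually configured.
if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(**args)  # created if missing, attached if it already exists
```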
Upload Documents
upload()
Uploads documents to the index and optionally commits them for permanent storage.
def upload(
    records: Iterable[Dict],
    batch_size: int = 1000,
    commit: bool = True
) -> list[dict]
Parameters:
- records: An iterable of document dictionaries. Each dictionary must have a doc field containing the document text.
- batch_size: Maximum number of documents to upload in a single batch (default: 1000).
- commit: Whether to commit the changes after uploading (default: True).
Returns:
- A list of dictionaries with the results of the upload operation.
Notes:
- Each document can contain a doc_id field with a unique identifier. If not provided, the system will generate one by hashing the document content.
- Documents are upserted based on doc_id. If two documents have the same ID, the newer one replaces the older one.
- The maximum batch size is 1000 documents.
- This is a locking operation - you cannot call upload, train, or commit while an upload is in progress.
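The batching and ID-generation semantics described above can be sketched in plain Python. This illustrates the documented behavior only; it is not the library's implementation, and SHA-256 as the content hash is an assumption:

```python
import hashlib
from itertools import islice
from typing import Dict, Iterable, Iterator, List

MAX_BATCH = 1000  # documented upper bound per upload batch

def with_doc_id(record: Dict) -> Dict:
    # Records without an explicit doc_id get one derived from the text,
    # so re-uploading identical content upserts rather than duplicates.
    if "doc_id" not in record:
        digest = hashlib.sha256(record["doc"].encode()).hexdigest()
        record = {**record, "doc_id": digest}
    return record

def batched(records: Iterable[Dict], batch_size: int = MAX_BATCH) -> Iterator[List[Dict]]:
    # Split an arbitrary iterable into chunks no larger than MAX_BATCH.
    it = iter(records)
    while batch := list(islice(it, min(batch_size, MAX_BATCH))):
        yield batch

docs = [{"doc": f"document {i}"} for i in range(2500)]
batches = list(batched(map(with_doc_id, docs)))  # 1000 + 1000 + 500
```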
Integrations
ingestIntegration()
Ingests data from various external sources.
def ingestIntegration(
    engine: Literal["academic_search", "hackernews_comments", "hackernews_story",
                    "earnings_calls", "author_cn", "news_date_split", "google",
                    "google_news", "reddit", "cn_all"],
    query: str,
    start_date: str | None = None,
    end_date: str | None = None,
    args: dict = dict(),
    commit: bool = True,
    wait: bool = True
) -> Job | dict
Parameters:
- engine: Source engine to use for ingestion.
- query: Search query for the engine.
- start_date: Optional start date for filtering results.
- end_date: Optional end date for filtering results.
- args: Additional arguments for the engine.
- commit: Whether to commit changes after ingestion.
- wait: Whether to wait for ingestion to complete before returning.
Returns:
- If wait=True, returns a dictionary with the ingestion results.
- If wait=False, returns a Job object to track progress.
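A sketch of a non-blocking ingestion call. The index name "hn_llm_stories" and the query string are illustrative; the engine value comes from the documented Literal options, and the API call only runs when credentials are configured:

```python
import os

# Request parameters for a Hacker News story ingestion.
request = dict(
    engine="hackernews_story",   # one of the documented engine literals
    query="large language models",
    start_date="2024-01-01",
    wait=False,                  # return a Job handle instead of blocking
)

if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(name="hn_llm_stories")  # illustrative name
    job = index.ingestIntegration(**request)
```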
Predict
predict()
Runs predictions on documents without saving them to the index.
def predict(
    records: Iterable[Dict],
    batch_size: int = 1000
) -> list[dict]
Parameters:
- records: An iterable of document dictionaries to predict on.
- batch_size: Maximum number of documents to process in a single batch.
Returns:
- A list of dictionaries with predictions for each document.
Notes:
- This function does not mutate the index in any way.
- The index must be in the “ready” state (trained) to use this function.
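Since predict() never mutates the index, it suits scoring candidate documents before deciding whether to upload them. A sketch, assuming a trained index named "my_index" exists; the API call only runs when credentials are configured:

```python
import os

# Records to score; predict() takes the same "doc"-keyed records as
# upload() but never writes anything to the index.
drafts = [
    {"doc": "Revenue grew 12% year over year, driven by cloud demand."},
    {"doc": "The board approved a new share repurchase program."},
]

if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(name="my_index")       # must already be trained ("ready")
    predictions = index.predict(drafts)  # one result dict per input record
```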
Commit
commit()
Commits changes from the staging index to the permanent index.
def commit(
    wait: bool = True
) -> Job | dict
Parameters:
- wait: Whether to wait for the commit to complete before returning.
Returns:
- If wait=True, returns a dictionary with the commit results.
- If wait=False, returns a Job object that can be used to check the status of the commit job.
Unstage
unstage()
Discards changes in the staging index.
def unstage(
    wait: bool = True
) -> Job | dict
Parameters:
- wait: Whether to wait for the unstage operation to complete before returning.
Returns:
- If wait=True, returns a dictionary with the operation results.
- If wait=False, returns a Job object that can be used to check the status.
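commit() and unstage() pair naturally with upload(commit=False): stage changes first, then either persist or discard them. A sketch, assuming an index named "my_index"; the keep/discard decision is illustrative, and the API calls only run when credentials are configured:

```python
import os

staged_docs = [{"doc": "A tentative document we may decide to discard."}]

if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(name="my_index")
    # Stage without committing, then decide whether to keep the changes.
    index.upload(staged_docs, commit=False)
    keep = False  # illustrative decision
    if keep:
        index.commit()   # staging index -> permanent index
    else:
        index.unstage()  # discard everything in the staging index
```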
Train
train()
Trains an AI model on all documents in the index.
def train(
    params: Dict = dict(),
    fast: bool = False,
    force: bool = False,
    wait: bool = True
) -> Job | dict
Parameters:
- params: Dictionary of training parameters.
- fast: Whether to use faster training settings (default: False).
- force: Whether to force training even if the index is already trained (default: False).
- wait: Whether to wait for training to complete before returning (default: True).
Returns:
- If wait=True, returns a dictionary with the training results.
- If wait=False, returns a Job object that can be used to check the status of the training job.
Notes:
- After training, documents are queryable, and the model automatically processes subsequently uploaded documents.
- The AI model identifies thematic information in documents, enabling semantic search and quantitative analysis.
- The model can be supervised using metadata from the index.
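A sketch of a non-blocking training call using the documented options; the index name is illustrative, and the API call only runs when credentials are configured:

```python
import os

# Training options: fast trades model quality for turnaround, force
# retrains an already-trained index, wait=False returns a Job to poll.
train_opts = dict(fast=True, force=False, wait=False)

if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(name="my_index")
    job = index.train(**train_opts)
```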
Query Documents
query()
Searches for documents in the index.
def query(
    search_query: Optional[str] = None,
    topic_id: Optional[int] = None,
    topic_group_id: Optional[int] = None,
    filters: str = "",
    offset: int = 0,
    limit: int = 20,
    sort_by: str = "relevance",
    ascending: bool = False,
    context: int = 0,
    max_excerpts_per_doc: int = 5,
    semantic_search_weight: float = .3,
    semantic_search_cutoff: float = .1,
    override_args: dict = dict(),
    return_df: bool = True
) -> pd.DataFrame
Parameters:
- search_query: Text query for semantic search.
- topic_id: ID of a specific topic to filter by.
- topic_group_id: ID of a topic group to filter by.
- filters: Filter query string.
- offset: Number of results to skip (for pagination).
- limit: Maximum number of results to return.
- sort_by: Field to sort results by (default: "relevance").
- ascending: Whether to sort in ascending order.
- context: Number of sentences of context to include around matching text.
- max_excerpts_per_doc: Maximum number of excerpts to return per document.
- semantic_search_weight: Weight to give semantic search vs. keyword search (0-1).
- semantic_search_cutoff: Minimum relevance score for semantic search results.
- override_args: Additional arguments to pass to the API.
- return_df: Whether to return results as a pandas DataFrame (default: True).
Returns:
- A pandas DataFrame with the query results, or a list of dictionaries if return_df=False.
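offset and limit combine for page-by-page retrieval. A sketch with the page arithmetic made explicit; the helper and search string are illustrative, and the API call only runs when credentials are configured:

```python
import os

PAGE_SIZE = 20  # matches query()'s default limit

def page_offset(page: int, page_size: int = PAGE_SIZE) -> int:
    # offset skips whole pages of results, e.g. page 3 -> offset 40
    return (page - 1) * page_size

if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(name="my_index")
    # Third page of semantic-search results, most relevant first.
    df = index.query(
        search_query="cloud revenue growth",
        limit=PAGE_SIZE,
        offset=page_offset(3),
        sort_by="relevance",
    )
```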
Metadata
queryMeta()
Runs SQL queries on document metadata.
def queryMeta(
    query: str,
    search_query: str = "",
    semantic_search_weight: float = .3,
    semantic_search_cutoff: float = .1,
    override_args: dict = dict(),
    return_df: bool = True,
    paginate: bool = False
) -> pd.DataFrame
Parameters:
- query: SQL-like query to run on metadata.
- search_query: Optional text query for filtering.
- semantic_search_weight: Weight for semantic search.
- semantic_search_cutoff: Minimum relevance score.
- override_args: Additional arguments for the API.
- return_df: Whether to return results as a pandas DataFrame.
- paginate: Whether to automatically paginate results.
Returns:
- A pandas DataFrame with query results, or a list if return_df=False.
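A sketch of an aggregation over document metadata. The table name (doc_meta) and columns (author) are assumptions about your metadata schema, and the exact SQL dialect is not specified in this reference; the API call only runs when credentials are configured:

```python
import os

# Illustrative SQL-like metadata query: document counts per author.
# Table and column names here are assumptions, not part of the SDK.
sql = """
SELECT author, COUNT(*) AS n_docs
FROM doc_meta
GROUP BY author
ORDER BY n_docs DESC
"""

if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(name="my_index")
    df = index.queryMeta(sql, paginate=True)  # fetch all pages of results
```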
Topic Analysis
topicSearch()
Searches for topics in the index.
def topicSearch(
    query: str = "",
    filters: str = "",
    limit: int = 100,
    semantic_search_weight: float = .3,
    semantic_search_cutoff: float = .1,
    override_args: dict = dict(),
    return_df: bool = True
) -> pd.DataFrame
Parameters:
- query: Text query for searching topics.
- filters: Filter query string.
- limit: Maximum number of topics to return.
- semantic_search_weight: Weight for semantic search.
- semantic_search_cutoff: Minimum relevance score.
- override_args: Additional arguments for the API.
- return_df: Whether to return results as a pandas DataFrame.
Returns:
- A pandas DataFrame with matching topics, or a list if return_df=False.
topicDiff()
Compares topics between two filtered subsets of the data.
def topicDiff(
    filter1: str = "",
    filter2: str = "",
    search_query1: str = "",
    search_query2: str = "",
    limit: int = 50,
    cutoff: float = 1.0,
    min_confidence: float = 95,
    semantic_search_weight: float = .3,
    semantic_search_cutoff: float = .1,
    override_args: dict = dict(),
    return_df: bool = True
) -> pd.DataFrame
Parameters:
- filter1: Filter for the first subset.
- filter2: Filter for the second subset.
- search_query1: Search query for the first subset.
- search_query2: Search query for the second subset.
- limit: Maximum number of topics to return.
- cutoff: Minimum difference to include a topic.
- min_confidence: Minimum confidence level for differences.
- semantic_search_weight: Weight for semantic search.
- semantic_search_cutoff: Minimum relevance score.
- override_args: Additional arguments for the API.
- return_df: Whether to return results as a pandas DataFrame.
Returns:
- A pandas DataFrame with topic differences, or a list if return_df=False.
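A sketch contrasting topics between two metadata slices, for example consecutive quarters of earnings calls. The filter strings shown are illustrative, since this reference does not specify the filter grammar; the API call only runs when credentials are configured:

```python
import os

# Compare topic prevalence across two slices of the index. The
# "quarter=..." filter syntax is an assumption for illustration.
diff_args = dict(
    filter1="quarter=2024Q1",
    filter2="quarter=2023Q4",
    limit=25,  # keep only the 25 most differentiated topics
)

if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(name="my_index")
    df = index.topicDiff(**diff_args)
```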
Document Operations
deleteDocs()
Deletes documents from the index by their IDs.
def deleteDocs(
    doc_ids: list[str],
    override_args: dict = dict()
) -> dict
Parameters:
- doc_ids: List of document IDs to delete.
- override_args: Additional arguments to pass to the API.
Returns:
- A dictionary with the results of the delete operation.
getDocs()
Retrieves specific documents by their IDs.
def getDocs(
    doc_ids: list[str],
    search_query: Optional[str] = None,
    topic_id: Optional[int] = None,
    topic_group_id: Optional[int] = None,
    max_excerpts_per_doc: int = 5,
    context: int = 0,
    override_args: dict = dict(),
    return_df: bool = True
) -> pd.DataFrame
Parameters:
- doc_ids: List of document IDs to retrieve.
- search_query: Optional text query to highlight matching parts.
- topic_id: Optional topic ID to filter excerpts by.
- topic_group_id: Optional topic group ID to filter excerpts by.
- max_excerpts_per_doc: Maximum number of excerpts to return per document.
- context: Number of sentences of context to include around matching text.
- override_args: Additional arguments to pass to the API.
- return_df: Whether to return results as a pandas DataFrame.
Returns:
- A pandas DataFrame with the documents, or a list of dictionaries if return_df=False.
getDocsBinary()
Retrieves documents in spaCy DocBin format.
def getDocsBinary(
    doc_ids: list[str]
) -> DocBin
Parameters:
- doc_ids: List of document IDs to retrieve.
Returns:
- A spaCy DocBin object containing the documents.
Metadata Operations
getPandata()
Retrieves index-wide metadata in MessagePack format.
def getPandata() -> dict
Returns:
- A dictionary with index metadata.
Index Management
get_status()
Gets the current status of the index.
def get_status() -> dict
Returns:
- A dictionary with the index status and metadata.
annotate()
Triggers annotation of the index data.
def annotate()
clone()
Creates a copy of the index.
def clone(new_name) -> dict
Parameters:
- new_name: Name for the cloned index.
Returns:
- A dictionary with the result of the clone operation.
delete()
Deletes the index.
def delete(force: bool) -> dict
Parameters:
- force: Must be True to confirm deletion.
Returns:
- A dictionary with the result of the delete operation.
listIndices()
Lists all available indices.
def listIndices(
    name_filter: Optional[str] = None,
    state_filter: Optional[str] = None,
    return_df: bool = True
) -> pd.DataFrame
Parameters:
- name_filter: Filter indices by name.
- state_filter: Filter indices by state.
- return_df: Whether to return results as a pandas DataFrame.
Returns:
- A pandas DataFrame with index information, or a list if return_df=False.
Job Management
listJobs()
Lists jobs associated with the index.
def listJobs(
    status: str = "RUNNING",
    job_name: Optional[str] = None,
    only_current_index: bool = True,
    return_df: bool = True
) -> pd.DataFrame
Parameters:
- status: Filter jobs by status ("RUNNING", "FAILED", "SUCCEEDED", "PENDING", "CANCELLED").
- job_name: Filter jobs by name.
- only_current_index: Whether to only show jobs for the current index.
- return_df: Whether to return results as a pandas DataFrame.
Returns:
- A pandas DataFrame with job information, or a list if return_df=False.
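Because upload, train, and commit are locking operations, checking for running jobs before starting new work is a common pattern. A sketch using the documented status values; the state grouping is an inference from those values, and the API calls only run when credentials are configured:

```python
import os

# Documented job statuses; the terminal ones will make no further progress.
STATES = {"RUNNING", "PENDING", "SUCCEEDED", "FAILED", "CANCELLED"}
TERMINAL = {"SUCCEEDED", "FAILED", "CANCELLED"}

if os.environ.get("STURDY_STATS_API_KEY"):
    from sturdystats.index import Index
    index = Index(name="my_index")
    # Avoid starting new work while the index is locked by a running job.
    running = index.listJobs(status="RUNNING", return_df=False)
    if not running:
        index.train(wait=False)
```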