Sturdy Statistics API Documentation (1.0.0)

License: unlicensed

Documentation for the suite of Sturdy Statistics API solutions.

api/text/v1/index

Create Index

Creates a new index. An index is the core data structure for storing data. Once the index is trained (see documentation for the train endpoint), an index may also be used to search, query, and analyze data. If an index with the provided name already exists, no index will be created and the metadata of that index will be returned.
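
As a rough illustration of the request described above, the following sketch builds (without sending) a create-index request and summarizes the documented response fields. The host in BASE_URL, the POST method, and passing api_key as a header are assumptions; this page only states the path prefix api/text/v1/index and the name field. Both helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: placeholder host. Only the path prefix appears in this document.
BASE_URL = "https://api.example.com/api/text/v1/index"


def build_create_index_request(name: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request that creates an index called `name`."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL,
        data=body,
        method="POST",  # assumption: creation uses POST
        # Assumption: the key is sent in an `api_key` header.
        headers={"Content-Type": "application/json", "api_key": api_key},
    )


def describe_create_response(payload: dict) -> str:
    """Summarize the response fields: index_id, state, already_exists."""
    verb = "reused existing" if payload.get("already_exists") else "created"
    return f"{verb} index {payload['index_id']} (state: {payload['state']})"
```

Checking already_exists before uploading is a cheap way to avoid writing into an index that another process created earlier under the same name.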

Authorizations:
api_key
Request Body schema: application/json
name
string

Responses

Request samples

Content type
application/json
{
  "name": "string"
}

Response samples

Content type
application/json
{
  "index_id": "index_a3cd8f52a42b4ee3841dacfe9408d4cd",
  "name": "Index Name",
  "state": "untrained",
  "already_exists": false
}

List Indices

Returns a list of all indices tied to your API key.

Authorizations:
api_key
header Parameters
api_key
required
string

API Key.

Responses

Response samples

Content type
application/json
[
  { }
]

Single Index Info

Returns all metadata belonging to the specified index.

Authorizations:
api_key
path Parameters
index_id
required
string
query Parameters
api_key
required
string

API Key.

Responses

Response samples

Content type
application/json
{
  "index_id": "index_a3cd8f52a42b4ee3841dacfe9408d4cd",
  "name": "Index_Name_1",
  "state": "untrained"
}

Upload Docs

Uploads documents to a temporary staging index for processing and/or storage. Documents are processed by the AI model if the index has been trained (see the train endpoint), and are stored if the parameter save=true.

Documents are provided as a list of dictionaries. The content of each document must be plain text and is provided under the required field doc. You may provide a unique document identifier under the optional field doc_id. If no doc_id is provided, we will create an identifier by hashing the contents of the document. Documents can be updated via an upsert mechanism that matches on doc_id. If doc_id is not provided and two docs have identical content, the most recently uploaded document will upsert the previously uploaded document.

Any additional fields in the document dictionary are stored within the index as metadata and will become available for training and querying tasks (see documentation for the train and query endpoints). While arbitrary metadata may be included and later queried, "binary" and "tag" data are required for supervised training of an index (no metadata is required for unsupervised training). See the documentation for label_field_names and tag_field_names in the train endpoint for more information about the required formats.

If the index has been trained, the response contains predictions produced by the trained AI model for each document. To obtain predictions without locking or mutating the index, set save=false.

If you wish to update an existing doc's metadata, you can do a shallow update by leaving the doc content field either empty or unchanged. The shallow update skips the model inference step and only updates or appends the metadata fields passed to the API, which significantly speeds up the call.

Uploading docs is a locking operation. A client cannot call upload, train or commit while an upload is already in progress. Consequently, the operation is more efficient with batches of documents. The API supports a batch size of up to 1000 documents at a time. The larger the batch size, the more efficient the upload.

Nota Bene

Uploaded documents are saved in a staging index. The index is unaffected until a commit request is sent.
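
The document format and batch limit described above can be sketched as follows. The helper names make_doc and upload_payloads are hypothetical; only the doc, doc_id, save, and docs fields and the 1000-document limit come from this page. The sketch prepares request bodies and does not send them.

```python
from typing import Iterator, List, Optional

MAX_BATCH_SIZE = 1000  # documented upper limit per upload call


def make_doc(text: str, doc_id: Optional[str] = None, **metadata) -> dict:
    """Build one document dictionary: `doc` is required, `doc_id` is optional.

    Any extra keyword arguments are stored as metadata fields.
    """
    doc = {**metadata, "doc": text}
    if doc_id is not None:
        doc["doc_id"] = doc_id
    return doc


def upload_payloads(docs: List[dict], save: bool = True,
                    batch_size: int = MAX_BATCH_SIZE) -> Iterator[dict]:
    """Yield one request body per batch of at most `batch_size` documents."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(docs), batch_size):
        yield {"save": save, "docs": docs[start:start + batch_size]}
```

Because uploading is a locking operation, sending the yielded bodies one after another (rather than concurrently) is the safer pattern. A shallow metadata update would, under the same assumptions, look like make_doc("", doc_id="a1", reviewed=True).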

Authorizations:
api_key
path Parameters
index_id
required
string
Request Body schema: application/json
save
boolean

If true (default), save the docs to the staging index. If false, drop the docs after processing. This is useful for obtaining prediction results without locking or changing the index.

Array of objects

Responses

Request samples

Content type
application/json
{
  "save": true,
  "docs": [ ]
}

Response samples

Content type
application/json
[
  { },
  { }
]

Query Docs

Queries the specified index. Queries are flexible, and may contain any combination of:

  1. SQL conditionals applied to metadata stored in the index. See documentation for the filters parameter. Note that this option requires metadata to be stored in the index; see documentation for the upload endpoint.
  2. Thematic grouping as identified by the trained AI model. You may filter by either a single topic group (topic_group_id) or by an arbitrary list of specific topics (topic_ids). Note that these options require the index to have been trained. You may filter by either topic_ids or by topic_group_id but not both simultaneously.
  3. Search. The query may include a search term to filter documents. The search is semantic, based on inferred thematic content, but terms enclosed in double quotes will impose an exact match filter. See documentation for the query parameter.

The API returns a dictionary containing three different entries. Each entry provides a distinct view into how the data in the index matches your query, with increasing degrees of abstraction:

  1. A ranked list of document objects. This provides a direct list of documents matching the query. By default these documents are sorted by relevance, but you may optionally specify any metadata field to sort by (see documentation for the sort_by parameter). Each result contains a short excerpt from the document that evinces your query; see documentation for the summarize_by parameter for information about this excerpt. Each result also contains all prediction values associated with the document, all metadata fields associated with the document, and an ordered list of the topics associated with the document.
  2. A ranked list of topic objects. This thematic information is abstracted from the documents in the index, but still provides a granular summary of the content present in the data which matches the query. These topic objects are sorted by prevalence (i.e., % occurrence), not relevance (i.e., degree of match to the query). Each topic has a topic_id that can be used for additional downstream queries (see documentation for the topic_ids parameter).
  3. A ranked list of topic group objects. This provides the highest-level overview of the content in the index which matches the query. These are also sorted by prevalence. Each topic group has a topic_group_id that can be used for additional downstream queries (see documentation for the topic_group_id parameter).
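
The query options above can be combined into a single query string. The following is a minimal sketch; build_query_params is a hypothetical helper, and it only encodes parameters named on this page. It also enforces the rule that topic_ids and topic_group_id cannot be used together.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def build_query_params(
    query: Optional[str] = None,
    topic_ids: Optional[Sequence[int]] = None,
    topic_group_id: Optional[int] = None,
    filters: Optional[str] = None,
    sort_by: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Return a URL-encoded query string for the documented parameters."""
    if topic_ids and topic_group_id is not None:
        raise ValueError("filter by topic_ids or topic_group_id, not both")
    params = {}
    if query:
        params["query"] = query
    if topic_ids:
        # Documented format: comma-separated list, e.g. 1,12,25
        params["topic_ids"] = ",".join(str(t) for t in topic_ids)
    if topic_group_id is not None:
        params["topic_group_id"] = topic_group_id
    if filters:
        params["filters"] = filters
    if sort_by:
        params["sort_by"] = sort_by
    if limit is not None:
        params["limit"] = limit
    return urlencode(params)
```

For example, build_query_params(query='"solar" panels', topic_ids=[1, 12, 25]) requests an exact match on "solar", a semantic match on the rest of the search text, and a restriction to three topics.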
Authorizations:
api_key
path Parameters
index_id
required
string
query Parameters
query
string

A search query that can be used to filter or sort document objects. By default the search will support a fuzzy match. Any word wrapped in double quotes "word" will be treated as an exact match filter.

topic_ids
string

Supports filtering on one or more topics. Topics are unsupervised granular themes inferred from the client's data at training time that can be used to index the data. Expected input format is a comma-separated list of topic_ids, e.g. 1,12,25

topic_group_id
integer

Supports filtering on a single topic group. Topic groups are unsupervised high-level (rather than granular) categories learned from the client's data at training time that can be used to index the data. Expected input format is a single integer.

filters
string

filters is a string of SQL conditionals that defines the boolean criteria for your query. The filters clause supports any operation available in DuckDB. The filters can operate on any metadata you have uploaded and on any prediction values tied to your data. Example filter -- published > "2024-01-01" AND pred_sale > .8

sort_by
string

Define a field by which to sort. By default, the docs will be sorted by relevance. The client can choose to sort by any value present in its metadata.

ascending
boolean
context
integer

The number of paragraphs above and below the selected excerpt to return.

limit
integer
offset
integer
header Parameters
api_key
required
string

API Key.

Responses

Response samples

Content type
application/json
[
  { }
]

Query Docs by ID

Loads a specified set of docs from the index. Supports a comma-separated list of up to 500 doc_ids at a time. Additionally, the API lets you specify the summary style of the docs by providing either a comma-separated list of topic_ids, a topic_group_id, or a search query. The API will extract the most thematically relevant section of each doc to return. The API returns a dictionary with a docs field that contains a list of document objects in the order specified in the query.
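
To stay within the 500-ID limit, a client can split a long list of doc_ids into several requests. The sketch below is one way to do this; doc_id_segments is a hypothetical helper, and percent-encoding each ID before joining is an assumption about how the server parses the path parameter.

```python
from typing import List, Sequence
from urllib.parse import quote

MAX_DOC_IDS_PER_REQUEST = 500  # documented limit


def doc_id_segments(doc_ids: Sequence[str],
                    chunk_size: int = MAX_DOC_IDS_PER_REQUEST) -> List[str]:
    """Split doc_ids into comma-separated path segments of at most chunk_size IDs."""
    if not 1 <= chunk_size <= MAX_DOC_IDS_PER_REQUEST:
        raise ValueError(f"chunk_size must be between 1 and {MAX_DOC_IDS_PER_REQUEST}")
    segments = []
    for start in range(0, len(doc_ids), chunk_size):
        chunk = doc_ids[start:start + chunk_size]
        # Encode each ID so a comma or slash inside an ID is not read as a separator.
        segments.append(",".join(quote(doc_id, safe="") for doc_id in chunk))
    return segments
```

Each returned segment corresponds to one request; results can be concatenated in order, since the API returns documents in the order requested.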

Authorizations:
api_key
path Parameters
index_id
required
string
doc_id
required
string

A comma-separated list of doc_ids. The API supports up to 500 per request.

query Parameters
query
string

A search query that can be used to summarize document objects.

topic_ids
string

Supports matching on one or more topics for granular summaries. Topics are unsupervised granular themes inferred from the client's data at training time that can be used to index the data. Expected input format is a comma-separated list of topic_ids, e.g. 1,12,25

topic_group_id
integer

Supports matching on a single topic group for high-level summaries. Topic groups are unsupervised high-level (rather than granular) categories learned from the client's data at training time that can be used to index the data. Expected input format is a single integer.

context
integer

The number of paragraphs above and below the selected excerpt to return.

header Parameters
api_key
required
string

API Key.

Responses

Response samples

Content type
application/json
[
  { }
]

Query Doc Meta

This API allows you to run arbitrary SQL queries directly against your index's metadata. The available fields include doc_id as well as any metadata you uploaded with your documents.

Authorizations:
api_key
path Parameters
index_id
required
string
query Parameters
query
required
string

query is a SQL query that defines the data you are interested in. The statement must query the doc_meta table for document-level information or the doc_meta_para table for paragraph-level information. The SQL query supports any operation available in DuckDB. Example query -- SELECT doc_id FROM doc_meta WHERE published > "2024-01-01" AND pred_sale > .8
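
A small helper can keep metadata queries pointed at one of the two documented tables. This is only a sketch: build_meta_sql and meta_query_string are hypothetical names, and the helper does not validate or escape the SQL fragments passed to it.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

ALLOWED_TABLES = {"doc_meta", "doc_meta_para"}  # the two documented tables


def build_meta_sql(columns: Sequence[str], where: Optional[str] = None,
                   table: str = "doc_meta") -> str:
    """Assemble a SELECT statement against doc_meta or doc_meta_para."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table must be one of {sorted(ALLOWED_TABLES)}")
    if not columns:
        raise ValueError("at least one column is required")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql


def meta_query_string(sql: str) -> str:
    """URL-encode the statement as the `query` parameter."""
    return urlencode({"query": sql})
```

For instance, build_meta_sql(["doc_id"], "pred_sale > .8") produces a statement of the same shape as the example query in the parameter description.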

header Parameters
api_key
required
string

API Key.

Responses

Response samples

Content type
application/json
[
  { }
]

Commit Docs

Permanently applies all changes made to the staging index to the production index. Note that upload saves new documents only to the staging index; a commit is necessary to use those documents for querying or for training. This is a locking operation: no data can be uploaded, trained, or committed while a commit is in progress.

Authorizations:
api_key
path Parameters
index_id
required
string
Request Body schema: application/json
api_key
string

Responses

Request samples

Content type
application/json
{
  "api_key": "string"
}

Response samples

Content type
application/json
{
  "job_id": "a3cd8f52a42b4ee3841dacfe9408d4cd"
}

Unstage Docs

Reverts all changes made to the staging index, resetting it to match the state of the production index.

Authorizations:
api_key
path Parameters
index_id
required
string
Request Body schema: application/json
api_key
string

Responses

Request samples

Content type
application/json
{
  "api_key": "string"
}

Response samples

Content type
application/json
{
  "job_id": "a3cd8f52a42b4ee3841dacfe9408d4cd"
}

Clone Index

Creates a deep copy of the selected index.

Authorizations:
api_key
path Parameters
index_id
required
string
Request Body schema: application/json
new_name
string

Responses

Request samples

Content type
application/json
{
  "new_name": "string"
}

Response samples

Content type
application/json
{
  "job_id": "a3cd8f52a42b4ee3841dacfe9408d4cd"
}

Delete Index

Deletes the selected index.

Authorizations:
api_key
path Parameters
index_id
required
string
Request Body schema: application/json
Schema not provided

Responses

Response samples

Content type
application/json
{
  "job_id": "a3cd8f52a42b4ee3841dacfe9408d4cd"
}

Train Index

Trains an AI model on all documents in the production index. Once an index has been trained, documents are queryable (see documentation for the query endpoint), and the model automatically processes subsequently uploaded documents (see documentation for the upload endpoint).

The AI model identifies thematic information in documents, permitting semantic indexing and semantic search. It also enables quantitative analysis of, e.g., topic trends.

The AI model may optionally be supervised using metadata present in the index. Thematic decomposition of the data is not unique; supervision guides the model and aligns the identified topics to your intended application. Supervision also allows the model to make predictions.

Data for supervision may be supplied explicitly using the label_field_names parameter. Metadata field names listed in this parameter must each store data in a ternary true/false/unknown format. For convenience, supervision data may also be supplied in a sparse "tag" format using the tag_field_names parameter. Metadata field names listed in this parameter must contain a list of labels for each document. The document is considered "true" for each label listed; it is implicitly considered "false" for each label not listed. Consequently, the "tag" format does not allow for unknown labels. Any combination of label_field_names and tag_field_names may be supplied.
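
The relationship between the two supervision formats can be illustrated by expanding a tag list into explicit ternary labels. In this sketch, tags_to_label_fields is a hypothetical helper, and representing "unknown" as None (JSON null) is an assumption based on NULL appearing among the valid label values.

```python
from typing import Dict, Iterable, Optional


def tags_to_label_fields(tags: Iterable[str], vocabulary: Iterable[str],
                         unknown: Iterable[str] = ()) -> Dict[str, Optional[bool]]:
    """Expand a sparse tag list into one true/false/unknown label per tag.

    A listed tag becomes True and an unlisted tag becomes False, matching the
    "tag" format. Tags named in `unknown` become None instead, which the tag
    format itself cannot express.
    """
    present = set(tags)
    undecided = set(unknown)
    labels: Dict[str, Optional[bool]] = {}
    for tag in vocabulary:
        if tag in undecided:
            labels[tag] = None  # assumption: null marks an unknown label
        else:
            labels[tag] = tag in present
    return labels
```

The resulting keys could be merged into a document's metadata before upload and then listed in label_field_names instead of tag_field_names.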

Authorizations:
api_key
path Parameters
index_id
required
string
Request Body schema: application/json
label_field_names
Array of strings

A list of fields that denote binary labels. The model will use these fields as training data and predict their values for all future docs to be uploaded. Valid values for field1 in each doc are 1,0,-1,True,False,NULL

Example ["field1","field2"]. Predictions will be written to pred_field1 and pred_field2.

tag_field_names
Array of strings

A list of fields that contain tags. E.g. if a doc has an attribute genre, it might be tagged with the string fiction,non-fiction,sci-fi. The presence of a tag implies a True and the absence of a tag implies a False for training input. If that is not the case, consider manually converting your tags to binary and passing those fields into label_field_names.

doc_hierarchy
Array of strings

This is used for adding hierarchy to the indexing model by leveraging attributes present in the uploaded data. This is a more advanced feature for those familiar with Bayesian analysis.

K
string

This parameter sets the maximum number of topics to learn from the data. We support a range of 32-512, with a default of 192. The runtime of training is linear with the number of topics.
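
The request body parameters above can be assembled and checked client-side before starting a training job. The sketch below uses hypothetical helper names; the 32-512 range, the default of 192, the string type of K, and the pred_ prefix come from this page.

```python
from typing import List, Sequence

MIN_K, MAX_K, DEFAULT_K = 32, 512, 192  # documented range and default


def build_train_payload(label_field_names: Sequence[str] = (),
                        tag_field_names: Sequence[str] = (),
                        doc_hierarchy: Sequence[str] = (),
                        k: int = DEFAULT_K) -> dict:
    """Build the JSON body for a train request, validating K locally."""
    if not MIN_K <= k <= MAX_K:
        raise ValueError(f"K must be between {MIN_K} and {MAX_K}")
    return {
        "label_field_names": list(label_field_names),
        "tag_field_names": list(tag_field_names),
        "doc_hierarchy": list(doc_hierarchy),
        "K": str(k),  # the schema lists K as a string
    }


def prediction_field_names(label_field_names: Sequence[str]) -> List[str]:
    """Names under which predictions are written, e.g. field1 -> pred_field1."""
    return [f"pred_{name}" for name in label_field_names]
```

Since training time grows linearly with the number of topics, a smaller K is a reasonable starting point for quick experiments.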

Responses

Request samples

Content type
application/json
{
  "label_field_names": [ ],
  "tag_field_names": [ ],
  "doc_hierarchy": [ ],
  "K": "string"
}

Response samples

Content type
application/json
{
  "job_id": "a3cd8f52a42b4ee3841dacfe9408d4cd"
}

Topic Diff

Returns information about the most conditionally prevalent topics.

Authorizations:
api_key
path Parameters
index_id
required
string
query Parameters
q1
required
string

q1 is a string of SQL conditionals that defines the boolean criteria for your query. This query should denote the subset of data for which you are interested in obtaining a high-level summary.

The SQL clause supports any operation available in DuckDB. The SQL clause can operate on any metadata you have uploaded and on any prediction values tied to your data. Example q1 filter -- published > "2024-01-01" AND pred_sale > .8

q2
string

q2 is a string of SQL conditionals that defines the boolean criteria for your query. This query should denote the set of data against which you want to compare your q1 filter. This field is optional and by default q1 is compared against the entire corpus.

The SQL clause supports any operation available in DuckDB. The SQL clause can operate on any metadata you have uploaded and on any prediction values tied to your data. Example q2 filter -- published > "2024-01-01" AND pred_sale > .8

min_confidence
number

Sets a minimum confidence threshold between 0-1 that a topic changed when comparing q1 to q2.

header Parameters
api_key
required
string

API Key.

Responses

Response samples

Content type
application/json
{
  "topics": [ ]
}
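
For the Topic Diff parameters above, a client might validate inputs before issuing the request. This is a minimal sketch with a hypothetical helper name; it checks only what this page states: q1 is required, q2 is optional, and min_confidence lies between 0 and 1.

```python
from typing import Optional
from urllib.parse import urlencode


def build_topic_diff_params(q1: str, q2: Optional[str] = None,
                            min_confidence: Optional[float] = None) -> str:
    """Return a URL-encoded query string for a Topic Diff request."""
    if not q1.strip():
        raise ValueError("q1 is required")
    params = {"q1": q1}
    if q2:
        # When q2 is omitted, q1 is compared against the entire corpus.
        params["q2"] = q2
    if min_confidence is not None:
        if not 0.0 <= min_confidence <= 1.0:
            raise ValueError("min_confidence must be between 0 and 1")
        params["min_confidence"] = min_confidence
    return urlencode(params)
```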

Topic Search

Returns information about the most conditionally prevalent topics.

Authorizations:
api_key
path Parameters
index_id
required
string
query Parameters
query
string

A search query that can be used to summarize document objects.

filters
string

filters is a string of SQL conditionals that defines the boolean criteria for your query. The filters clause supports any operation available in DuckDB. The filters can operate on any metadata you have uploaded and on any prediction values tied to your data. Example filter -- published > "2024-01-01" AND pred_sale > .8

doc_id
required
string

A comma-separated list of doc_ids. The API supports up to 500 per request.

header Parameters
api_key
required
string

API Key.

Responses

Response samples

Content type
application/json
{
  "topics": [ ]
}

api/text/v1/job

Job Info

Returns the status of the current job.
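
Operations such as commit, train, and clone return a job_id, so a client will typically poll this endpoint until the job finishes. The sketch below takes the HTTP call as an injected function (fetch_job, hypothetical) so that no URL needs to be assumed. Treating SUCCEEDED, FAILED, and CANCELLED as terminal is an assumption based on the status values listed for List Jobs.

```python
import time
from typing import Callable

# Assumption: these statuses mean the job will not change state again.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(job_id: str,
                 fetch_job: Callable[[str], dict],
                 poll_seconds: float = 2.0,
                 max_polls: int = 150,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll fetch_job(job_id) until the job reaches a terminal status.

    fetch_job should call the Job Info endpoint and return the parsed JSON.
    Raises TimeoutError if the job is still running after max_polls checks.
    """
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Callers should still inspect the returned status, since a FAILED or CANCELLED job also ends the loop.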

Authorizations:
api_key
path Parameters
job_id
required
string
query Parameters
api_key
required
string

API Key.

Responses

Response samples

Content type
application/json
{
  "finishedAt": "2024-11-05T01:41:47.637000+00:00",
  "index_id": "index_05a7cb07da764f1aa399ce65ab06",
  "job_id": "7e5154f4-cfcd-4fc7-94c3-5a168f83",
  "job_name": "commitIndexV2",
  "result": { },
  "startedAt": "2024-11-05T01:41:26.835000+00:00",
  "status": "SUCCEEDED"
}

List Jobs

Lists all jobs matching the filter criteria.

Authorizations:
api_key
query Parameters
api_key
required
string

API Key.

index_id
string

Filter on a specified index_id.

status
string

Filter on the status of the job. One of 'PENDING', 'RUNNING', 'CANCELLED', 'SUCCEEDED', 'FAILED'.

job_name
string

Filter on the name of the job.

Responses

Response samples

Content type
application/json
[
  { }
]