@joshuanapoli joshuanapoli commented May 12, 2025

Summary

This repo is meant to be made public and published on PyPI as "cvec", so that clients can easily install and import the package.

  • Add SDK functions for read-only access to spans and metrics.
  • In this version, the SDK directly integrates with the CVector Timescale database.
  • The client connects using a tenant-specific user, with strictly limited permissions.
  • Later, these implementations will move into our back-end. The back-end API will avoid breaking clients when we redesign our database.

Span

A span is a period of interest, such as an experiment, a baseline recording session, or an alarm. The initial state of a Span is implicitly defined by a period where a given metric has a constant value.

The newest span for a metric does not have an end time, since it has not ended yet (or has not ended by the finish of the queried period).
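As a rough illustration of the definition above, the sketch below derives spans from a list of value transitions. The field names raw_begin_at and raw_end_at follow the span fields listed in the ticket; the Span dataclass and the spans_from_transitions helper are hypothetical and are not taken from the SDK. The ordering here is oldest first, which may differ from what the SDK returns.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple, Union

Value = Union[float, str]


@dataclass
class Span:
    """Hypothetical stand-in for the SDK's span object."""
    value: Value
    raw_begin_at: datetime
    raw_end_at: Optional[datetime]  # None: the span has not ended within the data


def spans_from_transitions(transitions: List[Tuple[datetime, Value]]) -> List[Span]:
    """Build spans from (time, value) pairs sorted by ascending time."""
    spans: List[Span] = []
    for time, value in transitions:
        if spans and spans[-1].value == value:
            continue  # same value repeated: the current span continues
        if spans:
            spans[-1].raw_end_at = time  # a new value closes the previous span
        spans.append(Span(value, time, None))
    return spans


def hour(h: int) -> datetime:
    return datetime(2025, 5, 12, h, tzinfo=timezone.utc)


example_spans = spans_from_transitions(
    [(hour(1), 0.0), (hour(2), 1.0), (hour(3), 1.0), (hour(4), 0.0)]
)
```

In this example the repeated value at hour 3 does not start a new span, and the last span keeps raw_end_at = None because no later transition closes it.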

In a future version, spans will be mutable. An API will allow the client to annotate metrics and edit the start/end times.

Metric

A metric is a named set of time-series data points pertaining to a particular resource (for example, the value reported by a sensor). A metric has a lifecycle of being activated or added to the system (birth_at) and later removed from the system (death_at). Metrics can have numeric or string values. Boolean values are mapped to 0 and 1.

Testing

  • Consider whether anything here is confidential, and should not be published to PyPI.
  • What do you think of the names "span" and "metric"?
  • Does the definition of span and metric make sense?
  • Is the SDK "pythonic"?
  • Are there useless comments, which should be removed?
  • Construct test cases and check that the SDK outputs make sense.

@joshuanapoli joshuanapoli marked this pull request as ready for review May 15, 2025 13:28
@joshuanapoli joshuanapoli changed the title from "Initial version: direct database access" to "Initial version: SDK for spans and metrics" May 15, 2025

# None indicates that the end time is not known; the span extends beyond
# the query period.
raw_end_at = None
Contributor

Does this not mean the first Span object's raw_end_at is always None? The function signature specifies that if end_at is given, it will be the raw_end_at of the newest Span.

Member Author

@joshuanapoli joshuanapoli May 15, 2025

I changed the documentation to match the implemented behavior in ecb633d

Member

@richardzyx richardzyx left a comment

Really significant effort, appreciate the hard work on this! The concept makes sense to me, and the code interface seems fine.

One comment/question is on build & release pipeline: I assume we'd want to package this into a .whl wheel file with .pyc compiled py bytecode? Optionally we can also obfuscate the source code with pyminifier or something similar?

There's also the option of distributing to a private registry like AWS CodeArtifact, but controlling access with the right security measures seems painful.

@joshuanapoli joshuanapoli changed the title from "Initial version: SDK for spans and metrics" to "[PD1-296] Initial version: SDK for spans and metrics" May 16, 2025
linear bot commented May 16, 2025

PD1-296 Publish CVector SDK for Ammobia

Initial Release

Data Model

This SDK integrates directly with CVector's database. Each tenant has a schema and a database user, both named for the tenant. The API Key is the password of the user. The database user is restricted to only have access to the tenant's schema. Here are the available database tables:

CREATE TABLE tag_data (
    tag_name_id INTEGER NOT NULL,
    tag_value_changed_at TIMESTAMP WITH TIME ZONE,
    tag_value DOUBLE PRECISION
);
CREATE TABLE tag_data_str (
    tag_name_id INTEGER NOT NULL,
    tag_value_changed_at timestamptz NOT NULL,
    tag_value text
);
CREATE TABLE tag_names (
    id SERIAL PRIMARY KEY,
    normalized_name VARCHAR NOT NULL,
    birth_at TIMESTAMPTZ NULL,
    death_at TIMESTAMPTZ NULL
);
CREATE VIEW metrics AS
 SELECT td.tag_value AS value,
    td.tag_value_changed_at AS "time",
    tn.normalized_name AS metric
   FROM tag_data td
     JOIN tag_names tn ON td.tag_name_id = tn.id;
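
To make the shape of the metrics view concrete, here is a minimal sketch that rebuilds the tables and view in an in-memory SQLite database and queries one metric. SQLite is only a stand-in for the Timescale database, so the column types are simplified (TEXT timestamps, INTEGER PRIMARY KEY instead of SERIAL), and the tag name and sample rows are invented for illustration. Note that the view as written covers only the numeric tag_data table, not tag_data_str.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tag_names (
        id INTEGER PRIMARY KEY,
        normalized_name TEXT NOT NULL,
        birth_at TEXT,
        death_at TEXT
    );
    CREATE TABLE tag_data (
        tag_name_id INTEGER NOT NULL,
        tag_value_changed_at TEXT,
        tag_value REAL
    );
    CREATE VIEW metrics AS
        SELECT td.tag_value AS value,
               td.tag_value_changed_at AS "time",
               tn.normalized_name AS metric
        FROM tag_data td
        JOIN tag_names tn ON td.tag_name_id = tn.id;
    """
)

# Invented sample data: one tag with two value transitions.
conn.execute(
    "INSERT INTO tag_names (id, normalized_name) VALUES (1, 'pump_1_pressure')"
)
conn.executemany(
    "INSERT INTO tag_data VALUES (?, ?, ?)",
    [
        (1, "2025-05-12T00:00:00+00:00", 1.5),
        (1, "2025-05-12T00:01:00+00:00", 1.7),
    ],
)

# Read the transitions for one metric through the view, oldest first.
rows = conn.execute(
    'SELECT metric, "time", value FROM metrics WHERE metric = ? ORDER BY "time"',
    ("pump_1_pressure",),
).fetchall()
```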

CVec Class

The SDK provides an API client class named CVec with the following functions.

  • __init__(?host, ?tenant, ?api_key, ?default_time_range)
    Set up the SDK with the given host, tenant, and API key. Any of these that is not passed to the constructor is loaded from the environment variables CVEC_HOST, CVEC_TENANT, and CVEC_API_KEY. The default_time_range constrains most API calls, and can be overridden by the time_range argument to each API function.
  • get_spans(tag_name, ?time_range, ?limit)
    Return the time spans during which a tag has a constant value, as a list of spans. Each span has the following fields: {id, tag_name, value, begin_at, end_at, raw_begin_at, raw_end_at, metadata}. In a future version of the SDK, spans can be annotated, edited, and deleted.
  • get_metric_data(?tag_names, ?time_range)
    Return all data-points within a given time-range, optionally selecting a given list of tags. The return value is a Pandas DataFrame with three columns: tag_name, time, value. One row is returned for each tag value transition.
  • get_tags(?time_range)
    Return a list of tags that had at least one transition in the given time range. All tags are returned if no time_range is given. Each tag has {id, name, birth_at, death_at}.
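
The constructor's fallback to environment variables, described in the list above, could look roughly like the sketch below. The variable names CVEC_HOST, CVEC_TENANT, and CVEC_API_KEY come from the ticket; the CVecConfig class, the resolve_setting helper, the env parameter, and the choice to raise ValueError when a setting is missing are all assumptions made for illustration, not the SDK's actual behavior.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str], env_var: str, env: Mapping[str, str]
) -> str:
    """Prefer the explicit argument, then fall back to the environment."""
    if explicit is not None:
        return explicit
    value = env.get(env_var)
    if value is None:
        # Assumed behavior: fail early when neither source provides a value.
        raise ValueError(f"{env_var} must be set or passed to the constructor")
    return value


class CVecConfig:
    """Hypothetical holder for the connection settings of a CVec client."""

    def __init__(
        self,
        host: Optional[str] = None,
        tenant: Optional[str] = None,
        api_key: Optional[str] = None,
        env: Mapping[str, str] = os.environ,
    ) -> None:
        self.host = resolve_setting(host, "CVEC_HOST", env)
        self.tenant = resolve_setting(tenant, "CVEC_TENANT", env)
        self.api_key = resolve_setting(api_key, "CVEC_API_KEY", env)


# An explicit host overrides the environment; tenant and key fall back to it.
fake_env = {"CVEC_HOST": "ignored", "CVEC_TENANT": "acme", "CVEC_API_KEY": "secret"}
config = CVecConfig(host="db.example.com", env=fake_env)
```

Passing the environment as a mapping keeps the fallback logic testable without touching the real process environment.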

Future Features

Out of scope for this issue.

  • sample_metric_data(bucket_width, ?bucket_function, ?aggregate_function, ?bucket_offset, ?tag_names, ?time_range)
    Get metric data, resampled on regular time buckets. The only supported bucket_function is LOCF, meaning Last Observation Carried Forward. The only supported aggregate_function is AVERAGE. The function returns a Pandas DataFrame with a column for each tag_name, plus a time column.
  • Island detection with user-defined criteria: instead of this, define a synthetic tag based on a function of other tag values. Then use get_spans based on the synthetic tag.
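
For reference, Last Observation Carried Forward on regular buckets can be sketched in plain Python as below. This is only an illustration of the LOCF idea for a single tag: the sample_locf name and its parameters are hypothetical, the AVERAGE aggregate and bucket_offset are left out, and it returns a list of tuples rather than the Pandas DataFrame the ticket describes.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def sample_locf(
    points: List[Tuple[datetime, float]],
    start: datetime,
    end: datetime,
    bucket_width: timedelta,
) -> List[Tuple[datetime, Optional[float]]]:
    """Sample (time, value) points, sorted ascending, at each bucket start.

    Each bucket takes the most recent value observed at or before its start.
    Buckets before the first observation get None.
    """
    samples: List[Tuple[datetime, Optional[float]]] = []
    last: Optional[float] = None
    i = 0
    bucket = start
    while bucket < end:
        # Consume every observation at or before this bucket boundary.
        while i < len(points) and points[i][0] <= bucket:
            last = points[i][1]
            i += 1
        samples.append((bucket, last))
        bucket += bucket_width
    return samples


t0 = datetime(2025, 5, 12, tzinfo=timezone.utc)
observations = [
    (t0 + timedelta(seconds=30), 1.0),
    (t0 + timedelta(minutes=2), 2.0),
]
samples = sample_locf(
    observations, t0, t0 + timedelta(minutes=4), timedelta(minutes=1)
)
sampled_values = [value for _, value in samples]
```

Here the first bucket has no prior observation, and the value 2.0 is carried forward into the final bucket because nothing replaces it.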

@joshuanapoli
Member Author

> One comment/question is on build & release pipeline: I assume we'd want to package this into a .whl wheel file with .pyc compiled py bytecode? Optionally we can also obfuscate the source code with pyminifier or something similar?

The pyc files are specific to particular versions of Python. PEP 3147 was implemented and adds the possibility of distributing a library using only pyc files (by compiling for every version of Python), but I can't find any distribution tool that creates this kind of archive. The wisdom of the internet: "If you don't want to distribute source, then you shouldn't use Python."

I'll add pyminifier with the obfuscate option, remove the schema documentation, and won't make this repo public.
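
The version-specific nature of pyc files mentioned above can be checked from the standard library. The short sketch below asks the running interpreter where it would cache the bytecode for a module file; the file name cvec.py is only an assumed example, and the exact tag depends on the interpreter that runs the snippet.

```python
import importlib.util
import sys

# The cache tag identifies the interpreter and version, e.g. "cpython-312".
cache_tag = sys.implementation.cache_tag

# PEP 3147 layout: bytecode lives under __pycache__ with the tag in its name.
pyc_path = importlib.util.cache_from_source("cvec.py")

print(cache_tag, pyc_path)
```

Because the tag is part of the file name, a pyc-only distribution would need one compiled file per supported interpreter version.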

@joshuanapoli
Member Author

> Optionally we can also obfuscate the source code with pyminifier or something similar?

I tried pyminifier, but it makes the package unusable. I'll see if I can move the implementation to the back-end, so that this library doesn't expose our database structure.

@joshuanapoli joshuanapoli merged commit d99e101 into main May 19, 2025
4 checks passed
@joshuanapoli joshuanapoli deleted the jn/initial branch May 19, 2025 21:35

4 participants