24 changes: 7 additions & 17 deletions TODO.md
@@ -2,12 +2,10 @@

Meta-todo: Gradually move work items from here to repo Issues.

# Leftover TODOs from TADA.md
# Leftover TODOs from elsewhere

## Software

Minor:

- Improve load_dotenv() (don't look for `<repo>/ts/.env`, use one loop)

### Specifically for VTT import (minor)
@@ -22,14 +20,10 @@ Minor:

## Documentation

- Document how to reproduce the demos from the talk (and podcast)
- Document test/build/release process
- Document how to use gmail_dump.py (set up a project etc.)

Maybe later:

- Document how to run evaluations (but don't reveal all the data)

- Test/build/release process
- How to run evaluations (but don't share the data)
- Low-level APIs -- at least the key parts that are used directly by the
high-level APIs

# TODOs for fully implementing persistence through SQLite

@@ -79,19 +73,16 @@

# From Meeting 8/12/2025 afternoon (edited)

- Indexing (knowledge extraction) operates chunk by chunk
- TimeRange always points to a TextRange
- Always import VTT, helper to convert podcast to VTT format
(Probably not, podcast format has listeners but VTT doesn't)
- Rename "Ordinal" to "Id"

# Other stuff

### Left to do here

- Look more into why the search query schema is so instable
- Look more into why the search query schema is so unstable
- Implement at least some @-commands in query.py
- More debug options (turn on/off various debug prints dynamically)

- Use pydantic.ai for model drivers

## General: Look for things marked as incomplete in source
@@ -124,7 +115,6 @@ Maybe later:
- Review Copilot-generated tests for sanity and minimal mocking
- Add new tests for newly added classes/methods/functions
- Coverage testing (needs to use a mix of indexing and querying)
- Automated end-to-end tests using Umesh's test data files

## Tighter types

12 changes: 12 additions & 0 deletions demo/README.md
@@ -0,0 +1,12 @@
# Demo scripts

The files here are the scripts from
[Getting Started](../docs/getting-started.md).

- [ingest.py](ingest.py): The ingestion script.
- [query.py](query.py): The query script.
- [testdata.txt](testdata.txt): The test data.

Note that for any of this to work you need to acquire an OpenAI API key
and set some variables; see
[Environment Variables](../docs/env-vars.md).
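
With the key configured, a typical session runs the two scripts in order
(a sketch; the exact output will vary):

```sh
python ingest.py   # index the messages from testdata.txt into demo.db
python query.py    # ask questions about the indexed conversation
```
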
2 changes: 1 addition & 1 deletion demo/demo.py → demo/ingest.py
@@ -21,7 +21,7 @@ def read_messages(filename) -> list[TranscriptMessage]:

async def main():
conversation = await create_conversation("demo.db", TranscriptMessage)
messages = read_messages("transcript.txt")
messages = read_messages("testdata.txt")
print(f"Indexing {len(messages)} messages...")
results = await conversation.add_messages_with_indexing(messages)
print(f"Indexed {results.messages_added} messages.")
File renamed without changes.
38 changes: 36 additions & 2 deletions docs/demos.md
@@ -1,5 +1,7 @@
# How to Reproduce the Demos

All demos require [configuring](env-vars.md) an API key etc.

## How we did the Monty Python demo

The demo consisted of loading a number (specifically, 11) of popular
@@ -19,7 +21,6 @@ This is `tools/ingest_vtt.py`. You run it as follows:
```sh
python tools/ingest_vtt.py FILE1.vtt ... FILEN.vtt -d mp.db
```
(This requires [configuring](env-vars.md) an API key etc.)

The process took maybe 15 minutes for 11 sketches.

@@ -72,4 +73,37 @@ used the instructions at [GeeksForGeeks

The rest of the email ingestion pipeline doesn't care where you got
your `*.eml` files from -- every email provider has its own quirks.

## Bonus content: Podcast demo

The podcast demo is actually the easiest to run:
The "database" is included in the repo as
`testdata/Episode_53_AdrianTchaikovsky_index*`,
and this is in fact the default "database" used by `tools/query.py`
when no `-d`/`--database` flag is given.

This "database" indexes `test/Episode_53_AdrianTchaikovsky.txt`.
It was created by a one-off script that invoked
`typeagent/podcast/podcast_ingest/ingest_podcast()`
and saved the result to two files by calling the `.ingest()` method on
the returned `typeagent/podcasts/podcast/Podcast` object.
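
For reference, here is a sketch of what that one-off script might have
looked like. The import path, the function signature, and the argument
strings are assumptions inferred from the names above, not the actual API:

```py
import asyncio

# Hypothetical import path, inferred from the module names mentioned above.
from typeagent.podcast.podcast_ingest import ingest_podcast


async def main():
    # Hypothetical signature: ingest the transcript and get back a Podcast object.
    podcast = await ingest_podcast("test/Episode_53_AdrianTchaikovsky.txt")
    # Hypothetical: write the index out as the two *_index files that
    # tools/query.py loads by default.
    podcast.ingest("testdata/Episode_53_AdrianTchaikovsky_index")


asyncio.run(main())
```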

Here's a brief sample session:
```sh
$ python tools/query.py
1.318s -- Using Azure OpenAI
0.054s -- Loading podcast from 'testdata/Episode_53_AdrianTchaikovsky_index'
TypeAgent demo UI 0.2 (type 'q' to exit)
TypeAgent> What did Kevin say to Adrian about science fiction?
--------------------------------------------------
Kevin Scott expressed his admiration for Adrian Tchaikovsky as his favorite science fiction author. He mentioned that Adrian has a new trilogy called The Final Architecture, and Kevin is eagerly awaiting the third book, Lords of Uncreation, which he has had on preorder for months. Kevin praised Adrian for his impressive writing skills and his ability to produce large, interesting science fiction books at a rate of about one per year.
--------------------------------------------------
TypeAgent> How was Asimov mentioned.
--------------------------------------------------
Asimov was mentioned in the context of discussing the ethical and moral issues surrounding AI development. Adrian Tchaikovsky referenced Asimov's Laws of Robotics, noting that Asimov's stories often highlight the inadequacy of these laws in governing robots.
--------------------------------------------------
TypeAgent> q
$
```

Enjoy exploring!
20 changes: 15 additions & 5 deletions docs/env-vars.md
@@ -11,15 +11,24 @@ Typeagent currently supports two families of environment variables:

## OPENAI environment variables

The (public) OpenAI environment variables include:
The (public) OpenAI environment variables include the following:

### Required:

- `OPENAI_API_KEY`: Your secret API key that you get from the
[OpenAI dashboard](https://platform.openai.com/api-keys).
- `OPENAI_MODEL`: An environment variable introduced by
[TypeChat](https://microsoft.github.io/TypeChat/docs/examples/)
indicating the model to use (e.g. `gpt-4o`).
- `OPENAI_BASE_URL`: **Optional:** The URL for an OpenAI-compatible embedding server, e.g. [Infinity](https://github.com/michaelfeil/infinity). With this option `OPENAI_API_KEY` also needs to be set, but can be any value.
- `OPENAI_ENDPOINT`: **Optional:** The URL for an server compatible with the OpenAI Chat Completions API. Make sure the `OPENAI_MODEL` variable matches with the deployed model name, e.g. 'llama:3.2:1b'

### Optional:

- `OPENAI_BASE_URL`: The URL for an OpenAI-compatible embedding server,
e.g. [Infinity](https://github.com/michaelfeil/infinity). With this
option `OPENAI_API_KEY` also needs to be set, but can be any value.
- `OPENAI_ENDPOINT`: The URL for a server compatible with the OpenAI
Chat Completions API. Make sure the `OPENAI_MODEL` variable matches
the deployed model name, e.g. 'llama:3.2:1b'.
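
For example, a minimal `.env` for the plain OpenAI setup could look like
the following (the key is a placeholder, and the commented-out URLs are
placeholders for self-hosted servers):

```sh
# Required
OPENAI_API_KEY=sk-...your-secret-key...
OPENAI_MODEL=gpt-4o

# Optional -- only when pointing at a self-hosted, OpenAI-compatible server
# OPENAI_BASE_URL=http://localhost:7997
# OPENAI_ENDPOINT=http://localhost:11434/v1
```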

## Azure OpenAI environment variables

@@ -35,12 +44,13 @@ environment variables, starting with:
## Conflicts

If you set both `OPENAI_API_KEY` and `AZURE_OPENAI_API_KEY`,
plain `OPENAI` will win.
`OPENAI_API_KEY` will win.

## Other ways to specify environment variables

It is recommended to put your environment variables in a file named
`.env` in the current or parent directory.
To pick up these variables, call `typeagent.aitools.utils.load_dotenv()`
at the start of your program (before calling any typeagent functions).
(For simplicity this is not shown in [Getting Started](getting-started.md).)
(For simplicity this is not shown in
[Getting Started](getting-started.md).)
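
For example, a minimal startup sketch (the rest of the program is elided):

```py
from typeagent.aitools.utils import load_dotenv

load_dotenv()  # reads .env from the current directory or a parent directory

# ... only after this point should other typeagent functions be called ...
```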
10 changes: 5 additions & 5 deletions docs/getting-started.md
@@ -14,15 +14,15 @@ install wheels from [PyPI](https://pypi.org).

## "Hello world" ingestion program

### 1. Create a text file named `transcript.txt`
### 1. Create a text file named `testdata.txt`

```txt
STEVE We should really make a Python library for Structured RAG.
UMESH Who would be a good person to do the Python library?
GUIDO I volunteer to do the Python library. Give me a few months.
```

### 2. Create a Python file named `demo.py`
### 2. Create a Python file named `ingest.py`

```py
from typeagent import create_conversation
@@ -48,7 +48,7 @@ def read_messages(filename) -> list[TranscriptMessage]:

async def main():
conversation = await create_conversation("demo.db", TranscriptMessage)
messages = read_messages("transcript.txt")
messages = read_messages("testdata.txt")
print(f"Indexing {len(messages)} messages...")
results = await conversation.add_messages_with_indexing(messages)
print(f"Indexed {results.messages_added} messages.")
@@ -77,7 +77,7 @@ Azure-hosted OpenAI models.
### 4. Run your program

```sh
$ python demo.py
$ python ingest.py
```

Expected output looks like:
@@ -86,7 +86,7 @@ Expected output looks like:
0.027s -- Using OpenAI
Indexing 3 messages...
Indexed 3 messages.
Got 26 semantic refs.
Got 24 semantic refs.
```

## "Hello world" query program