Conversation
```python
:param schema: The schema format of the dataset. One of
    - jsonl_messages: The dataset is a JSONL file with messages in column format
:param uri: The URI of the dataset. Examples:
    - file://mydata.jsonl
```
when would file:// make sense? when the server is running on the same machine as client? i'd rather not have that.
I believe the idea was to use file:// to indicate that the dataset will be coming from a File object uploaded via file API.
- `file://<file-id>` -> loads data from the local file system via file ID
- `huggingface://<dataset-path>` -> loads data from Hugging Face
The other alternative is, instead of a single uri, to have different Data classes to specify the different ways we can create a dataset.
```python
from enum import Enum
from typing import Annotated, Any, Dict, Literal, Optional, Union

from pydantic import BaseModel, Field


class DatasetType(Enum):
    huggingface = "huggingface"
    file = "file"


class HuggingfaceData(BaseModel):
    type: Literal[DatasetType.huggingface.value]
    dataset_id: str
    split: str


class FileData(BaseModel):
    type: Literal[DatasetType.file.value]
    file_id: str


Data = Annotated[
    Union[FileData, HuggingfaceData],
    Field(discriminator="type"),
]


async def register_dataset(
    self,
    schema: Schema,
    data: Data,
    metadata: Optional[Dict[str, Any]] = None,
    dataset_id: Optional[str] = None,
) -> Dataset:
```
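As an aside, the dispatch behavior that `Field(discriminator="type")` provides can be sketched without pydantic. This is a stdlib-only illustration of the idea, not the actual implementation; `parse_data` and `_DISPATCH` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class HuggingfaceData:
    dataset_id: str
    split: str


@dataclass
class FileData:
    file_id: str


# Hypothetical dispatch table: the "type" tag selects the concrete class.
_DISPATCH = {
    "huggingface": HuggingfaceData,
    "file": FileData,
}


def parse_data(payload: Dict[str, Any]) -> Union[HuggingfaceData, FileData]:
    """Mimic a discriminated union: route on the tag, then construct."""
    tag = payload.get("type")
    if tag not in _DISPATCH:
        raise ValueError(f"unknown data source type: {tag!r}")
    fields = {k: v for k, v in payload.items() if k != "type"}
    return _DISPATCH[tag](**fields)
```

Pydantic's discriminated union does the same routing (plus field validation) based on the `type` field of the incoming payload.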
Synced with @hardikjshah offline. Going with the Union way, with `data_reference` renamed to `data_source`.
maybe just `source`, drop the `data_` prefix? This is inside of a dataset anyways
I don't think we should have file:// as an allowed URI. if you want local files, you can calculate rows yourself and pass as rows. Otherwise you are relying on the client-SDK faithfully obeying this API (and we will not be able to keep this in sync on all clients without additional effort)
> I don't think we should have file:// as an allowed URI.

@ashwinb Added example URIs that are allowed:
- "https://mywebsite.com/mydata.jsonl"
- "lsfs://mydata.jsonl" --> a file URL
- "data:csv;base64,{base64_content}"
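The `data:csv;base64,...` example above can be produced and consumed with the stdlib. The helper names here are illustrative, not part of the API:

```python
import base64


def make_data_uri(csv_text: str) -> str:
    """Pack raw CSV content into a data: URI like the example above."""
    b64 = base64.b64encode(csv_text.encode("utf-8")).decode("ascii")
    return f"data:csv;base64,{b64}"


def read_data_uri(uri: str) -> str:
    """Recover the CSV content from a data:csv;base64,... URI."""
    prefix = "data:csv;base64,"
    if not uri.startswith(prefix):
        raise ValueError("not a base64 CSV data URI")
    return base64.b64decode(uri[len(prefix):]).decode("utf-8")
```

This keeps the dataset payload entirely inside the request, so the server never has to reach back to the client's file system.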
```python
class Schema(Enum):
    """
    Schema of the dataset. Each type has a different column format.

    :cvar messages: The dataset contains messages used for post-training. Examples:
```
A bit confused here. Does this dictate what the format of the data source should be? e.g. can I supply a CSV file as a `uri`-type data source?
The `Schema` defines the column names and formats. For example:

- The `messages` schema indicates a dataset whose rows have a "messages" column of `List[Message]`. These datasets can be used as input for finetuning:

```python
[
    {"messages": List[Message]}
]
```

- The `chat_completion_eval` schema consists of a "messages" column and an "expected_answer" column. The "messages" column is used as input to generate "generated_answer", and the "expected_answer" column is used for scoring and computing eval metrics.
The reason we need schema is to validate that the supplied dataset can be correctly parsed by different methods: https://github.com/meta-llama/llama-stack/blob/00da911167c5f4871ce24bfdfeb554f8488001ae/llama_stack/providers/utils/common/data_schema_validator.py#L28-L53
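The linked validator boils down to checking that every row carries the columns its schema requires. A minimal sketch of that idea, with assumed column sets (the real validator in `data_schema_validator.py` also checks column types):

```python
from typing import Any, Dict, List

# Assumed required columns per schema, mirroring the examples above.
EXPECTED_COLUMNS = {
    "messages": {"messages"},
    "chat_completion_eval": {"messages", "expected_answer"},
}


def validate_rows(schema: str, rows: List[Dict[str, Any]]) -> None:
    """Raise if any row lacks a column the schema requires."""
    required = EXPECTED_COLUMNS[schema]
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            raise ValueError(
                f"row {i} is missing columns {sorted(missing)} for schema {schema!r}"
            )
```

Validating at registration time means a dataset that parses cleanly here can be handed to finetuning or eval without each consumer re-checking the shape.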
The `DataSource` represents where the data can be loaded from. A csv file can be supplied in 3 ways:

1. base64 encoded in the `uri`:

```python
URIDataSource(
    uri="data:csv;base64,{base64_content}",
)
```

2. upload the csv file via the `/files` API defined in https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/files/files.py

3. read the rows in the csv file and register the dataset via:

```python
client.datasets.create(
    data_reference={
        "type": "rows",
        "rows": [
            {"messages": [...]},
        ],
    },
    schema="jsonl_messages",
)
```
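For option 3, turning a csv file into the `rows` payload is a short stdlib exercise; the helper name below is illustrative:

```python
import csv
import io


def csv_to_rows(csv_text: str):
    """Parse raw CSV text into a list of dicts, one per data row,
    keyed by the header line -- the shape the rows payload expects."""
    return list(csv.DictReader(io.StringIO(csv_text)))
```

The resulting list can be passed directly as the `"rows"` value in the registration call above.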
Synced with @ehhuang offline. `purpose` might be a better name than `schema` and less confusing for users, since we could infer the schema from the dataset. cc @hardikjshah for thoughts.
Discussed offline:
- Clarification: The purpose of the `schema` parameter is to validate that the input data has the expected format.
- Perhaps a more apt name would be `purpose`, where in the documentation we lay out the expected schema for each `purpose`, and in the implementation of `create` we validate the input data against the expected schema.
What are all the valid schema names that we will have, given what we know right now?
- jsonl with each line being `{"messages": []}` ( tuning/messages )
- jsonl with each line being `{"text": ...}` ( tuning/text )
- eval with a csv ( query / expected / output ) ( eval/question_answer )

Maybe we use this api/purpose style to name the various schemas?
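The api/purpose naming proposed above could be sketched as an enum paired with the expected columns for each purpose. The enum members and column sets here are assumptions drawn only from the three examples listed, not a confirmed design:

```python
from enum import Enum


class DatasetPurpose(Enum):
    # Names follow the proposed "api/purpose" style.
    tuning_messages = "tuning/messages"
    tuning_text = "tuning/text"
    eval_question_answer = "eval/question_answer"


# Assumed expected columns per purpose, from the examples above.
EXPECTED_SCHEMA = {
    DatasetPurpose.tuning_messages: {"messages"},
    DatasetPurpose.tuning_text: {"text"},
    DatasetPurpose.eval_question_answer: {"query", "expected", "output"},
}
```

One benefit of this shape: lookup by string value (`DatasetPurpose("tuning/messages")`) gives a single place to reject unknown purposes at registration time.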
# What does this PR do?
- as title
- uses "cursor" pagination scheme for iterrows

## Test Plan
<img width="1226" alt="image" src="https://github.com/user-attachments/assets/3220eaac-7117-4d0a-b344-2bbb77a22065" />

…#1657)

# What does this PR do?
- We need to tag the DatasetIO class correctly with Datasets with the endpoint change

## Test Plan
**Before**
<img width="1474" alt="image" src="https://github.com/user-attachments/assets/48737317-28a3-4aa6-a1b5-e1ea680cef84" />

**After**
<img width="1508" alt="image" src="https://github.com/user-attachments/assets/123322f0-a52f-47ee-99a7-ecc66c1b09ec" />

# What does this PR do?
- fix datasets api signature mismatch so that llama stack run can start

## Test Plan
```
llama stack run
```
<img width="626" alt="image" src="https://github.com/user-attachments/assets/59072d1a-ccb6-453a-80e8-d87419896c41" />

# What does this PR do?
- fix dataset registration & iterrows

> NOTE: the URL endpoint is changed to datasetio due to flaky path routing

## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py
```
<img width="854" alt="image" src="https://github.com/user-attachments/assets/0168b352-1c5a-48d1-8e9a-93141d418e54" />

# What does this PR do?
- as title

## Test Plan
CI
```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
<img width="587" alt="image" src="https://github.com/user-attachments/assets/4a25f493-501e-43f4-9836-d9802223a93a" />
```python
@webmethod(route="/datasetio/rows", method="GET")
async def get_rows_paginated(
    # TODO(xiyan): there's a flakiness here where setting route to "/datasets/" here will not result in proper routing
```
nit: this is not flaky (which means it sometimes works vs. sometimes not) -- I think maybe you just mean "wonkiness" or "sadness"
yeah, it actually is the case that this sometimes works and sometimes doesn't if I set the route to /datasets. I'm suspecting it may be due to the way we do topological sort in our resolver.
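One plausible mechanism (a guess, not confirmed from the source): if the resolver ends up doing naive prefix matching in registration order, a shorter route registered first shadows a longer one, and the outcome then depends on registration order. A toy illustration with hypothetical handler names:

```python
def resolve(routes, path):
    """Naive first-match prefix router: registration order decides ties."""
    for prefix, handler in routes:
        if path == prefix or path.startswith(prefix + "/"):
            return handler
    return None


# "/datasets" registered first swallows requests meant for "/datasets/rows".
bad_order = [("/datasets", "datasets_handler"), ("/datasets/rows", "rows_handler")]
# Longest-prefix-first registration avoids the shadowing.
good_order = [("/datasets/rows", "rows_handler"), ("/datasets", "datasets_handler")]
```

If the real resolver's ordering comes from a topological sort whose tie-breaking is nondeterministic, that would explain the sometimes-works behavior.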
```python
    start_index: Optional[int] = None,
    limit: Optional[int] = None,
) -> IterrowsResponse:
    """Get a paginated list of rows from a dataset. Uses cursor-based pagination.
```
I don't think this is using cursor-based pagination because this is just an index we are returning? Maybe just avoid saying this is cursor-based?
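For reference, the index-based scheme being described can be sketched as follows. This is a hypothetical standalone helper, not the actual implementation:

```python
from typing import Any, Dict, List, Optional, Tuple


def iterrows(
    dataset: List[Dict[str, Any]],
    start_index: Optional[int] = None,
    limit: Optional[int] = None,
) -> Tuple[List[Dict[str, Any]], Optional[int]]:
    """Index-based paging: the 'cursor' is just the next start index,
    or None once the dataset is exhausted."""
    start = start_index or 0
    end = len(dataset) if limit is None else min(start + limit, len(dataset))
    next_start_index = end if end < len(dataset) else None
    return dataset[start:end], next_start_index
```

Because the token handed back to the client is a plain integer offset rather than an opaque position marker, calling this "cursor-based" overstates it; true cursors stay stable when rows are inserted mid-scan, while offsets do not.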
```python
    # TODO: type checking against column types in dataset schema
    return df
```

```python
def load(self) -> None:
```
Where do we do validation?
We do validation at the eval API level:
- it will happen here: https://github.com/meta-llama/llama-stack/blob/cc48d9e9e68b4f66c770a4fa55351fcc9027ec6c/llama_stack/providers/inline/eval/meta_reference/eval.py#L87-L88
which will be added with the eval api implementation changes
```python
    next_start_index=end if end < len(dataset_impl) else None,
)
```

```python
async def append_rows(self, dataset_id: str, rows: List[Dict[str, Any]]) -> None:
```
Should `append_rows` take `source`, similarly to dataset registration?
What is the use case where we need append?
I think @dineshyv added the append_rows API which is used internally for storing telemetry logs into a dataset.
But I don't feel like it is useful to be exposed to users, so we can remove it and make it an internal api in the future.
```python
    },
    purpose=DatasetPurpose.eval_messages_answer,
    source=URIDataSource(
        uri="huggingface://datasets/llamastack/simpleqa?split=train",
```
Are we going to put all params in the uri? Or put them in uri_params?
I know we went back and forth on this, but did we decide on the uri route?
We decided on the uri route based on @ashwinb's feedback on not having a separate HuggingfaceDataSource.
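Under the uri route, params like `split` ride along in the query string and the provider parses them out. A sketch with stdlib parsing; the function name is hypothetical and the real provider may parse differently:

```python
from urllib.parse import parse_qs, urlparse


def parse_hf_uri(uri: str):
    """Split a huggingface:// dataset URI into its path and query params."""
    parsed = urlparse(uri)
    if parsed.scheme != "huggingface":
        raise ValueError(f"not a huggingface URI: {uri}")
    # urlparse puts the first path segment in netloc for custom schemes.
    path = parsed.netloc + parsed.path
    params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    return path, params
```

So `huggingface://datasets/llamastack/simpleqa?split=train` yields the dataset path plus `{"split": "train"}`, without needing a separate HuggingfaceDataSource class.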
# What does this PR do?
- sync SDK

## Test Plan
- See llamastack/llama-stack#1573

## PR Stack
- Client SDK
- CI
- Doc
- Client Usage
- Test Plan