
train a model on large dataset with gitlab-ci.yaml #20

@niqbal996

Hi,

First of all, thank you for the nice tools you have developed. I am trying to set up an ML training workflow with GitLab, CML and DVC, using MinIO storage as the remote backup where my training dataset is stored. My .gitlab-ci.yaml looks like this:

stages:
  - cml_run
cml:
  stage: cml_run
  image: dvcorg/cml:0-dvc2-base1-gpu

  script:
  - echo 'Hi from CML' >> report.md
  - apt-get update && apt-get install -y python3-opencv
  - pip3 install -r requirements.txt
  - dvc remote add -d minio_data s3://bucket/dataset/
  - dvc remote modify minio_data endpointurl http://<MINIOSERVER_IP_ADDRESS>:9000
  - dvc remote modify minio_data use_ssl False
  - export AWS_ACCESS_KEY_ID="xxxxxxx"
  - export AWS_SECRET_ACCESS_KEY="xxxxxxx"
  - dvc pull -r minio_data
  - python main.py
  - cml-send-comment report.md --repo=https://<my_gitlab_repo_url>

My setup is configured as follows:

  • A self-hosted GitLab runner listening for jobs (works; Ubuntu 20.04, 2 x RTX 3070 GPUs).
  • A MinIO S3 storage server configured as the DVC remote for local backup (works with my credentials).
  • A training script (works).

The workflow works: I am able to queue jobs and train my model on the runner. However, I have the following issues with it (maybe there is a better way to do this, hence I am here asking for directions):

  1. For each training job, the entire dataset is pulled from the remote before the model is trained, which is really slow. I need to keep using DVC for data versioning, but is there a way to skip the dvc pull -r minio_data step on every job and reuse the same data across training jobs? (Maybe by mounting volumes into the Docker container? I put a rough sketch of what I mean right after this list.)
  2. For MinIO authentication, I do not want to put my credentials (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) in the .gitlab-ci.yaml file, especially since more than one person may use this workflow to queue training jobs in a collaborative environment. What other options do I have? (See the second sketch below.)
  3. Is there a way to set up a local container image cache for the runner (and this workflow) where I can keep all the necessary Docker images and use them directly, instead of installing dependencies in the job script like I do now, and let Docker handle it? (Rough sketch in the P.S. at the end.)
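
For (1), this is what I have in mind (untested sketch, only the relevant part of the script shown; it assumes the runner uses the Docker executor and that a host directory such as /data/dvc-cache is mounted into job containers via the runner's volumes setting in config.toml — the path is just a placeholder):

cml:
  stage: cml_run
  image: dvcorg/cml:0-dvc2-base1-gpu
  script:
    # Point DVC at a cache directory that lives on the runner host
    # (assumes /data/dvc-cache is mounted into the job container).
    - dvc config cache.dir /data/dvc-cache
    # Link files out of the shared cache instead of copying them.
    - dvc config cache.type symlink,copy
    # Now dvc pull only downloads objects missing from the shared cache.
    - dvc pull -r minio_data
    - python main.py

Would sharing the DVC cache between jobs like this be reasonable, or is there a recommended pattern for it?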

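For (2), I was thinking of moving the credentials into masked GitLab CI/CD variables (Settings > CI/CD > Variables) so they never appear in the repository, roughly like this (sketch; the variable names MINIO_ACCESS_KEY and MINIO_SECRET_KEY are my own choice):

cml:
  stage: cml_run
  image: dvcorg/cml:0-dvc2-base1-gpu
  script:
    # MINIO_ACCESS_KEY and MINIO_SECRET_KEY are masked project-level
    # CI/CD variables, so no secrets live in .gitlab-ci.yaml.
    - export AWS_ACCESS_KEY_ID="$MINIO_ACCESS_KEY"
    - export AWS_SECRET_ACCESS_KEY="$MINIO_SECRET_KEY"
    - dvc pull -r minio_data

Or would naming the CI/CD variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY directly, and dropping the exports, be the cleaner option?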
Any feedback or suggestions would be appreciated. Thank you.
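
P.S. Regarding (3), the direction I am considering is to bake the dependencies into an image once, push it to the project's container registry, and have the training job just pull that image. Rough, untested sketch below; it assumes the instance's container registry is enabled, a Dockerfile in the repo based on dvcorg/cml:0-dvc2-base1-gpu that installs python3-opencv and requirements.txt, and a runner that can run Docker-in-Docker for the build job:

stages:
  - build_image
  - cml_run

build_image:
  stage: build_image
  image: docker:20.10
  services:
    - docker:20.10-dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/trainer:latest" .
    - docker push "$CI_REGISTRY_IMAGE/trainer:latest"

cml:
  stage: cml_run
  # All apt/pip dependencies are already baked into this image, so the job
  # goes straight to dvc pull and training.
  image: $CI_REGISTRY_IMAGE/trainer:latest

Combined with pull_policy = "if-not-present" in the runner's config.toml, I would expect the runner to keep reusing the locally cached image instead of re-downloading it. Does that match how you would set it up?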
