[Incremental Learning] Support training/evaluation on different nodes #7

@llhuii

Description

PlantUML sequence diagram

@startuml
'https://plantuml.com/sequence-diagram

'autonumber
actor User
participant  "K8S API" as API
participant GM

participant  "LC at dataset-node" as LC0
participant  "LC at train-node" as LC1

participant  "LC at eval-node" as LC2



GM -> API: list/watch dataset / incremental job

User -> API: Create a dataset with:\n1. an S3 URL\n2. nodeName: dataset-node
API --> User:

GM -> LC0 : sync the dataset info to the LC located in dataset-node
LC0 -> LC0 : monitor the dataset and update the dataset's status

User -> API: Create an incremental job with:\n \
              1. train worker spec with train-node\n \
              2. eval worker spec with eval-node\n \
              3. infer worker spec with nodeSelector 
API --> User:

API --> GM: watched new job

GM -> API: create infer-worker

loop incremental training
  GM -> API: set the job state to train-waiting
  GM -> LC0: sync the job info
  loop train-trigger is not satisfied
  LC0 -> LC0: append any new incremental samples to the job
  end

  LC0 -> GM: triggered, transition the job state to train-ready
  GM -> API: create train-worker
  GM -> LC1: sync the job info
  LC1 -> GM: get the message from train-worker, \ntransition the job state to eval-ready

  GM -> API: create eval-worker
  GM -> LC2: sync the job info

  LC2 -> LC2: handle eval result:
  alt deploy-trigger is satisfied
    LC2 -> LC2: update the deploy model,\ntransition the job state to deploy-ready
    GM -> API: restart the infer-worker (cold model-update)

  else not satisfied
    LC2 -> GM: transition the job state to no-deploy
  end
end

@enduml
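
The job-state flow in the diagram above can be read as a small state machine. The following is only a minimal Python sketch of that reading, not the actual GM/LC implementation: the state names come from the diagram, while `JobState`, `TRANSITIONS`, `LC_ROLE`, `advance`, `lc_node_for`, `run_round`, and the `nodes` dict keys are hypothetical names introduced for illustration. It assumes that each round returns to train-waiting, and that the GM syncs the job info to the LC of the dataset, train, or eval node depending on the current state.

```python
from enum import Enum


class JobState(Enum):
    """Job states named in the sequence diagram."""
    TRAIN_WAITING = "train-waiting"
    TRAIN_READY = "train-ready"
    EVAL_READY = "eval-ready"
    DEPLOY_READY = "deploy-ready"
    NO_DEPLOY = "no-deploy"


# Allowed transitions, read off the diagram. Going back to train-waiting
# after a round models the outer "incremental training" loop (assumption).
TRANSITIONS = {
    JobState.TRAIN_WAITING: {JobState.TRAIN_READY},
    JobState.TRAIN_READY: {JobState.EVAL_READY},
    JobState.EVAL_READY: {JobState.DEPLOY_READY, JobState.NO_DEPLOY},
    JobState.DEPLOY_READY: {JobState.TRAIN_WAITING},
    JobState.NO_DEPLOY: {JobState.TRAIN_WAITING},
}

# Which node's LC the GM syncs the job info to while the job is in a state.
# The role keys ("dataset", "train", "eval") are hypothetical.
LC_ROLE = {
    JobState.TRAIN_WAITING: "dataset",  # LC at dataset-node watches the train trigger
    JobState.TRAIN_READY: "train",      # LC at train-node relays the train-worker message
    JobState.EVAL_READY: "eval",        # LC at eval-node handles the eval result
}


def advance(current, target):
    """Return `target` if the diagram allows current -> target, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def lc_node_for(state, nodes):
    """Return the node whose LC handles the job in `state`, or None."""
    role = LC_ROLE.get(state)
    return nodes[role] if role is not None else None


def run_round(deploy_trigger_satisfied):
    """Walk one round of the loop and return the states visited, in order."""
    state = JobState.TRAIN_WAITING
    visited = [state]
    for target in (JobState.TRAIN_READY, JobState.EVAL_READY):
        state = advance(state, target)
        visited.append(state)
    final = JobState.DEPLOY_READY if deploy_trigger_satisfied else JobState.NO_DEPLOY
    visited.append(advance(state, final))
    return visited


if __name__ == "__main__":
    nodes = {"dataset": "dataset-node", "train": "train-node", "eval": "eval-node"}
    for s in run_round(deploy_trigger_satisfied=True):
        print(s.value, "->", lc_node_for(s, nodes))
```

Keeping the state-to-node mapping in one table makes the point of this issue explicit: the dataset, train, and eval phases may each be handled by an LC on a different node, and the GM only needs the current state to know where to sync.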
