forked from kubeedge/sedna
Add pod template like support #2
Pod-template-like support for workers:
current state
The current spec definition of the worker:
```go
type WorkerSpec struct {
	ScriptDir        string     `json:"scriptDir"`
	ScriptBootFile   string     `json:"scriptBootFile"`
	FrameworkType    string     `json:"frameworkType"`
	FrameworkVersion string     `json:"frameworkVersion"`
	Parameters       []ParaSpec `json:"parameters"`
}

// ParaSpec is a description of a parameter
type ParaSpec struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}
```

`ScriptDir`/`ScriptBootFile` is the entrypoint of the worker, either a local path or central storage (e.g. s3). `FrameworkType`/`FrameworkVersion` specifies the base container image of the worker. `Parameters` specifies the environment variables of the worker.
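A minimal sketch of how `Parameters` could be turned into container environment variables. The `envFromParameters` helper and the standalone `ParaSpec` mirror here are hypothetical stand-ins for illustration, not Sedna code:

```go
package main

import "fmt"

// Hypothetical mirror of the ParaSpec type above.
type ParaSpec struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// envFromParameters sketches how worker Parameters could be translated
// into KEY=VALUE environment strings for a container.
func envFromParameters(params []ParaSpec) []string {
	env := make([]string, 0, len(params))
	for _, p := range params {
		env = append(env, fmt.Sprintf("%s=%s", p.Key, p.Value))
	}
	return env
}

func main() {
	params := []ParaSpec{
		{Key: "nms_threshold", Value: "0.6"},
	}
	fmt.Println(envFromParameters(params)) // [nms_threshold=0.6]
}
```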
pros
- simple for the demo
cons
- doesn't support docker-container capabilities: code version management, distribution, etc.
- doesn't support k8s-pod features: resource limits, user-defined volumes, etc.
- needs central storage (e.g. s3) for the code if it is not a local path.
- needs a new base image to be built if the current base image can't satisfy the user's requirements (user-defined code package dependencies, or a new framework), after which the GM configuration must be re-edited and the GM restarted.
proposals: Add pod template support for workers
proposal 1: just pod template
This would deprecate the current `ScriptDir`-based spec.
```go
import v1 "k8s.io/api/core/v1"

type WorkerSpec struct {
	v1.PodTemplateSpec `json:",inline"`
}
```

examples and discussions
joint-inference-service
So in this proposal, the joint-inference example here would become:
```yaml
apiVersion: sedna.io/v1alpha1
kind: JointInferenceService
metadata:
  name: example
spec:
  edgeWorker:
    model:
      name: "small-model"
    nodeName: "edge0"
    hardExampleMining:
      name: "IBT"
    workerSpec:
      containers:
        - image: edge-inference-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: nms_threshold
              value: "0.6"
          ports:  # user defined ports
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            requests:
              memory: 64Mi
              cpu: 100m
            limits:
              memory: 512Mi
          volumeMounts:
            - name: localvideo
              mountPath: /data/
      volumes:  # user defined volumes
        - name: localvideo
          emptyDir: {}
  cloudWorker:
    model:
      name: "big-model"
    nodeName: "solar-corona-cloud"
    workerSpec:
      containers:
        - image: cloud-inference-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: nms_threshold
              value: "0.6"
          ports:  # user defined ports
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            limits:
              memory: 2Gi
```

Some things need to be discussed for the joint inference service:
- Where are the resource limits of the model? Shared with the container resource limits?
- Where is the serving container-side port of the cloudWorker?
- Is the cloudWorker's workerSpec needed? The user may only specify the big model.
federated-learning-job
So in this proposal, the federated-learning example here would become:
```yaml
apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  aggregationWorker:
    model:
      name: "surface-defect-detection-model"
    nodeName: "cloud0"
    # where's the serving port of the aggregator worker?
    workerSpec:
      containers:
        - image: aggregator-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: exit_round
              value: "0.3"
          ports:
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            requests:
              memory: 64Mi
              cpu: 100m
            limits:
              memory: 512Mi
  trainingWorkers:
    - nodeName: "edge1"
      dataset:
        name: "edge-1-surface-defect-detection-dataset"
      workerSpec:
        containers:
          - image: training-worker:latest
            imagePullPolicy: Always
            env:  # user defined environments
              - name: batch_size
                value: "0.3"
              - name: learning_rate
                value: "0.001"
              - name: epochs
                value: "1"
            resources:  # user defined resources
              requests:
                memory: 64Mi
                cpu: 100m
              limits:
                memory: 512Mi
    - nodeName: "edge2"
      dataset:
        name: "edge-2-surface-defect-detection-dataset"
      workerSpec:
        containers:
          - image: training-worker:latest
            imagePullPolicy: Always
            env:  # user defined environments
              - name: batch_size
                value: "0.3"
              - name: learning_rate
                value: "0.001"
              - name: epochs
                value: "1"
            resources:  # user defined resources
              requests:
                memory: 64Mi
                cpu: 100m
              limits:
                memory: 512Mi
```

incremental-learning-job
The common problem:
- finding a good way to write the OpenAPI schema of the CRD, since PodSpec has a lot of fields.
deployment support
Using the features of Deployment:
- replica pods in case of pod failure
Alternative: using a ReplicaSet.

```go
type DeploymentSpec struct {
	Replicas *int32     `json:"replicas,omitempty"`
	Template WorkerSpec `json:"template"`
	// etc.
}
```

daemonset support
use case:
- running the training worker of federated learning on every node of a group.

```go
type DaemonsetSpec struct {
	Selector *metav1.LabelSelector `json:"selector"`
	Template WorkerSpec            `json:"template"`
	// etc.
}
```
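A minimal sketch of how a daemonset-style controller could use the selector to pick the nodes of a group, assuming plain matchLabels semantics (the real `metav1.LabelSelector` also supports matchExpressions). The `LabelSelector` stand-in and `selectorMatches` helper here are hypothetical, not Sedna code:

```go
package main

import "fmt"

// Stand-in for metav1.LabelSelector, restricted to matchLabels.
type LabelSelector struct {
	MatchLabels map[string]string
}

// selectorMatches reports whether a node's labels satisfy the selector:
// every matchLabels entry must be present with the same value.
func selectorMatches(sel LabelSelector, nodeLabels map[string]string) bool {
	for k, v := range sel.MatchLabels {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	sel := LabelSelector{MatchLabels: map[string]string{"group": "edge-training"}}
	nodes := map[string]map[string]string{
		"edge1":  {"group": "edge-training"},
		"edge2":  {"group": "edge-training"},
		"cloud0": {"group": "cloud"},
	}
	for name, labels := range nodes {
		if selectorMatches(sel, labels) {
			fmt.Println("schedule training worker on", name)
		}
	}
}
```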