Conversation

@revit13 revit13 commented Aug 12, 2021

Signed-off-by: Revital Sur <eres@il.ibm.com>

Signed-off-by: Revital Sur <eres@il.ibm.com>
@revit13 revit13 marked this pull request as draft August 12, 2021 10:19
Signed-off-by: Revital Sur <eres@il.ibm.com>
@revit13 revit13 changed the title Update tests Update plotter to use capabilities in template list and avoid api Aug 12, 2021
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
name: ghcr.io/mesh-for-data/arrow-flight-config-module:0.1.0
capabilities:
- capability: read
scope: cluster

scope: workload

Author

I copied it from arrow-flight-config-module.yaml. I will change it there also. Thanks

Collaborator

@simanadler simanadler Aug 15, 2021

@revit13 Scenario 3 describes a multi-tenant flight server. The scope should be cluster.

@simanadler Scenario 3 describes a config module for a multi-tenant server. Its scope is workload as it configures the server per workload. We do not deploy the server itself in the plotter/blueprint so the cluster scope is irrelevant.

Author

@revit13 revit13 Aug 15, 2021

@shlomitk1 If we have a pre-installed flight service running in one cluster and we use it from different clusters in the same application (reading datasets that reside on different clusters; I am not sure this is a good example), then in each cluster we will need to deploy the module which configures the flight server to read the dataset. I am trying to see if such an example explains the cluster scope...

name: ghcr.io/mesh-for-data/m4d-arrow-flight-transform-conf:0.1.0
capabilities:
- capability: transform
scope: asset

@shlomitk1 shlomitk1 Aug 12, 2021

scope: workload
I think the scope should be per template (the entire construct of a service + plugins). Plugins are always per service.

Author

I copied it from the transform-config-module.yaml spec. Should it be changed there too? IIUC the enablement of a plugin is per service, but applying them is per asset. @simanadler

Applying plugins is beyond the scope. It is done internally by the service.

Author

@revit13 revit13 Aug 15, 2021

@shlomitk1 From what I understood applying the plugin is done by the blueprint-controller.

The blueprint controller deploys the plugins. If 10 different assets need to have a column redacted, the redact plugin will be deployed only once and called 10 times by the service. A config module will contain the specifics per asset (action to perform, arguments, ...).
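A hedged sketch of what such a per-asset config-module payload might look like; the field names (`assets`, `actions`, `columns`) are illustrative assumptions, not the actual schema used in this repo:

```yaml
# Hypothetical per-asset specifics passed via the config module.
# The redact plugin itself is deployed once by the blueprint controller;
# the service consults this configuration for each asset it serves.
assets:
  - assetId: "m4d-notebook-sample/paysim"
    actions:
      - action: redact          # action to perform
        columns: ["nameOrig"]   # arguments for the action
```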

Author

@shlomitk1 Regarding "A config module will contain the specifics per asset (action to perform, arguments, ...)": what should the config module be in this case? Is it what we send to arrow-flight-config-module? Thanks

Author

@revit13 revit13 Aug 16, 2021

Can we assume the following is what scope means?

Scope of service/deployment module:

Asset - deployment per asset
Cluster - deployment per cluster
Workload - deployment per workload

Scope of config module depends on the scope of the service it configures: if the service is per asset, then the config module will be per asset; otherwise the config module will be per workload for each cluster (assuming we do not do cross-workload deployment of config modules).

plugin: per service.
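As a sketch, the proposed semantics could be written into module specs like this (the layout follows the `capabilities` snippets elsewhere in this PR; treating a config module's scope as inherited from the service it configures is the assumption under discussion, not settled schema):

```yaml
# Deployment module: scope determines how many instances are deployed.
kind: M4DModule
capabilities:
  - capability: read
    scope: workload   # one instance per workload
---
# Config module for a per-asset service: inherits the asset scope.
kind: M4DModule
capabilities:
  - capability: transform
    scope: asset
```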

Collaborator

Can we assume the following is what scope means?

Scope of service/deployment module:

Asset - deployment per asset
Cluster - deployment per cluster
Workload - deployment per workload

Yes, agreed

Scope of config module depends on the scope of the service it configures: if the service is per asset, then the config module will be per asset; otherwise the config module will be per workload for each cluster (assuming we do not do cross-workload deployment of config modules).

So you are saying that a config module will never be used across workloads. I think it makes sense, certainly for now. We may want to revisit it in the future.

plugin: per service.

Makes sense. @roee88 @cdoron Please confirm

object: inventory.parq
dataformat: parquet
- assetId: "m4d-notebook-sample/paysim-step2" # the sink of step 2, consumed by step 1
cluster: thegreendragon

cluster is per step or flow, not per asset.

Author

I followed what was there before... @simanadler

Collaborator

@revit13 @shlomitk1 Originally there was no cluster in the assets. @revit13 I believe you suggested adding it as an indication of the cluster in which the data resides. In the steps, the cluster is used to determine where the module should run. Same term, different meanings. Maybe in assets we should call it storageCluster?

@simanadler For blueprints it is crucial to know the deployment cluster. But why would we need to know where an asset is located? It can be on some server outside the deployed clusters, for all we know.

Owner

I agree with @shlomitk1 here. The storage cluster is nice-to-have information in the plotter for debugging, and the asset may not be in any cluster (or may be cross-regional, e.g. COS). This information is crucial for the optimizer etc., which should already be done at this stage.

Collaborator

@shlomitk1 and @froesef I agree with you

Author

OK, I will remove the cluster from the asset list.

Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Collaborator

@simanadler simanadler left a comment

See my comments

parameters:
source:
assetId: "m4d-notebook-sample/paysim"
api:
Collaborator

Not clear why you are removing the api from the plotter. We removed it from the blueprint because it's not necessary for deployment. However, the plotter should contain everything needed to understand the flow of data between the components. We need to include how the data may be accessed/written (data source, api, ...).

Owner

@simanadler can you please elaborate why it's not needed for the deployment? The deployment needs to know the protocol of the service that e.g. a module exposes. (Capabilities in a module are an array and thus multiple different interfaces could be supported by one module)

The deployment needs to know the chart image and the values to use. This information is passed via module arguments.

Owner

Agreed. A chart and its values are in the end what is needed for the deployment. My question is: where are these values (that say it's an arrow-flight service) coming from? If they are not defined in the plotter, how should they be put into the Helm chart?

These values have been defined in the asset list as a new type of connection. According to this approach, the arrow-flight service does not declare how it can be accessed; rather, any module that needs to access the arrow-flight service defines what it needs: "data source: format = arrow, protocol = arrow-flight, endpoint = arrow-flight-service:80, assetID = user-asset".
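A sketch of such a connection entry in the asset list, using the field names from the `arrow-flight` snippets in this PR; the exact schema is an assumption:

```yaml
- assetId: "user-asset"
  connection:
    arrow-flight:
      protocol: arrow-flight              # how the data may be accessed
      dataformat: arrow
      endpoint: arrow-flight-service:80   # declared by the consuming module
```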

capabilities:
- capability: read
scope: cluster
scope: workload
Collaborator

@revit13 This config module is configuring a cluster-level service. That's why its scope was cluster. It does of course pass data to that service that is for a particular workload. @froesef @roee88 Which do you think it should be?

Author

After rethinking that, I deleted this change... IIUC the config module is deployed per cluster.

As I understand it, cluster scope means that something is deployed once per cluster, i.e. cross-workloads. This is not our case.

Collaborator

@shlomitk1 It's not now. The question is should it be? If we have 15k workloads do we really need 15k of these config modules?

@simanadler We need to configure the service, and the configuration differs from one workload to another. It is not a "global" service that can be accessed by the workload.

cluster: thegreendragon
connection:
arrow-flight:
endpoint: app1-ns1-arrow-transform.m4d-blueprints
Collaborator

@revit13 Do we know what the endpoint will be prior to deploying the blueprints? The endpoint will differ per cluster I assume. I don't see an indication of that in the example.

We know the endpoint prior to deploying the blueprint. It is based on 1) cluster 2) release name 3) module specifics defined in FybrikModule.
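A minimal sketch of that derivation, assuming a Kubernetes-style in-cluster DNS name of the form `<release>-<service>.<namespace>`; the function name, naming scheme, and inputs are illustrative, not the actual Fybrik code:

```python
def module_endpoint(release_name: str, service_name: str,
                    namespace: str = "m4d-blueprints") -> str:
    """Derive a module's in-cluster endpoint from the blueprint release
    name and the service name declared in the FybrikModule
    (hypothetical naming scheme)."""
    return f"{release_name}-{service_name}.{namespace}"

# e.g. release "app1-ns1" + service "arrow-transform" yields the
# endpoint used in this plotter example:
endpoint = module_endpoint("app1-ns1", "arrow-transform")
# -> "app1-ns1-arrow-transform.m4d-blueprints"
```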

Signed-off-by: Revital Sur <eres@il.ibm.com>
kind: M4DModule
chart:
name: ghcr.io/mesh-for-data/m4d-arrow-flight-transform-conf:0.1.0
capabilities:
Owner

Why do we need to add the capabilities here? Capabilities are used in the module decision process which is finished at this stage because this is the output of it.

We need to know the scope in order to determine the number of instances to deploy. We need to know the context (this module is used for read although it can also write) in order to construct the arguments: source for the read module, sink for the write, etc.

Owner

If we're using this to choose which capability of a module to use, can this field be a struct rather than an array?

Yes, it should be a struct.
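The array-to-struct change can be sketched as follows; the `capability`/`scope` field names come from the snippets in this PR, and the struct form is the proposed shape, not the current schema:

```yaml
# Before: an array leaves the plotter's chosen capability ambiguous.
capabilities:
  - capability: read
    scope: workload
  - capability: write
    scope: workload
---
# After: a single struct records exactly which capability this step uses.
capability:
  capability: read
  scope: workload
```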

Author

If a chart has multiple capabilities and different capabilities are used in different steps, then the chart will appear several times, each time with the capability that the step needs. Is that correct?

Yes

parameters:
source:
assetId: "m4d-notebook-sample/inventory"
api:
Owner

Why is the api field removed? Where in this current plotter is defined that the read service should offer the given asset using the arrow-flight protocol with a given asset id?

The read access here is done by the workload. Nothing is done in the deployment.
In a different scenario where another module reads the data from the read service, the module gets this information in its params (e.g. source).

arrow-flight:
endpoint: app1-ns1-arrow-transform.m4d-blueprints
assetId: "m4d-notebook-sample/paysim" # always the same as the assetId known to the user (assetId or advertisedAssetId)
- assetId: "m4d-notebook-sample/paysim-step1" # used by workload to read the data
Owner

I think having all the intermediate "virtual" assets in here is misleading. These are not assets that are specified by the user. Also, in order to know what the steps are really doing, one has to look up the assets at the top of the plotter. I specifically wanted to define everything within the step because it improves readability from the step's point of view.

revit13 and others added 4 commits August 17, 2021 10:22
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Florian Froese <ffr@zurich.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
bucket: srcbucket
object: paysim.parq
dataformat: parquet
actions:
Owner

Can you please add a comment on how the read module should know that the ghcr.io/mesh-for-data/m4d-arrow-flight-transform-conf:0.1.0 chart was chosen to configure this action as a plugin?

Signed-off-by: Revital Sur <eres@il.ibm.com>