// Source: plugin.proto, master branch of naveego/code-challenge-plugin (GitHub).

syntax = "proto3";
package plugin;

// The Plugin service is implemented by plugins which are responsible for discovering
// and publishing data. In this challenge the plugin will target CSV files, but Naveego
// writes plugins against many different kinds of data sources.
//
// One difficulty in creating a unified interface for interacting with multiple data sources
// is that the configuration settings needed to connect to data sources can differ radically.
// Naveego handles this by having each plugin define a JSONSchema for its settings,
// then dynamically rendering a form based on that schema which the user is asked to fill in.
// The resulting JSON object is then sent to the plugin for validation. In this challenge,
// we'll keep things simple and just hardcode the settings the plugin needs.
//
// Another complication is that data sources differ in the degree to which their schemas
// are discoverable. A SQL database usually provides metadata about the structure of tables,
// but a NoSQL database may not have any structure defined, and a collection of CSV files
// may provide a minimal structure by having headers, but with no type information.
// We handle that variation by providing a UI where a user can strongly specify the available
// schemas in collaboration with the plugin. The user provides some basic connection or
// selection information - such as a database name or a file path - and the plugin uses that
// to do its best to discover the schemas of the available data. We refer to this as the "discovery"
// phase. The user can then look at the discovered schemas, usually along with a sample of the
// data from the data source, and can assign types (like number, date, or boolean) to the
// properties of the schema, as well as annotating the properties with clear names and descriptions.
// The schemas authored by technical users are then made available to business users who can use
// them to construct their business-level data collection and merge flows.
//
// To help users, we'd like to be able to heuristically infer the types of properties that are
// discovered in data sources which do not provide strong type information. If you find
// implementing a passing solution not challenging enough, try adding some type inference logic
// which will examine the values in the data source in order to specify the types without user
// intervention. This may be tricky, since some records may have invalid data. If you've provided
// inferred types during discovery, you can mark records with bad data as invalid, but you
// should still publish the record.

service Plugin {
    // The Discover method is responsible for taking the provided settings
    // and using them to find and describe all the schemas which the settings make available.
    // In this case, the plugin will look for CSV files which match a pattern.
    rpc Discover (DiscoverRequest) returns (DiscoverResponse) {
    }

    // The Publish method is responsible for collecting all the records
    // which belong to a single schema and streaming them back to the host.
    // The schema which is passed in will be one of the schemas returned by the
    // Discover method, so you can share data between Discover and Publish by
    // means of the `settings` string on the Schema message.
    rpc Publish (PublishRequest) returns (stream PublishRecord) {
    }
}

// The request message containing the settings the plugin should use for discovery.
message DiscoverRequest {
    // In a real plugin the settings would be conveyed in a JSON object
    // with a schema defined by the plugin and populated by a user through a UI.
    // For this challenge we'll define a message for the settings
    // to make it easier for the host to create the settings.
    Settings settings = 1;
}

message Settings {
    // This is a glob specifying the pattern to use to find files.
    // It will be an absolute path, something like /src/data/*/*.csv.
    // The plugin should find all files matching the pattern, then
    // analyze them to find the unique schemas among them (multiple files
    // may have the same schema).
    //
    // For this challenge you can assume that all CSV files have a header row,
    // and that all files are comma-delimited.
    string fileGlob = 1;
}
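
// Illustrative sketch only: a DiscoverRequest written in protobuf text format,
// reusing the example glob from the comment above. The host may construct the
// message in any way it likes; this only shows the expected shape.
//
//   settings {
//     fileGlob: "/src/data/*/*.csv"
//   }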

message DiscoverResponse {
    // Array of schemas discovered.
    repeated Schema schemas = 1;
}

message Schema {
    // The unique name of the schema; if there is no unique name
    // the plugin can generate one.
    string name = 1;
    // The settings the plugin will use for publishing when
    // this schema is included in a PublishRequest. This
    // can be any data the plugin wants to capture; the host
    // will treat it as an opaque blob.
    // Hint: this is a good place to store the file paths of all the
    // files which contain records with this schema.
    string settings = 2;
    // Array of the properties discovered for this schema,
    // in the order they have in the CSV.
    repeated Property properties = 3;
}

message Property {
    // Name of the property, from the column header.
    string name = 1;
    // Type of the property; can be "string", "integer", "number", "datetime", or "boolean".
    // This should be inferred if possible by analyzing the data.
    // This is an optional part of the challenge; you can pass the tests
    // without populating this field.
    string type = 2;
}
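
// Illustrative sketch only: a DiscoverResponse in protobuf text format for two
// CSV files that share the same header row. The schema name, the file paths,
// the column names, and the JSON layout of `settings` are all assumptions made
// for this example; the host treats `settings` as opaque, so the plugin is free
// to choose any encoding it can read back in Publish.
//
//   schemas {
//     name: "people"
//     settings: "[\"/src/data/2019/people.csv\",\"/src/data/2020/people.csv\"]"
//     properties { name: "id"     type: "integer" }
//     properties { name: "state"  type: "string"  }
//     properties { name: "active" type: "boolean" }
//   }
//
// If the plugin does not attempt type inference, `type` can simply be left unset.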

message PublishRequest {
    // The settings will be the same as the settings sent to the Discover method.
    Settings settings = 1;
    // The schema will be one of the schemas returned by the Discover method.
    Schema schema = 2;
}

message PublishRecord {
    // This should be set to true if the record is not valid
    // because it violates the inferred schema in some way.
    // The invalid property should be written as `null` in data.
    bool invalid = 1;

    // If the record is invalid this field should explain why.
    // This should include the property name and the original value.
    // This is a user-directed (as opposed to machine-readable) field,
    // so you can format it however you want.
    string error = 2;
    // Data contains the values for a single record.
    // The values should be provided as a JSON serialized array.
    // For example, if the CSV has a row
    //   17,Alabama,true
    // then the data field should contain the string
    //   "[17,\"Alabama\",true]"
    // However, if you have not inferred the types of the properties,
    // it's OK to make everything a string, which would be
    //   "[\"17\",\"Alabama\",\"true\"]"
    string data = 3;
}
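
// Illustrative sketch only: a PublishRecord in protobuf text format for a row
// that violates an inferred schema. It assumes hypothetical columns
// id (integer), state (string), and active (boolean), and the CSV row
//   abc,Alabama,true
// The record is still published; the offending value is written as null and
// the error names the property and the original value. The wording of `error`
// is free-form, so the message below is just one possibility.
//
//   invalid: true
//   error: "property \"id\": expected integer, got \"abc\""
//   data: "[null,\"Alabama\",true]"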