Milestones

  • From our goals discussion: Simple sample data

    To get a good sense of a data set, an analyst is likely to want to take a peek at what it looks like. In normal SQL the equivalent query would be something like:

    SELECT * FROM table LIMIT 10

    With an anonymizing system such as Aircloak Insights, a query like the one above yields little of value: every row inevitably ends up being identifying, so the data gets anonymized away.

    From our first goal of providing statistics over the individual columns, we already know a lot about what the data in the data set looks like. With some additional effort we can go beyond the statistical properties and start producing values that visually resemble the original data. For example, we might capture properties such as:

    - A numerical column always contains two decimal places and represents monetary amounts. It might also be the case that cent values such as 00, 49, 50, 75, and 99 occur with above-average frequency.
    - A categorical string column follows a certain pattern, such as that of a social security number.
    - A categorical column contains email addresses.

    The goal of this phase is to be able to produce small data sets of around 10 rows whose values are within range and of the correct type for the corresponding columns. Furthermore, we want the data to superficially resemble the underlying data and, where possible, to capture simple column dependencies. For example, if we have one column for the car make and another for the model name, it should be possible to capture that Tesla and Cybertruck belong together, as do Ford and F100, and to avoid generating pairings such as Tesla and F100.

    No due date
    19/20 issues closed
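
The sample-data goal above can be illustrated with a minimal Python sketch. It is not the project's implementation: the function `sample_rows`, the tables `PAIR_COUNTS` and `CENT_WEIGHTS`, the price range, and all weights are hypothetical and chosen only for illustration. The sketch assumes that per-column knowledge (observed make/model pairings and cent-value frequencies) has already been obtained in an anonymized form, and it does not perform any anonymization itself. The point it shows is that sampling the (make, model) pair jointly, rather than each column independently, avoids pairings such as Tesla and F100.

```python
import random

# Hypothetical, already-anonymized knowledge about the columns.
# None of these values come from a real data set.
PAIR_COUNTS = {("Tesla", "Cybertruck"): 3, ("Ford", "F100"): 5}
CENT_WEIGHTS = {0: 5, 49: 3, 50: 3, 75: 2, 99: 6}


def sample_rows(pair_counts, cent_weights, price_range=(100, 90000), n=10, seed=0):
    """Return n synthetic rows that respect the make/model dependency."""
    rng = random.Random(seed)
    pairs = list(pair_counts)
    pair_weights = list(pair_counts.values())
    cents = list(cent_weights)
    cents_w = list(cent_weights.values())

    rows = []
    for _ in range(n):
        # Draw make and model together so only observed pairings can occur.
        make, model = rng.choices(pairs, weights=pair_weights, k=1)[0]
        # Whole amount within range; cents drawn from the skewed distribution.
        whole = rng.randint(*price_range)
        cent = rng.choices(cents, weights=cents_w, k=1)[0]
        # Always format with exactly two decimal places.
        rows.append({"make": make, "model": model, "price": f"{whole}.{cent:02d}"})
    return rows


rows = sample_rows(PAIR_COUNTS, CENT_WEIGHTS)
for row in rows:
    print(row)
```

Drawing the pair as one unit is the simplest way to preserve a two-column dependency; dependencies spanning more columns would need a richer model than this sketch provides.
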
  • The aim of this MVP is to get a system in place that is ready to be used from Aircloak Insights. The statistics produced are not expected to be perfect, nor is every column type expected to be supported. However, it is expected that:

    - the system is available as a Docker container that can easily be deployed
    - basic statistics for numeric and categorical text columns can be produced

    Due by January 31, 2020
    9/9 issues closed