This document describes all supported formats for importing pre-computed data profiling results into Metis.
Data profiles are defined in the data loader config (e.g., adult.json) under the data_profiles key. Each task type supports two import methods:
- Inline values (
values): Define results directly in the JSON config - External file (
file): Reference a CSV or TXT file with the results
{
"loader": "CSV",
"name": "Adult",
"file_name": "adult.csv",
"data_profiles": {
"<task_name>": {
"source": "<source_identifier>",
"file": "path/to/file.csv", // OR
"values": [...] // inline values
}
}
}JSON inline:
"null_count": {
"source": "manual",
"values": [
{"column": "age", "value": 150},
{"column": "income", "value": 42}
]
}CSV file:
column,value
age,150
income,42JSON inline:
"value_length_max": {
"source": "manual",
"values": [
{"column": "name", "value": 50},
{"column": "email", "value": 100}
]
}CSV file:
column,value
name,50
email,100JSON inline:
"constancy": {
"source": "manual",
"values": [
{"column": "status", "value": 0.85}
]
}CSV file:
column,value
status,0.85Values are auto-detected as int, float, bool, or string.
JSON inline:
"most_frequent_value": {
"source": "manual",
"values": [
{"column": "status", "value": "active"},
{"column": "count", "value": 42}
]
}CSV file:
column,value
status,active
count,42JSON inline:
"quartiles": {
"source": "manual",
"values": [
{"column": "age", "Q1": 25.0, "Q2": 35.0, "Q3": 50.0},
{"column": "income", "Q1": 30000, "Q2": 50000, "Q3": 80000}
]
}CSV file:
column,Q1,Q2,Q3
age,25.0,35.0,50.0
income,30000,50000,80000JSON inline:
"equi_width_histogram": {
"source": "manual",
"values": [
{
"column": "age",
"bins": [
{"min": 0, "max": 30, "count": 1500},
{"min": 30, "max": 60, "count": 2200},
{"min": 60, "max": 90, "count": 800}
]
}
]
}CSV file:
column,bin_min,bin_max,count
age,0,30,1500
age,30,60,2200
age,60,90,800
income,0,50000,3000
income,50000,100000,2500JSON inline:
"basic_type": {
"source": "manual",
"values": [
{"column": "age", "value": "numeric"},
{"column": "name", "value": "alphabetic"}
]
}CSV file:
column,value
age,numeric
name,alphabeticValid values for basic_type: numeric, alphabetic, alphanumeric, date, time, mixed, empty
Valid values for data_type: boolean, smallint, int, bigint, numeric, double, date, time, timestamp, varchar, text
Valid values for data_class: code, indicator, text, date/time, quantity, identifier
Valid values for domain: email, url, ssn, date_iso, time, ip_address, zip_code, credit_card, phone, currency, first_name, last_name, full_name, city, state, country, address, postal_code, unknown
JSON inline:
"size": {
"source": "manual",
"values": [
{"column": "price", "value": 10}
]
}CSV file:
column,value
price,10JSON inline:
"patterns": {
"source": "manual",
"values": [
{
"column": "phone",
"patterns": [
{"pattern": "999-999-9999", "count": 500, "frequency": 0.8},
{"pattern": "(999) 999-9999", "count": 100, "frequency": 0.16}
]
}
]
}CSV file:
column,pattern,count,frequency
phone,999-999-9999,500,0.8
phone,(999) 999-9999,100,0.16Pattern codes: A=uppercase, a=lowercase, 9=digit, #=special char, ?=other letter, =space
JSON inline:
"jaccard_similarity": {
"source": "manual",
"values": [
{"column1": "name", "column2": "alias", "value": 0.85}
]
}CSV file:
column1,column2,value
name,alias,0.85JSON inline:
"jaccard_similarity_ngrams": {
"source": "manual",
"values": [
{"column1": "name", "column2": "alias", "n": 2, "value": 0.78}
]
}CSV file:
column1,column2,n,value
name,alias,2,0.78Not importable (returns MinHash objects).
"fd": {
"source": "manual",
"values": [
{"lhs": ["zip"], "rhs": "city"},
{"lhs": ["id"], "rhs": "name"},
{"lhs": ["city", "street"], "rhs": "zip"}
]
}"fd": {
"source": "hyfd",
"file": "outputs/adult_hyfd.txt"
}HyFD/AIDFD output format:
[table.col1]->table.col2
[table.col1, table.col2]->table.col3
Each FD on the same line, space-separated. Table prefix is stripped automatically.
"fd": {
"source": "cfdfinder",
"file": "outputs/adult_cfd.txt"
}CFDFinder output format:
[table.col1]->table.col2#(pattern1);(pattern2)
The pattern tableau is stored as additional metadata.
JSON inline:
"ucc": {
"source": "manual",
"values": [
{"columns": ["id"]},
{"columns": ["name", "birthdate"]}
]
}CSV file:
columns
id
"name,birthdate"JSON inline:
"ind": {
"source": "manual",
"values": [
{
"dependent": ["customer_id"],
"referenced": ["id"],
"referenced_table": "customers"
}
]
}CSV file:
dependent,referenced,referenced_table
customer_id,id,customers
"order_id,product_id","id,id","orders,products"The source field tracks where the data came from:
| Source | Description |
|---|---|
manual |
Manually entered values |
hyfd |
HyFD algorithm output |
aidfd |
AIDFD algorithm output |
cfdfinder |
CFDFinder algorithm output |
computed |
Computed by Metis (automatic) |
imported:<tool> |
Custom import source |
{
"loader": "CSV",
"name": "Adult",
"file_name": "adult.csv",
"data_profiles": {
"fd": {
"source": "hyfd",
"file": "outputs/adult_hyfd.txt"
},
"null_count": {
"source": "manual",
"values": [
{"column": "age", "value": 0},
{"column": "workclass", "value": 1836}
]
},
"equi_width_histogram": {
"source": "manual",
"file": "outputs/adult_histograms.csv"
},
"basic_type": {
"source": "manual",
"values": [
{"column": "age", "value": "numeric"},
{"column": "education", "value": "alphabetic"}
]
}
}
}