Schema Driven Processing Language (SDPL)
SDPL introduces a data schema to major data processing languages
such as Apache Pig, Spark and Hive. SDPL supports generic operations such as
LOAD, STORE, JOIN and PROJECT, while complex transformations and fine-tuning
are performed in the target language via quotation.
SDPL links 3 artifacts:
- The DataRepository file describes the data source and the credentials needed to access it
- The Schema file describes the data
- The source code describes what data to load and which transformations to apply
Supported target languages are Apache Pig and Spark. The DataRepository is a short YAML file. The Schema can be read from SDPL YAML, Avro and Protobuf formats.
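To make the relationship between the three artifacts concrete, here is a minimal Python sketch of how a repository description and a schema might be combined into a Pig `LOAD` statement. It does not reproduce SDPL's actual file formats or internals: the dictionary keys, the `pig_load` helper, the host name and the field names are all hypothetical, chosen only for illustration.

```python
# Hypothetical sketch: none of these names or layouts come from SDPL itself.

# Stand-in for a DataRepository file: where the data lives.
data_repository = {
    "host": "hdfs.example.internal",  # placeholder host, assumed
    "port": 8020,
    "root": "/data/warehouse",
}

# Stand-in for a Schema file: ordered field names with Pig types.
schema = [
    ("user_id", "long"),
    ("country", "chararray"),
    ("amount", "double"),
]


def pig_load(relation, table, repo, fields):
    """Render a Pig Latin LOAD statement from a repository and a schema."""
    location = "hdfs://{host}:{port}{root}/{table}".format(table=table, **repo)
    columns = ", ".join("{}:{}".format(name, dtype) for name, dtype in fields)
    return "{} = LOAD '{}' USING PigStorage(',') AS ({});".format(
        relation, location, columns
    )


statement = pig_load("A", "orders", data_repository, schema)
print(statement)
```

The point of the sketch is the separation of concerns: the location and the column list are maintained outside the processing code, so the same schema could in principle be rendered for a different target language by swapping the rendering function.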
Main repository: https://bitbucket.org/mushkevych/sdpl
Mirror: https://github.com/mushkevych/sdpl
- Python 3.5+
- antlr4 package: `$> sudo apt-get install antlr4`
- antlr4-python3-runtime: `$> pip install antlr4-python3-runtime`
- PyYAML: `$> pip install PyYAML`
- Avro: `$> pip install avro-python3`
- Protobuf: `$> pip install protobuf`
`$> antlr4 -Dlanguage=Python3 sdpl.g4`
`$> python3 sdpl.py pig -i tests/snippet_1.sdpl`