This package is a simple data parser for reading and writing data of a specific format to .json.
It was built as a project for a coding test. To avoid issues with the authorship of provided functions, and to hide this 'solution' from future candidates if the task is reused, various parameters have been obscured and functions changed.
From a unix-based terminal (Linux/OSX/WSL/VSCode terminal)
To download and install:
git clone git@github.com:deniderveni/SimpleDataSaver
cd SimpleDataSaver/
pip install -e .
To run unit tests:
python run_unit_tests.py
To run core functions:
python -m datasaver.DataSaver_run
To otherwise import to another script:
import datasaver
or
from datasaver import [specific modules]
DataSaver
├── datasaver
│   ├── DataSaver.py
│   ├── DataSaver_run.py
│   └── __init__.py
├── pyproject.toml
├── README.md
├── run_unit_tests.py
└── unit_tests
    ├── conftest.py
    ├── DataSaver_test.py
    └── __init__.py
- pyproject.toml : Sets up project build instructions and dependencies
- run_unit_tests.py : Runs groups of unit tests for different scenarios

datasaver/ : Contains the main functions for the project
- DataSaver.py : Contains all critical functions and necessary library imports for the project
- DataSaver_run.py : Runs DataSaver. Select an option for the desired functionality: generating data or loading data

unit_tests/ : Contains unit testing functions for some sample data; uses pytest
- conftest.py : Sets up the unit testing functions and data
- DataSaver_test.py : Runs unit testing assertions for several small tests regarding data saving and loading
This package was developed with:
Python 3.12.6
numpy 2.1.2
pandas 2.2.3
pytest 8.3.5
setuptools 74.1.2
Under the following environment (unlikely to make any difference):
miniconda 24.7.1
pip 24.2
Ubuntu 24.04.2 LTS
It is likely that newer versions and some older versions of these packages would work, but this has not been tested.
The unit tests contain 7 tests regarding file saving, loading and data integrity.
Once installed, a set of tests may be run with:
pytest unit_tests/
Two built-in flags are recommended for use here:
- -v: For increased verbosity, showing the individual tests succeeding/failing
- --cov=datasaver: To show the pytest code coverage
i.e.:
pytest --cov=datasaver -v unit_tests/
Two custom flags are also available for stress-testing the unit tests:
- --no-flush : Keep generated test files
- --use-random : Use randomised test data
By default, the unit tests will flush their data directory; --no-flush suppresses this behaviour, so that persistence behaviour can be inspected.
By default, some fixed, simple data is used. The data's identity depends specifically on the metadata. For now, --use-random simply changes the date-time stamp in the metadata, but the package treats the result as entirely new data.
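The custom flags above are typically registered in conftest.py. The following is a minimal sketch of that pattern; the field names and the `make_metadata` helper are illustrative assumptions, not the project's actual implementation:

```python
# Illustrative sketch of registering the custom pytest flags; the real
# conftest.py may differ.
import datetime

def pytest_addoption(parser):
    # Registers the two custom flags described above.
    parser.addoption("--no-flush", action="store_true",
                     help="Keep generated test files after the run")
    parser.addoption("--use-random", action="store_true",
                     help="Randomise the date-time stamp in the test metadata")

def make_metadata(use_random: bool) -> dict:
    # Fixed, simple metadata by default; --use-random swaps in a fresh
    # timestamp, which the package treats as entirely new data.
    meta = {"operator": "test", "timestamp": "2024-01-01T00:00:00"}
    if use_random:
        meta["timestamp"] = datetime.datetime.now().isoformat()
    return meta
```

A fixture reading these options via `request.config.getoption(...)` can then hand the metadata to each test.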
The helper script run_unit_tests.py runs three sets of unit tests:
- The default behaviour
- With --use-random
- With --use-random and --no-flush

This gives a full stress test of the three sets of behaviours.
The helper script will return a successful message if all sub-tests work, or a failure if anything fails.
The pytest information is also present in the test output.
This can be run with:
python run_unit_tests.py
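The three-pass structure of the helper script can be sketched as below; this is an assumption about its shape, not the actual run_unit_tests.py:

```python
# Illustrative sketch of a three-pass test runner; the real
# run_unit_tests.py may differ.
import subprocess
import sys

FLAG_SETS = [
    [],                             # default behaviour
    ["--use-random"],               # randomised metadata
    ["--use-random", "--no-flush"]  # randomised metadata, keep files
]

def run_all(test_dir: str = "unit_tests/") -> bool:
    # Each pass is a full pytest invocation; pytest's own output is
    # shown as-is, and the overall result is the AND of all passes.
    ok = True
    for flags in FLAG_SETS:
        result = subprocess.run([sys.executable, "-m", "pytest", test_dir, *flags])
        ok = ok and (result.returncode == 0)
    return ok
```

Calling `run_all()` and printing a success/failure message on its return value reproduces the behaviour described above.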
Once installed, the main package may be simply run with:
- python -m datasaver.DataSaver_run
An additional flag, --verbose, allows users to optionally print the saved/loaded data to the terminal, i.e.:
- python -m datasaver.DataSaver_run --verbose
This will immediately prompt with 3 options:
1 - Generate and save data
2 - Load data
3 - Exit
Entering '1':
- This will use the existing generating function to create some data, and prompt the user for any additional metadata.
  - At this point, any number of additional metadata entries can be requested, but each must be entered manually.
  - Simply pressing the Enter/Return key will cancel the request and continue with the standard metadata.
- The data will be generated and saved in three parts:
  - A hashsum for the metadata will be created; this is appended to a new data/hashsums.csv file, along with a timestamp.
  - The metadata will be stored as a .json file in data/metadata/metadata_{hashsum}.json.
  - The 'measurement' data will be stored as a .json file in data/readout/metadata_{hashsum}.json.
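The three-part save path above can be sketched as follows; the function name, hash choice and example fields are illustrative assumptions, not the actual DataSaver.py code:

```python
# Illustrative sketch of the save path; the real DataSaver.py may differ.
import csv
import hashlib
import json
import os
from datetime import datetime

def save_measurement(metadata: dict, readout: dict, root: str = "data") -> str:
    # The hashsum depends only on the metadata, serialised deterministically.
    blob = json.dumps(metadata, sort_keys=True).encode()
    hashsum = hashlib.sha256(blob).hexdigest()

    os.makedirs(os.path.join(root, "metadata"), exist_ok=True)
    os.makedirs(os.path.join(root, "readout"), exist_ok=True)

    # Append the hashsum and a timestamp to the look-up table.
    with open(os.path.join(root, "hashsums.csv"), "a", newline="") as f:
        csv.writer(f).writerow([hashsum, datetime.now().isoformat()])

    # Metadata and measurement data go to separate per-hashsum .json files.
    with open(os.path.join(root, "metadata", f"metadata_{hashsum}.json"), "w") as f:
        json.dump(metadata, f, indent=2)
    with open(os.path.join(root, "readout", f"metadata_{hashsum}.json"), "w") as f:
        json.dump(readout, f, indent=2)
    return hashsum
```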
Entering '2':
- If this program is being run for the first time and/or no saved data exists in the correct format, this will raise an error and close the program.
- Otherwise, this will read data/hashsums.csv and give the user a list of available hashsums.
- The user may copy and paste a hashsum into the entry. A valid entry will load the corresponding data into the program.
Entering '3':
- Will immediately exit the code.
As explained above, by default data will be saved to a new data/ folder.
data/metadata/ will contain .json files with all associated metadata for a particular measurement + hashsum combination.
- Note that the hashsum depends only on the metadata. Unless two measurements are made at identical times, this should not produce duplicate hashsums. If that is a real possibility, the hashsum could also be based on the data (which could take a very long time for large datasets), or, more simply and robustly, the hashsum could be iteratively extended.
The main data is separately stored in data/readout/ in the following format.
I also made a design choice to replace the checkX parameters in the data output to the expected parameter directly, so that the results return in:
```
{
check: {
A : {...},
B : {...}
},
meas: {
time : {...}
}
}
```

Instead of `check : {check1 : ...}`. Although this could be reverted if necessary.
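This flattening step can be sketched as below; the function name and example keys are illustrative, not the actual DataSaver.py implementation:

```python
# Illustrative sketch of replacing checkN wrappers with the expected
# parameters they contain; the real implementation may differ.
def flatten_checks(raw: dict) -> dict:
    # e.g. {"check": {"check1": {"A": {...}}}} -> {"check": {"A": {...}}}
    out = {}
    for group, entries in raw.items():        # e.g. "check", "meas"
        merged = {}
        for key, value in entries.items():    # e.g. "check1", "check2"
            if isinstance(value, dict):
                merged.update(value)          # promote the inner parameters
            else:
                merged[key] = value
        out[group] = merged
    return out
```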
The B method keeps track of the hashsums in data/hashsums.csv, which is used as a look-up table when loading data.
- `DataSaver*.py` files are developed so that their functions can be loaded as a library elsewhere if desired.
- The `.json` file format allows great flexibility for adding metadata and N-dimensional data. With standard Python lists, non-contiguous arrays will never be an issue; however, with the expected `numpy.array()` this will be problematic, as it requires explicit i * j * k [* ...] sizing. There are possible solutions depending on the desired outcomes.
  - This has also been tested with 2-dimensional result data, by creating an `np.meshgrid()` to write. N-dimensional data is presented as nested lists in Python and JSON.
  - This is expected to be extendible to N dimensions, but will take up drastically more space if a human-readable format is required.
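The nested-list representation described above can be demonstrated with numpy's own conversion helpers; this is a small illustrative example, not project code:

```python
import json
import numpy as np

# A 2-D grid like the np.meshgrid() test case described above.
xx, yy = np.meshgrid(np.arange(3), np.arange(2))

# numpy arrays are not JSON-serialisable directly; .tolist() yields the
# nested-list form that JSON stores for N-dimensional data.
payload = json.dumps({"xx": xx.tolist(), "yy": yy.tolist()})

# Loading reverses the process. np.array() requires the nested lists to be
# rectangular (explicit i * j [* ...] sizing), hence the caveat above
# about non-contiguous arrays.
restored = np.array(json.loads(payload)["xx"])
```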
- However, the `.json` format is not cost-effective when continuously adding new data to the file. Therefore, I made the decision to use the metadata's hashsum as an identifying 'key'.
  - In this case, instead of a key within the `.json` file, I use it as a file name for the metadata and data files.
  - By storing this key in a `.csv` file, the read-in time for the `.csv` file becomes linear and the loading time of the desired dataset is O(1); i.e. there is no need to load whole datasets when we only want one. This avoids the computational overhead within Python, but adds some minute disk overhead for storing multiple .json files. It also looks a little messy, and relying on file naming is not ideal.
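The look-up path this design enables can be sketched as below; the function name is an illustrative assumption, not the actual DataSaver.py code:

```python
# Illustrative sketch of loading one dataset via the hashsum key;
# the real implementation may differ.
import csv
import json
import os

def load_by_hashsum(hashsum: str, root: str = "data") -> dict:
    # One linear scan of the small .csv look-up table validates the key...
    with open(os.path.join(root, "hashsums.csv"), newline="") as f:
        known = {row[0] for row in csv.reader(f) if row}
    if hashsum not in known:
        raise KeyError(f"Unknown hashsum: {hashsum}")
    # ...then only the one requested dataset is read from disk, so loading
    # does not scale with the number of stored measurements.
    path = os.path.join(root, "readout", f"metadata_{hashsum}.json")
    with open(path) as f:
        return json.load(f)
```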
- The provided helper function `Generate_measurement_data()` has been minimally modified for this task:
  - In the `Case1` case, the `check` key has been changed to `check1`. This is to unify the methods for parsing `check` data in all cases where multiple `checkX` parameters exist.
  - In all cases, `meas` has been changed to `meas1`. This is for the same reason as above: although no cases exist with multiple measurements, this method now supports additional `measX` parameters.
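This unified naming means one parser can handle `check` and `meas` groups identically; a small illustrative sketch of that pattern (not the project's actual code):

```python
import re

def group_numbered_keys(data: dict, prefix: str) -> list:
    # Collect check1, check2, ... (or meas1, meas2, ...) in numeric order,
    # so a single loop handles any number of checkX/measX parameters.
    pattern = re.compile(rf"^{prefix}(\d+)$")
    matches = [(int(m.group(1)), data[k])
               for k in data if (m := pattern.match(k))]
    return [value for _, value in sorted(matches)]
```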
- There are also some `TODO:`s sprinkled throughout. These are minor improvements or caveats that I think would be good to include for robustness eventually.