Simple project for efficient data creation, parsing, saving and loading, including a robust testing framework.


Author: Roden Derveni, 2025

This package is built as a simple data parser, for reading and writing data of a specific format to .json.

This was built as a project for a coding test. To avoid issues with authorship of provided functions, and to hide this 'solution' from future potential candidates should this task be reused, various parameters have been obscured and functions changed.

In Short:

From a unix-based terminal (Linux/OSX/WSL/VSCode terminal)

To download and install:

  • git clone git@github.com:deniderveni/SimpleDataSaver
  • cd SimpleDataSaver/
  • pip install -e .

To run unit tests:

  • python run_unit_tests.py

To run core functions:

  • python -m datasaver.DataSaver_run

To otherwise import to another script:

  • import datasaver or from datasaver import [specific modules]

Details

File hierarchy

DataSaver
├── datasaver
│   ├── DataSaver.py
│   ├── DataSaver_run.py
│   └── __init__.py
├── pyproject.toml
├── README.md
├── run_unit_tests.py
└── unit_tests
    ├── conftest.py
    ├── DataSaver_test.py
    └── __init__.py

Content summary

  • pyproject.toml : Sets up project build instructions and dependencies
  • run_unit_tests.py : Runs groups of unit tests for different scenarios

datasaver/ Containing the main functions for the project

  • DataSaver.py : Contains all critical functions and necessary library imports for the project
  • DataSaver_run.py : Runs DataSaver. Select an option for the desired functionality - generating data, or loading data

unit_tests/ Containing unit testing functions for some sample data - uses pytest

  • conftest.py : Sets up the unit testing functions and data
  • DataSaver_test.py : Runs unit testing assertions for several small tests regarding data saving and loading

Requirements

This package was developed with:

Python     3.12.6
numpy      2.1.2
pandas     2.2.3
pytest     8.3.5
setuptools 74.1.2

Under the following environment (unlikely to make any difference):

miniconda 24.7.1
pip       24.2
Ubuntu    24.04.2 LTS

It is likely that newer versions and some older versions of these packages would work, but this has not been tested.

How to run

Unit Tests

The unit tests comprise 7 tests regarding file saving, loading and data integrity.

Once installed, a set of tests may be run with: pytest unit_tests/

Two built-in flags are recommended for use here:

  • -v : For increased verbosity, showing the individual tests succeeding/failing
  • --cov=datasaver : To show the pytest code coverage

i.e.: pytest --cov=datasaver -v unit_tests/

Two custom flags are also available for stress-testing the unit tests:

  • --no-flush : Keep generated test files
  • --use-random : Use randomised test data

By default, the unit tests will flush their data directory; --no-flush suppresses this so that persistency behaviour can be inspected. By default, some fixed, simple data is used, and the data behaviour depends specifically on the metadata. For now, --use-random simply changes the date-time stamp in the metadata, but the package treats this as new data.
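As a rough illustration, custom pytest flags like these are typically registered in conftest.py via pytest_addoption. The sketch below is a hedged guess at how unit_tests/conftest.py might do this - only the two flag names come from this README; everything else is an assumption, not the project's actual code:

```python
# Hypothetical sketch of registering the custom flags in conftest.py.
# Only --no-flush and --use-random are taken from the README.
def pytest_addoption(parser):
    # --no-flush: keep generated test files instead of flushing the data dir
    parser.addoption("--no-flush", action="store_true", default=False,
                     help="Keep generated test files")
    # --use-random: randomise the metadata's date-time stamp
    parser.addoption("--use-random", action="store_true", default=False,
                     help="Use randomised test data")

# A fixture would then read a flag back with, e.g.:
#   request.config.getoption("--no-flush")
```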

The helper script run_unit_tests.py runs three sets of unit tests:

  • The default behaviour
  • With --use-random
  • With --use-random and --no-flush

This gives a full stress test of the 3 sets of behaviours.

The helper script will return a successful message if all sub-tests work, or a failure if anything fails. The pytest information is also present in the test output.

This can be run with: python run_unit_tests.py

Main Software - DataSaver

Once installed, the main package may simply be run with: python -m datasaver.DataSaver_run

An additional flag --verbose allows users to optionally print the saved/loaded data to the terminal, i.e.: python -m datasaver.DataSaver_run --verbose

This will immediately prompt with 3 options:

  1 - Generate and save data
  2 - Load data
  3 - Exit

Entering '1':

  • This will use the existing generating function to create some data, and prompt the user for any additional metadata.

    • At this point in time, endless metadata can be requested but this must be entered manually every time.
    • Simply pressing the Enter/Return key will cancel the request and continue with the standard metadata
  • The data will be generated and saved in 3 parts:

    • A hashsum for the metadata will be created; this is appended to a new data/hashsums.csv file, along with a timestamp
    • The metadata will be stored as a .json file in data/metadata/metadata_{hashsum}.json
    • The 'measurement' data will be stored as a .json file in data/readout/metadata_{hashsum}.json

Entering '2':

  • If this program is being run for the first time and/or no saved data exists in the correct format, this will raise an error and exit the program.
  • Otherwise, this will read data/hashsums.csv and give the user a list of available hashsums
  • The user may copy and paste the hashsum in the entry. A valid entry will load the data to the program.

Entering '3':

  • Will immediately exit the code.

DataSaver output

As explained above, by default data will be saved to a new data/ folder. data/metadata/ will contain .json files with all associated metadata for a particular measurement + hashsum combination.

  • Note that the hashsum is only dependent on the metadata. Unless two measurements are made at identical times, this would not produce duplicate hashsums. However, if this is a possibility, the hashsum can either also be based on the data (which could take a very long time if there is a lot of data), or a more robust and simple method would be to iteratively add to the hashsum.

The main data is separately stored in data/readout/ in the following format. I also made a design choice to replace the checkX parameters in the data output with the expected parameter directly, so that the results return as:

```
{
  check: { A : {...}, B : {...} },
  meas:  { time : {...} }
}
```

instead of check : {check1 : ...}. Although this could be reverted if necessary.
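The checkX-to-parameter renaming described above can be sketched as below. This is an illustrative stand-in, not the package's actual function: rename_checks and the idea that the expected parameter names arrive as an ordered list are both assumptions; only the check/checkX key shapes come from this README:

```python
# Hedged sketch of renaming check1, check2, ... to the expected parameter
# names (e.g. A, B). The mapping source (an ordered name list) is assumed.
def rename_checks(readout: dict, expected: list[str]) -> dict:
    out = dict(readout)
    # check1 -> expected[0], check2 -> expected[1], ...
    out["check"] = {
        name: readout["check"][f"check{i + 1}"]
        for i, name in enumerate(expected)
    }
    return out
```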

The B method keeps track of the hashsums in data/hashsums.csv, which is used as a look-up table when loading data.

Notes:

  • DataSaver*.py files are developed so that their functions can be loaded as a library elsewhere if desired

  • The .json file format allows great flexibility for adding metadata and N-dimensional data. With standard Python lists, non-contiguous arrays will never be an issue; however, with the expected numpy.array() this is problematic, as it requires explicit i * j * k [* ...] sizing. There are possible solutions depending on the desired outcomes.

    • This has also been tested with 2-dimensional result data, by creating an np.meshgrid() to write. N-dimensional data is presented as nested lists in Python and JSON.
    • This is expected to be extensible to N dimensions, but will take up drastically more space if a human-readable format is required
  • However, the .json format is not cost-effective when continuously adding new data to the file. Therefore, I made the decision to use the metadata's hashsum as an identifying 'key'.

    • In this case, instead of a key within the .json file, I use it as a file name for the metadata and data files.
    • By storing this key in a .csv file, the read-in time for the .csv file becomes linear and the loading time of the desired dataset is O(1) - i.e. there's no need to load whole datasets when we only want one. This avoids the computational overhead within Python, but adds some minute disk overhead for storing multiple .json files. It also looks a little messy, and relying on file naming is not ideal...
  • The provided helper function Generate_measurement_data() has been minimally modified for this task:

    • In the Case1 case, the check key has been changed to check1. This is to unify the methods for parsing check data in all cases where multiple checkX parameters exist.
    • In all cases, meas has been changed to meas1. This is for the same reason as above, where although no cases exist with multiple measurements, this method now supports additional measX parameters.
  • There are also some TODOs sprinkled throughout. These are minor improvements or caveats that I think would be good to include for robustness eventually
