WIP: Bootstrap relational database from fixtures #775

MarekSuchanek · 2025-09-22T07:01:55Z

Description

A new feature to support bootstrapping the relational database (PostgreSQL) from JSON fixtures.

Also see follow-up #785

Checklist

TODO:

remove dev migrations from src/main/resources/dev/db/migration (add separate set of dev fixtures, or just use the default ones?) (Use JSON fixtures instead of SQL migrations in development profile #790)
allow specification of multiple resource directories (Enable loading of relational db fixtures from multiple directories #793)
update readme with bootstrapping instructions (Add readme section about initial data (bootstrapping) #794)
add fixture history (Add history for relational database fixtures #797)
adapt factory reset so it does not duplicate schemas (also see todos in source) (Factory reset from fixtures #812)
add tests for populating db from fixtures
remove sql data migrations from test (Bootstrap test data from JSON fixtures instead of SQL migrations #802)
javadoc where missing and needed

src/main/java/org/fairdatapoint/config/properties/BootstrapProperties.java

dennisvang · 2025-09-23T11:51:28Z

Hi @MarekSuchanek there's a typo in the package name: boostrap. Not sure if this is on purpose, e.g. to facilitate refactoring, so I've created #778 instead of pushing directly onto this branch. Could you have a look?

dennisvang · 2025-09-23T12:00:46Z

@MarekSuchanek some general comments/requests:

A few lines (or more) of explanation for the major classes would be very helpful (e.g. javadoc comments).
Some (unit) tests would be helpful.

src/main/java/org/fairdatapoint/service/bootstrap/components/AbstractBootstrapper.java

src/main/java/org/fairdatapoint/service/bootstrap/components/MembershipBootstrapper.java

src/main/java/org/fairdatapoint/service/bootstrap/components/AbstractBootstrapper.java

src/main/java/org/fairdatapoint/service/bootstrap/components/MetadataRecordsBootstrapper.java

data/records/repository.ttl

dennisvang · 2025-09-23T15:39:02Z

Another general question:
Are users (in this context: people setting up an FDP) supposed to be able to override and/or extend all the files in the data dir? Or are some of them only for internal use and not supposed to be overridden (perhaps the _schemas)?

src/main/java/org/fairdatapoint/service/bootstrap/components/MetadataRecordsBootstrapper.java

dennisvang · 2025-09-24T08:57:33Z

@MarekSuchanek Another general question:

Looks like the fixture classes in service/boostrap/fixtures are used in a way very similar to DTOs (although semantically different).
However, do we really need these additional fixture classes and the corresponding mappers?
Shouldn't a fixture be an exact representation of an entity anyway (i.e. of the data in the entity's db table)?
I would expect to see something like spring data repository-populators instead of the fixture classes and mappings.

This also relates to the way the fixtures are stored, e.g. as separate files for each object vs a single file with an array of objects.

dennisvang

Hi @MarekSuchanek,

This looks pretty good, but I've got the feeling it can be simplified considerably.
Perhaps you could take a look at the detailed comments so far and clarify some things?

To summarize:

Please provide explanatory comments in the code, explaining what things are intended to do and especially why they are done in a certain way (at least indicating the patterns used)
What would be the general workflow for overriding/extending default fixtures (using docker)?
Could we use spring data repository populators, so we don't need separate fixture classes and corresponding mapper methods? (basically loading fixtures directly into entities)
Related to the above, there's the option to define multiple related fixture objects in a json array, as opposed to separate files for each object.
Related entities should get their own fixtures.
Would it not be possible to use a single bootstrapper for all fixtures? The trick should be the same every time, except maybe for the records (which use the triple store).

It would be great if you could clarify this.

Also could you check if we can merge #778?

src/main/java/org/fairdatapoint/service/user/UserMapper.java

src/main/java/org/fairdatapoint/service/bootstrap/fixtures/UserFixture.java

data/users/albert-einstein.json

src/main/java/org/fairdatapoint/service/bootstrap/fixtures/UserFixture.java

src/main/java/org/fairdatapoint/service/bootstrap/fixtures/MembershipFixture.java

data/membership/owner.json

data/membership/data-provider.json

src/main/java/org/fairdatapoint/service/bootstrap/components/MembershipBootstrapper.java

src/main/java/org/fairdatapoint/service/membership/MembershipMapper.java

MarekSuchanek · 2025-09-24T16:33:03Z

Thanks for the review, some of the things are mistakes (like the typo or null passed to objectMapper).

I can add comments to the code
The idea was to replace either the /data folder or its subfolders, especially when creating new Docker images (using FROM) for specific use cases. That is also why the goal was to run it only on the first start (when the repos are empty).
I can try to switch to populators. My worry is that you lose flexibility in handling the transformation and become bound to the structure of the DB entities (e.g. you need to know the hashed password).
At first I actually had bootstrappers for API keys, saved search queries, etc., but does it make sense to process them independently? In the JSON files, you would need to create and manage more files and link them by UUIDs that will not get used because those are generated on save, which can lead to issues (e.g. the user does not exist for an API key). For membership permissions, it is even stranger to have the permissions defined somewhere else. What would be the advantage here? Of course, if we switch to populators and get bound to DB entities, it will be like that.
The issue, I guess, is with links to other objects and the different way of mapping...

MarekSuchanek · 2025-09-24T16:36:22Z

Another general question: Are users (in this context: people setting up an FDP) supposed to be able to override and/or extend all the files in the data dir? Or are some of them only for internal use and not supposed to be overridden (perhaps the _schemas)?

I agree that the _schemas dir is probably not in the correct place and that it should be done differently. But it would be good to clarify how people should be able to override/extend this. Originally, this was intended for creating different "prepared" FDP Docker images or other setups that would contain different data at start.

dennisvang · 2025-09-25T15:55:47Z

Thanks for the quick response @MarekSuchanek ! 🙂

I can try to switch to populators. My worry is that you lose flexibility in handling the transformation and become bound to the structure of the DB entities (e.g. you need to know the hashed password).

Not sure about the general consensus, but in my opinion fixtures should be bound to the structure of DB entities.
I consider fixtures to be a form of raw data dump. The fewer mappings in between, the better. But opinions may differ.

At first I actually had bootstrappers for API keys, saved search queries, etc., but does it make sense to process them independently? In the JSON files, you would need to create and manage more files and link them by UUIDs that will not get used because those are generated on save, which can lead to issues (e.g. the user does not exist for an API key). For membership permissions, it is even stranger to have the permissions defined somewhere else. What would be the advantage here? Of course, if we switch to populators and get bound to DB entities, it will be like that.

I think the advantage is uniformity. You can do the same trick every time, without needing special handling for objects with relations. Of course you do need to ensure relational integrity yourself when creating the fixture files. Also the order of fixture loading is important.

This is also how it is done in Python/Django, and in my experience that works really well.

W.r.t "manage more files", with populators you could have an array of objects, so fewer files. E.g. a single users.json file.
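
The points about loading order and relational integrity can be illustrated with a small sketch. It assumes a simplified, hypothetical fixture entry that only knows its own id and the ids it refers to; the class, record, and id values are made up for illustration and are not part of this PR. The check reports every reference to an entry that has not been loaded earlier in the sequence.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the FDP code base.
final class FixtureIntegrity {

    // Simplified fixture entry: its own id plus the ids of the entries it refers to.
    record Entry(String id, List<String> references) {}

    // Returns "id -> reference" for every reference that was not loaded earlier.
    static List<String> danglingReferences(List<Entry> entriesInLoadingOrder) {
        Set<String> loaded = new HashSet<>();
        List<String> dangling = new ArrayList<>();
        for (Entry entry : entriesInLoadingOrder) {
            for (String reference : entry.references()) {
                if (!loaded.contains(reference)) {
                    dangling.add(entry.id() + " -> " + reference);
                }
            }
            loaded.add(entry.id());
        }
        return dangling;
    }

    public static void main(String[] args) {
        Entry user = new Entry("user-1", List.of());
        Entry membership = new Entry("membership-1", List.of("user-1"));
        // Users first: no dangling references.
        System.out.println(danglingReferences(List.of(user, membership)));
        // Membership first: the reference to the user cannot be resolved yet.
        System.out.println(danglingReferences(List.of(membership, user)));
    }
}
```

A check like this could run over the fixture files before anything is written to the database, so that an ordering mistake fails early instead of surfacing as a constraint violation.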

dennisvang · 2025-09-25T15:58:02Z

Another general question: Are users (in this context: people setting up an FDP) supposed to be able to override and/or extend all the files in the data dir? Or are some of them only for internal use and not supposed to be overridden (perhaps the _schemas)?

I agree that the _schemas dir is probably not in the correct place and that it should be done differently. But it would be good to clarify how people should be able to override/extend this. Originally, this was intended for creating different "prepared" FDP Docker images or other setups that would contain different data at start.

Ok thanks. I guess it is also a matter of providing good documentation for this. :)

MarekSuchanek · 2025-10-13T03:00:34Z

@dennisvang I tried populators with your example prototype, and they would indeed simplify this functionality and make it more "universal". At the same time, I spotted a few issues and want to check whether they are OK or whether we need to deal with them somehow:

password hashing = are we going to allow only passwordHash in fixtures, or do we need password plus some special treatment of the entity to create passwordHash (e.g. by having password as Transient and hashing on persist)?
SHACL in string = is it OK to have metadata schema SHACL definitions inside the JSON files (= escaped) instead of a linked RDF Turtle file? Again, with populators we would need some special logic.
schemas = do we still need some JSON schemas for this, or can we simply refer to the entity classes?
remember loaded files = we still want to remember what JSON files were loaded (with hash?), right?

dennisvang · 2025-10-13T10:11:48Z

@dennisvang I tried populators with your example prototype, and they would indeed simplify this functionality and make it more "universal". At the same time, I spotted a few issues and want to check whether they are OK or whether we need to deal with them somehow:

Hi @MarekSuchanek thanks for giving the prototype a try. I'm sure a proper implementation would need a lot more work, but I would be happy to contribute to that.

password hashing = are we going to allow only passwordHash in fixtures, or do we need password plus some special treatment of the entity to create passwordHash (e.g. by having password as Transient and hashing on persist)?

I would prefer only to allow hashed passwords in the fixtures.

Of course we would need to provide some instructions for how to generate a hashed password.

Currently the FDP is configured to use bcrypt:

FAIRDataPoint/src/main/java/org/fairdatapoint/config/PasswordConfig.java

Line 35 in 8ad9994

return new BCryptPasswordEncoder();

Afaik bcrypt and argon2id hashes are fully portable, i.e. they contain all the information needed for verification. For scrypt you would still need to know the correct parameters. In any case, users should be able to use any (offline) tool to generate a hash. This can be done on the Linux command line using the appropriate package. Languages and frameworks also have their own solutions, such as the Spring Boot CLI encodepassword command. There are online tools as well, but I think it is better not to use those.

Another alternative would be to fire up a local fdp, create the user(s) manually, then dump the resulting database content into a json file.
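
To illustrate why a bcrypt hash is portable: the stored string itself carries the algorithm version, the cost factor, and the salt. The sketch below only splits such a string into its parts and does not verify any password. The class name is hypothetical and the sample hash is purely illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that shows the layout of a bcrypt hash string.
final class BcryptHashParts {

    // Layout: $<version>$<cost>$<22-char salt><31-char digest>
    private static final Pattern FORMAT = Pattern.compile(
            "^\\$(2[abxy]?)\\$(\\d{2})\\$([./A-Za-z0-9]{22})([./A-Za-z0-9]{31})$");

    record Parts(String version, int cost, String salt, String digest) {}

    // Splits a bcrypt hash into its parts, or throws if the layout does not match.
    static Parts parse(String hash) {
        Matcher matcher = FORMAT.matcher(hash);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("not a bcrypt hash");
        }
        return new Parts(
                matcher.group(1),
                Integer.parseInt(matcher.group(2)),
                matcher.group(3),
                matcher.group(4));
    }

    public static void main(String[] args) {
        // Illustrative value only.
        Parts parts = parse("$2a$10$N9qo8uLOickgx2ZMRZoMyeIjZAgcfl7p92ldGxad68LJZdL17lhWy");
        System.out.println(parts.version() + " / cost " + parts.cost() + " / salt " + parts.salt());
    }
}
```

Because everything needed for verification is in the string, a passwordHash value generated by any offline tool could be pasted into a fixture file as-is.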

Note

While we're at it, perhaps the v2 release would also be a good opportunity to upgrade the hashing algorithm to argon2id, as I believe that is the current best practice. For example, from OWASP:

To sum up our recommendations:

Use Argon2id with [...]

If Argon2id is not available, use scrypt [...]

[...]
The bcrypt password hashing function should only be used for password storage in legacy systems where Argon2 and scrypt are not available.

For example we can apt install argon2 and then, as a quick example with bare minimum settings:

echo -n "mypassword" | argon2 applysomerandomsalthere -id -k 19456 -t 2 -p 1

Even better would be to use a DelegatingPasswordEncoder to make this future-proof.

Changing the password encoder would be a separate PR of course.
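
The DelegatingPasswordEncoder idea relies on stored values of the form {id}encodedPassword, where the id selects the encoder used for verification. The toy sketch below shows only that prefix lookup in plain Java; it is not Spring Security code, and the class name, the fallback id, and the sample values are assumptions for illustration.

```java
// Hypothetical illustration of the "{id}encodedPassword" storage format.
final class EncoderIdPrefix {

    // Returns the encoder id from the prefix, or the fallback id if there is no prefix.
    static String encoderId(String stored, String fallbackId) {
        if (stored.startsWith("{")) {
            int end = stored.indexOf('}');
            if (end > 1) {
                return stored.substring(1, end);
            }
        }
        return fallbackId;
    }

    public static void main(String[] args) {
        // A prefixed value names its own encoder.
        System.out.println(encoderId("{argon2}$argon2id$v=19$m=19456,t=2,p=1$c2FsdA$aGFzaA", "bcrypt"));
        // An unprefixed legacy value falls back to the assumed default.
        System.out.println(encoderId("$2a$10$N9qo8uLOickgx2ZMRZoMyeIjZAgcfl7p92ldGxad68LJZdL17lhWy", "bcrypt"));
    }
}
```

With such a scheme, existing bcrypt hashes in fixtures could keep working while new accounts use a different algorithm, which is what makes the approach future-proof.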

SHACL in string = is it OK to have metadata schema SHACL definitions inside the JSON files (= escaped) instead of a linked RDF Turtle file? Again, with populators we would need some special logic.

I think it would be good to have a clear separation between the loading of relational db fixtures and RDF (Turtle) fixtures. The RDF would need a dedicated loader anyway, I suppose.

I discussed this with @luizbonino recently, and we were thinking of storing the actual "metadata schema SHACL definitions" directly in a separate repository in the triple store. The fixtures for the relational db could then just contain the URI of the metadata schema.

What are your thoughts on that?

schemas = do we still need some JSON schemas for this, or can we simply refer to the entity classes?

Could you clarify?

remember loaded files = we still want to remember what JSON files were loaded (with hash?), right?

Actually now I'm not so sure this is necessary. If the fixtures include the unique identifier, then we can simply load them always, and they would always be present. Only thing is they should never be modified. If we do need to keep track of history, perhaps a data migration approach would be better.

EDIT: On second thought, you're right: Indeed it would still be necessary to remember which fixtures have been applied, to prevent overwriting changes. For example, if we populate user accounts from fixtures, then any changes made to those accounts by the user, such as a password reset, would be reverted whenever the app restarts.
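
A minimal sketch of remembering applied fixtures, assuming a checksum per file name. The in-memory map stands in for whatever persistent history the application would use, and all names here are hypothetical rather than taken from the PR.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical fixture history; a real one would be stored in the database.
final class FixtureHistory {

    private final Map<String, String> appliedChecksums = new HashMap<>();

    // Hex-encoded SHA-256 checksum of the fixture content.
    static String sha256(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Records the fixture and returns true only the first time a file name is seen.
    boolean markIfNew(String fileName, String content) {
        return appliedChecksums.putIfAbsent(fileName, sha256(content)) == null;
    }

    // True if the file was applied before and its content differs now.
    boolean hasChanged(String fileName, String content) {
        String recorded = appliedChecksums.get(fileName);
        return recorded != null && !recorded.equals(sha256(content));
    }
}
```

On startup, a fixture would be applied only when markIfNew returns true, so changes made later through the application, such as a password reset, are not reverted. hasChanged could be used to warn when an already applied file was edited.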

Co-authored-by: dennisvang <29799340+dennisvang@users.noreply.github.com>

MarekSuchanek · 2025-10-15T04:13:10Z

Hi @dennisvang , so I switched from custom to populators with the last commit (and closed the discussions that are no longer relevant due to that):

password hashing = are we going to allow only passwordHash in fixtures, or do we need password plus some special treatment of the entity to create passwordHash (e.g. by having password as Transient and hashing on persist)?

I would prefer only to allow hashed passwords in the fixtures.

Done. We still need to find a suitable spot for documenting how to create the hashed password. Maybe the docs?

SHACL in string = is it OK to have metadata schema SHACL definitions inside the JSON files (= escaped) instead of a linked RDF Turtle file? Again, with populators we would need some special logic.

I think it would be good to have a clear separation between the loading of relational db fixtures and RDF (Turtle) fixtures. The RDF would need a dedicated loader anyway, I suppose.

I discussed this with @luizbonino recently, and we were thinking of storing the actual "metadata schema SHACL definitions" directly in a separate repository in the triple store. The fixtures for the relational db could then just contain the URI of the metadata schema.

What are your thoughts on that?

That would be good; however, it should be a separate issue, as it will probably have some tricky parts as well (esp. with different versions of metadata schemas, updates, etc.).

schemas = do we still need some JSON schemas for this, or can we simply refer to the entity classes?

I meant the JSON schemas that I tried to prepare for the custom JSON files. Let's keep it simple for now.

remember loaded files = we still want to remember what JSON files were loaded (with hash?), right?

Actually now I'm not so sure this is necessary. If the fixtures include the unique identifier, then we can simply load them always, and they would always be present. Only thing is they should never be modified. If we do need to keep track of history, perhaps a data migration approach would be better.

True, but not sure how that should work for metadata records being loaded from fixtures.

So, could you check and possibly propose changes w.r.t. the populators? Then we still need to resolve the records: aside from the RDF, we need to know whether a record is for the draft or the main repository (could this be decided based on file location?), and we probably want the replacement variable with the persistent URL, as I suggested, somehow...

this is more consistent with e.g. the 'locations' option for flyway

This allows us to include the prefix in the config, which is more flexible. For example, we can set file:fixtures for the default fixtures, which need to be overridable in the docker container, and we can set classpath:test-fixtures for the test fixtures, which can then be included in the test/resources dir. Moreover, this approach is similar to the way flyway.locations are specified.

the populator is done, but that does not necessarily mean any repositories were actually populated

This reverts commit 1ac2639.

* include root logger AppenderRef, so we only need to change the log level, when required
* rename test logging config file for clarity and conformance to log4j2 best practices

this enables us to specify simple directories, specific files, filters like 'fixtures/02*.json', and wildcards like 'fixtures/**/*.json'
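
As a rough illustration of how such location patterns select files, the sketch below uses the glob support in java.nio. This is an approximation: Spring's resource pattern resolver uses Ant-style patterns, where `**/` can also match zero directories, while a java.nio glob such as 'fixtures/**/*.json' only matches files in subdirectories. The class name and file names are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

// Hypothetical demo of glob-based fixture selection using only the JDK.
final class FixturePatternDemo {

    // Checks a concrete relative file path against a java.nio glob pattern.
    static boolean matches(String globPattern, String relativePath) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + globPattern);
        return matcher.matches(Path.of(relativePath));
    }

    public static void main(String[] args) {
        // A filter on a numeric file name prefix.
        System.out.println(matches("fixtures/02*.json", "fixtures/0210_users.json"));
        System.out.println(matches("fixtures/02*.json", "fixtures/0130_users.json"));
        // A wildcard that descends into subdirectories.
        System.out.println(matches("fixtures/**/*.json", "fixtures/extra/0400_settings.json"));
    }
}
```

Only concrete file names are turned into Path objects here; the pattern itself stays a plain string.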

the Path methods caused errors on windows if the location included ':' or '*' characters

description said Nikola Tesla, but uuid was for Albert Einstein

…ure data changed uuid and type for SearchSavedQuery objects to minimize interference with the existing acceptance tests

Due to the addition of DatabaseBootstrapTests, fixture 0130_test-users-with-api-keys-and-saved-queries.json now includes two new SearchSavedQuery objects. One of these new objects replaces the other, so the total number of SearchSavedQuery objects expected in the test database is increased by one.

Although I prefer .org, existing docs and tests expect .com. Also, switching would likely lead to confusion because people will keep trying to log in using the .com addresses.

…t-data-migrations Bootstrap test data from JSON fixtures instead of SQL migrations

MarekSuchanek requested a review from dennisvang September 22, 2025 07:14

MarekSuchanek added the feature Request for new functionality label Sep 22, 2025

WIP: Add bootstrapping from fixtures

dfdb146

MarekSuchanek force-pushed the feature/634-boostrapping-fdp branch from 85a5919 to dfdb146 Compare September 22, 2025 13:16