Use database store for media library #316

bennothommo · 2021-09-20T15:00:23Z

This PR will use the database for tracking file and folder metadata, and adds the grass-roots functionality to allow extension of media item metadata (#16)

The benefits of using the DB to handle this sort of data include:

Databases tend to be faster and better at sorting and filtering, contrarily most filesystems must be read in their entirety before they can have the same sorting and filtering done.
There's no real way for us to paginate the filesystem, whereas using the database could allow us to limit records and retrieve a file list in batches - this will assist with resolving issues with large media libraries (reported several times in the October repo).
Easily extendable.

The media manager will scan the file system on first use and populate the media metadata table - this has not yet been tested with remote filesystems or an extreme amount of files, but works with around 100 images in a couple of seconds. The intention is to make an Artisan command which can run the scan, but I'm sure optimisations can be made. Subsequent scans will compare the stored metadata with the filesystem and will only update files or folders that have been added, modified or removed.

More details on this PR will be forthcoming once it's closer to completion - it works now for browsing at the least.

LukeTowers · 2021-09-20T15:32:44Z

modules/system/database/migrations/2021_10_02_000026_Db_Media_Items.php

+            $table->integer('parent_id')->unsigned()->nullable();
+            $table->integer('nest_left')->unsigned()->nullable();
+            $table->integer('nest_right')->unsigned()->nullable();
+            $table->integer('nest_depth')->unsigned()->nullable();


I'm concerned keeping the nested tree data in sync with the path would be more complicated than just using the path by itself. We should be able to do queries based on the path, and we wouldn't want order inside of a path to be determined by integers in the database rather than the file attributes like path, size, date modified, etc.

I felt using nested set will make it easy to modify hierarchical data in bulk, as opposed to using any form of parent_id or trying to substring a path to work out multiple levels of child structure.

I definitely wasn't going to use it for ordering, that'll just be name, size, date modified as it is now.

Well I'll wait to see the implementation but I'm concerned about the reliability

I've used its abilities here:
https://github.com/wintercms/winter/pull/316/files#diff-cd9950fa78d52abcf50430faf1b4d47b0ece142fbef305253728d05ac684d7d1R170
and here:
https://github.com/wintercms/winter/pull/316/files#diff-cd9950fa78d52abcf50430faf1b4d47b0ece142fbef305253728d05ac684d7d1R207
so far. It's been really good for traversing the whole structure underneath a particular item.

A competing CMS solved this using a variation of path enumeration:
https://github.com/drupal/drupal/blob/9.3.x/core/lib/Drupal/Core/Menu/MenuTreeStorage.php#L529 (they do limit the depth to 10 levels)

Also based on the below comparison nested sets are somewhat more "difficult"

@lex0r really sorry for the delay in responding to this.

Thanks for posting that, however, I slightly disagree with what it puts forward. While nested sets might be hard to wrangle if you build the functionality from scratch, we already have a NestedTree trait which does all the heavy lifting, so it's trivial for us to use.

All the actions - querying, deleting, inserting and moving - are simple through this API.

I'm not quite sure what the "referential integrity" column refers to, however, my previous experience with nested trees finds them to be incredibly reliable and auditable. They can even be easily repaired.

The SimpleTree trait handles the "Adjacency List" design you mention, but as your diagram mentions, it's harder to query a subtree, and more to the point, that's a design fault - you have to run several queries in order to find the full subtree, which is not useful in this case if we want to do filtering and searching. Nested set, by design, handles this flawlessly.

Hi @bennothommo
I agree with your argument that NestedTree trait is trivial to use, and I understand your experience using them may be positive, but I'd like to share my experience with them in a real world project. To keep it short, nested sets don't survive parallel modifications of the tree structure, and if that happens they are guaranteed to get corrupted. For me it's a big no when it comes to applications used by more than 1 person. I appreciate they are easy to fix (thanks to the parent ref) but why fixing what shouldn't even break?

On the other hand, path enumeration (also known as materialised path) is used in many CMS like Drupal (sorry for repeating that again) and it handles parallel modifications to the trees easily. To make it more attractive, I will mention that I was able to easily implement it using a very slightly modified version of https://github.com/vicklr/materialized-model.

Hope you could give it a try.

@lex0r thanks for the further feedback.

I would think in your case that corruption of the nested set could be easily resolved by implementing table locking, so there would be no race condition on multiple writes. But I digress.

I would have to see the benchmarks, but path enumeration seems to me - based on my knowledge on MySQL - that it might be slow for reading large datasets and querying individual nodes and sub-nodes, which we're likely to have with media storage. I've heard of people with hundreds of thousands of files within their media libraries. The diagram above says its easy, but it's relying on string parsing which is AFAIK not the greatest performance-wise in SQL.

I'll run some tests though - if it does happen to be fast, then I'll certainly consider it. If you have access to some benchmarks of the different types of hierarchical data, that would be most helpful.

@bennothommo benchmarking should help, and an obvious performance improvement for the path enumeration concept is to have all path components as individual columns, eg. p1, p2, p3 .. p10, where pN is an ID rather than a string. This will limit the depth but rarely people need a super deep nesting. Alternatively, having an index on the path is a scalable approach especially when talking about the ID-based path (/1/123/544) not the file path (/content/images/abc). The index will be small as only IDs are stored, which are shorter strings, unlike path component strings, and it will be used in all searches starting with a known prefix (which I guess is always known).

Table locking is a big performance killer, I tried that and it left me surprised with how slow it can be, so can't recommend it as a solution for a modern multi-user CMS.

modules/system/database/migrations/2021_10_02_000026_Db_Media_Items.php

LukeTowers · 2023-08-31T08:42:55Z

modules/system/lang/en/lang.php

+        'folder_size_items' => 'item(s)',
+        'metadata_image_width' => 'Width',
+        'metadata_image_height' => 'Height',
+        'metadata_video_duration' => 'Duration',


metadata should be an array with subitems under it so that keys are metadata.image_width instead of metadata_image_width - this will reduce the unnecessary duplication of the metadata_ prefix.

bennothommo added 5 commits September 20, 2021 13:34

Add media item DB table migration, initial model

ac837c1

Fix migration

4d6bab3

Initial work on media library scanning

81e9b1d

Use synchronisation method for scans

9789e9c

Initial preview of media browsing running from DB

428e240

bennothommo added enhancement PRs that implement a new feature or substantial change Status: In Progress labels Sep 20, 2021

bennothommo added this to the v1.2.0 milestone Sep 20, 2021

bennothommo marked this pull request as draft September 20, 2021 15:00

LukeTowers reviewed Sep 20, 2021

View reviewed changes

modules/system/database/migrations/2021_10_02_000026_Db_Media_Items.php Outdated Show resolved Hide resolved

bennothommo added 6 commits September 21, 2021 15:08

Add metadata array in model, add method to set metadata, fix tests

b1e0b99

Remove trailing slash - PHP 7.2 doesn't support them for params

b621889

Implement search

5eb93cf

Add directory list, bug fixes

c085e4f

Fix potential error

c11ac26

Further fixes to tests

8da2271

LukeTowers assigned LukeTowers and bennothommo Feb 2, 2022

LukeTowers modified the milestones: v1.2.0, v1.2.1 May 12, 2022

bennothommo modified the milestones: v1.2.1, v1.2.2 Sep 12, 2022

LukeTowers modified the milestones: v1.2.2, v1.2.3 Apr 20, 2023

LukeTowers modified the milestones: v1.2.3, v1.2.4 Jul 7, 2023

bennothommo added 3 commits August 31, 2023 14:42

Merge branch 'develop' into wip/db-media-items

23d9797

Update migration, use path enumeration for media items

2bc991f

Fix logic to work more gracefully with Flysystem

bcbcd68

Use new scope name

f3ff8cc

LukeTowers reviewed Aug 31, 2023

View reviewed changes

LukeTowers modified the milestones: v1.2.4, 1.2.5 Dec 27, 2023

mjauvin modified the milestones: 1.2.5, 1.2.6 Feb 18, 2024

LukeTowers modified the milestones: 1.2.6, 1.3.0 Apr 25, 2024

bennothommo removed the Status: In Progress label Oct 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Use database store for media library #316

Use database store for media library #316

Uh oh!

bennothommo commented Sep 20, 2021 •

edited by gitpod-io bot

Loading

Uh oh!

LukeTowers Sep 20, 2021

Uh oh!

bennothommo Sep 21, 2021

Uh oh!

LukeTowers Sep 21, 2021

Uh oh!

bennothommo Sep 22, 2021

Uh oh!

lex0r Oct 1, 2021 •

edited

Loading

Uh oh!

bennothommo Feb 7, 2022

Uh oh!

lex0r Mar 21, 2023 •

edited

Loading

Uh oh!

bennothommo Mar 27, 2023

Uh oh!

lex0r Mar 29, 2023 •

edited

Loading

Uh oh!

Uh oh!

LukeTowers Aug 31, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Use database store for media library #316

Are you sure you want to change the base?

Use database store for media library #316

Uh oh!

Conversation

bennothommo commented Sep 20, 2021 • edited by gitpod-io bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LukeTowers Sep 20, 2021

Choose a reason for hiding this comment

Uh oh!

bennothommo Sep 21, 2021

Choose a reason for hiding this comment

Uh oh!

LukeTowers Sep 21, 2021

Choose a reason for hiding this comment

Uh oh!

bennothommo Sep 22, 2021

Choose a reason for hiding this comment

Uh oh!

lex0r Oct 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bennothommo Feb 7, 2022

Choose a reason for hiding this comment

Uh oh!

lex0r Mar 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bennothommo Mar 27, 2023

Choose a reason for hiding this comment

Uh oh!

lex0r Mar 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

LukeTowers Aug 31, 2023

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

bennothommo commented Sep 20, 2021 •

edited by gitpod-io bot

Loading

lex0r Oct 1, 2021 •

edited

Loading

lex0r Mar 21, 2023 •

edited

Loading

lex0r Mar 29, 2023 •

edited

Loading