@bennothommo (Member) commented Sep 20, 2021

This PR will use the database to track file and folder metadata, and lays the groundwork for extending media item metadata (#16).

The benefits of using the DB to handle this sort of data include:

  • Databases tend to be faster and better at sorting and filtering. By contrast, most filesystems must be read in their entirety before the same sorting and filtering can be done.
  • There's no real way for us to paginate the filesystem, whereas the database lets us limit records and retrieve a file list in batches. This will help resolve issues with large media libraries (reported several times in the October repo).
  • Easily extendable.
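
A minimal sketch of the batching idea from the list above, using an in-memory SQLite database through PDO. The `media_metadata` table, its columns and the `fetchPage` helper are assumptions for illustration only, not the schema or API in this PR, and the sketch assumes the `pdo_sqlite` extension is available.

```php
<?php
// Sketch only: the table and column names are assumptions, not the PR's schema.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE media_metadata (path TEXT PRIMARY KEY, size INTEGER)');

// Seed 25 hypothetical files with increasing sizes.
$insert = $db->prepare('INSERT INTO media_metadata (path, size) VALUES (?, ?)');
for ($i = 1; $i <= 25; $i++) {
    $insert->execute([sprintf('/images/photo-%02d.jpg', $i), $i * 1000]);
}

/** Fetch one page of paths, sorted by size (largest first). */
function fetchPage(PDO $db, int $page, int $perPage): array
{
    $stmt = $db->prepare(
        'SELECT path FROM media_metadata ORDER BY size DESC LIMIT :limit OFFSET :offset'
    );
    $stmt->bindValue(':limit', $perPage, PDO::PARAM_INT);
    $stmt->bindValue(':offset', ($page - 1) * $perPage, PDO::PARAM_INT);
    $stmt->execute();

    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

$firstPage = fetchPage($db, 1, 10);
$lastPage = fetchPage($db, 3, 10);
```

The database does the sorting and returns only the requested batch, which is the part a plain directory listing cannot offer.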

The media manager will scan the filesystem on first use and populate the media metadata table. This has not yet been tested with remote filesystems or a very large number of files, but it processes around 100 images in a couple of seconds. The intention is to add an Artisan command that can run the scan, and I'm sure further optimisations can be made. Subsequent scans will compare the stored metadata with the filesystem and update only the files or folders that have been added, modified or removed.
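
The comparison step described above could look roughly like the following. This is a sketch, not the PR's implementation: the `diffMedia` helper is hypothetical, and using the last-modified timestamp as the change marker is an assumption.

```php
<?php
/**
 * Compare stored metadata with a fresh filesystem listing.
 * Both arrays map path => last-modified timestamp (an assumed change marker).
 */
function diffMedia(array $stored, array $onDisk): array
{
    // Present on disk but not in the table.
    $added = array_keys(array_diff_key($onDisk, $stored));
    // Present in the table but gone from disk.
    $removed = array_keys(array_diff_key($stored, $onDisk));

    // Present in both, but with a different timestamp.
    $modified = [];
    foreach (array_intersect_key($onDisk, $stored) as $path => $mtime) {
        if ($mtime !== $stored[$path]) {
            $modified[] = $path;
        }
    }

    return ['added' => $added, 'modified' => $modified, 'removed' => $removed];
}

$stored = ['/a.jpg' => 100, '/b.jpg' => 200, '/old.jpg' => 300];
$onDisk = ['/a.jpg' => 100, '/b.jpg' => 250, '/new.jpg' => 400];
$changes = diffMedia($stored, $onDisk);
```

Only the paths listed in `$changes` would need to be written back to the table; unchanged entries such as `/a.jpg` are skipped.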

More details on this PR will be forthcoming once it's closer to completion - it works now for browsing at the least.

@bennothommo added the "enhancement" (PRs that implement a new feature or substantial change) and "Status: In Progress" labels Sep 20, 2021
@bennothommo added this to the v1.2.0 milestone Sep 20, 2021
@bennothommo marked this pull request as draft September 20, 2021 15:00
$table->integer('parent_id')->unsigned()->nullable();
$table->integer('nest_left')->unsigned()->nullable();
$table->integer('nest_right')->unsigned()->nullable();
$table->integer('nest_depth')->unsigned()->nullable();
Member

I'm concerned that keeping the nested tree data in sync with the path would be more complicated than just using the path by itself. We should be able to query based on the path, and we wouldn't want the order inside a path to be determined by integers in the database rather than by file attributes like path, size, date modified, etc.

Member Author

I felt that using a nested set would make it easy to modify hierarchical data in bulk, as opposed to using some form of parent_id or trying to substring a path to work out multiple levels of child structure.

I definitely wasn't going to use it for ordering, that'll just be name, size, date modified as it is now.

Member

Well, I'll wait to see the implementation, but I'm concerned about the reliability.

@lex0r (Contributor) commented Oct 1, 2021

A competing CMS solved this using a variation of path enumeration:
https://github.com/drupal/drupal/blob/9.3.x/core/lib/Drupal/Core/Menu/MenuTreeStorage.php#L529 (they do limit the depth to 10 levels)

Also, based on the comparison below, nested sets are somewhat more "difficult":
[Image: comparison of hierarchical data models; not reproduced here]

Member Author

@lex0r really sorry for the delay in responding to this.

Thanks for posting that. However, I slightly disagree with what it puts forward. While nested sets might be hard to wrangle if you build the functionality from scratch, we already have a NestedTree trait which does all the heavy lifting, so it's trivial for us to use.

All the actions - querying, deleting, inserting and moving - are simple through this API.

I'm not quite sure what the "referential integrity" column refers to. In my previous experience, however, nested trees have been incredibly reliable and auditable. They can even be easily repaired.

The SimpleTree trait handles the "Adjacency List" design you mention, but as your diagram notes, it's harder to query a subtree with it. More to the point, that's a design fault: you have to run several queries to find the full subtree, which is not useful in this case if we want to do filtering and searching. Nested set, by design, handles this flawlessly.
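
To make the difference concrete, here is a small sketch over plain arrays rather than the NestedTree or SimpleTree traits themselves. The rows are hypothetical; the `nest_left`, `nest_right` and `parent_id` names follow the migration in this PR, while the `path` column and both helper functions are assumptions.

```php
<?php
// Hypothetical rows for a small folder tree.
$rows = [
    ['id' => 1, 'parent_id' => null, 'path' => '/', 'nest_left' => 1, 'nest_right' => 10],
    ['id' => 2, 'parent_id' => 1, 'path' => '/images', 'nest_left' => 2, 'nest_right' => 7],
    ['id' => 3, 'parent_id' => 2, 'path' => '/images/icons', 'nest_left' => 3, 'nest_right' => 4],
    ['id' => 4, 'parent_id' => 2, 'path' => '/images/photos', 'nest_left' => 5, 'nest_right' => 6],
    ['id' => 5, 'parent_id' => 1, 'path' => '/docs', 'nest_left' => 8, 'nest_right' => 9],
];

/**
 * Nested set: a single range comparison selects every descendant.
 * In SQL this is one query: WHERE nest_left > ? AND nest_right < ?
 */
function descendantsByNestedSet(array $rows, array $node): array
{
    $found = array_filter($rows, fn (array $r): bool =>
        $r['nest_left'] > $node['nest_left'] && $r['nest_right'] < $node['nest_right']);

    return array_column($found, 'path');
}

/**
 * Adjacency list: one pass per tree level, which would be one query per
 * level in SQL (unless recursive queries are used).
 */
function descendantsByParentId(array $rows, int $id): array
{
    $paths = [];
    $frontier = [$id];
    while ($frontier) {
        $children = array_filter($rows, fn (array $r): bool =>
            in_array($r['parent_id'], $frontier, true));
        $frontier = array_column($children, 'id');
        $paths = array_merge($paths, array_column($children, 'path'));
    }

    return $paths;
}

$viaNestedSet = descendantsByNestedSet($rows, $rows[1]); // subtree of /images
$viaParentId = descendantsByParentId($rows, 2);
```

Both calls return the same two folders under `/images`; the difference is the number of round trips needed to get there.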

@lex0r (Contributor) commented Mar 21, 2023

Hi @bennothommo
I agree with your argument that the NestedTree trait is trivial to use, and I understand your experience with nested sets may be positive, but I'd like to share my experience with them in a real-world project. To keep it short, nested sets don't survive parallel modifications of the tree structure; when that happens, they are guaranteed to get corrupted. For me that's a big no for applications used by more than one person. I appreciate they are easy to fix (thanks to the parent ref), but why fix what shouldn't even break?

On the other hand, path enumeration (also known as materialised path) is used in many CMSs such as Drupal (sorry for repeating that) and it handles parallel modifications to the tree easily. To make it more attractive, I'll mention that I was able to implement it easily using a very slightly modified version of https://github.com/vicklr/materialized-model.

Hope you can give it a try.

Member Author

@lex0r thanks for the further feedback.

I would think in your case that corruption of the nested set could be easily resolved by implementing table locking, so there would be no race condition on multiple writes. But I digress.

I would have to see the benchmarks, but based on my knowledge of MySQL, path enumeration seems like it might be slow for reading large datasets and querying individual nodes and sub-nodes, which we're likely to have with media storage. I've heard of people with hundreds of thousands of files in their media libraries. The diagram above says it's easy, but it relies on string parsing, which AFAIK is not the greatest performance-wise in SQL.

I'll run some tests though - if it does happen to be fast, then I'll certainly consider it. If you have access to some benchmarks of the different types of hierarchical data, that would be most helpful.

@lex0r (Contributor) commented Mar 29, 2023

@bennothommo benchmarking should help. An obvious performance improvement for path enumeration is to store each path component in its own column, e.g. p1, p2, p3 .. p10, where pN is an ID rather than a string. This limits the depth, but people rarely need very deep nesting. Alternatively, indexing the path is a scalable approach, especially for an ID-based path (/1/123/544) rather than the file path (/content/images/abc). The index will be small because only IDs are stored, which are shorter than path component strings, and it will be used in all searches starting with a known prefix (which I guess is always known).

Table locking is a big performance killer. I tried it and was surprised by how slow it can be, so I can't recommend it as a solution for a modern multi-user CMS.
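
For reference, the ID-based path idea can be sketched over plain arrays as follows. This is not the API of vicklr/materialized-model or of Drupal; the `id_path` column, the trailing-slash convention (so that `/1/12/` is never mistaken for a prefix of `/1/123/`) and both helpers are assumptions made for illustration.

```php
<?php
// Hypothetical rows keyed by id; 'id_path' is an assumed column holding the ID-based path.
$rows = [
    1   => ['name' => 'content', 'id_path' => '/1/'],
    123 => ['name' => 'images', 'id_path' => '/1/123/'],
    544 => ['name' => 'abc', 'id_path' => '/1/123/544/'],
    77  => ['name' => 'docs', 'id_path' => '/1/77/'],
];

/** Descendants share the ancestor's path as a prefix (in SQL: id_path LIKE '/1/123/%'). */
function descendantIds(array $rows, int $id): array
{
    $prefix = $rows[$id]['id_path'];
    $ids = [];
    foreach ($rows as $rowId => $row) {
        if ($rowId !== $id && strncmp($row['id_path'], $prefix, strlen($prefix)) === 0) {
            $ids[] = $rowId;
        }
    }

    return $ids;
}

/** Moving a node rewrites the path prefix of that node and its descendants only. */
function moveNode(array $rows, int $id, int $newParentId): array
{
    $oldPrefix = $rows[$id]['id_path'];
    $newPrefix = $rows[$newParentId]['id_path'] . $id . '/';
    foreach ($rows as $rowId => $row) {
        if (strncmp($row['id_path'], $oldPrefix, strlen($oldPrefix)) === 0) {
            $rows[$rowId]['id_path'] = $newPrefix . substr($row['id_path'], strlen($oldPrefix));
        }
    }

    return $rows;
}

$underImages = descendantIds($rows, 123);
$moved = moveNode($rows, 123, 77); // move "images" under "docs"
```

A move touches only the rows inside the moved subtree, whereas a nested set has to renumber the left and right values of rows outside it as well.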

@LukeTowers modified the milestones: v1.2.0, v1.2.1 May 12, 2022
@bennothommo modified the milestones: v1.2.1, v1.2.2 Sep 12, 2022
@LukeTowers modified the milestones: v1.2.2, v1.2.3 Apr 20, 2023
@LukeTowers modified the milestones: v1.2.3, v1.2.4 Jul 7, 2023
'folder_size_items' => 'item(s)',
'metadata_image_width' => 'Width',
'metadata_image_height' => 'Height',
'metadata_video_duration' => 'Duration',
Member

metadata should be an array with the sub-items nested under it, so that the keys are metadata.image_width instead of metadata_image_width. This removes the unnecessary repetition of the metadata_ prefix.
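
A sketch of the suggested shape, using only the four keys quoted in the diff above. The `translate` helper is hypothetical and exists only to show how a dot-notation key resolves against the nested array; it is not the framework's translator.

```php
<?php
// Hypothetical excerpt of the lang file with the metadata keys nested.
$lang = [
    'folder_size_items' => 'item(s)',
    'metadata' => [
        'image_width' => 'Width',
        'image_height' => 'Height',
        'video_duration' => 'Duration',
    ],
];

/** Resolve a dot-notation key such as "metadata.image_width". */
function translate(array $lang, string $key): ?string
{
    $value = $lang;
    foreach (explode('.', $key) as $segment) {
        if (!is_array($value) || !array_key_exists($segment, $value)) {
            return null; // unknown key
        }
        $value = $value[$segment];
    }

    return is_string($value) ? $value : null;
}

$width = translate($lang, 'metadata.image_width');
```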

@LukeTowers modified the milestones: v1.2.4, 1.2.5 Dec 27, 2023
@mjauvin modified the milestones: 1.2.5, 1.2.6 Feb 18, 2024
@LukeTowers modified the milestones: 1.2.6, 1.3.0 Apr 25, 2024
Labels

enhancement PRs that implement a new feature or substantial change

Projects

Status: In Development

5 participants