Skip to content

Conversation

@bradenmacdonald
Copy link
Member

@bradenmacdonald bradenmacdonald commented Mar 11, 2025

This PR refactors Unit so it's just a proxy model for Container and doesn't use a separate table. There is a new container_type column on the Container object to specify that it's a unit (or not).

One thing to note is that proxy models like Unit (after this refactor) cannot declare fields:

class UnitVersion(ContainerVersion):
    """
    A UnitVersion is a ContainerVersion, which is a PublishableEntityVersion.
    """
    class Meta:
        proxy = True

    # ❌ This is forbidden - proxy models cannot declare fields:
    # details = models.ForeignKey("UnitMetadata", on_delete=models.CASCADE)

class UnitMetadata(models.Model):

    foo_data = case_sensitive_char_field(max_length=500)

    # ✅ This is allowed, but the database only partially enforces it - e.g. would allow this to point to a non-Unit ContainerVersion
    unit_version = models.ForeignKey(UnitVersion, on_delete=models.CASCADE)

@ormsbee
Copy link

ormsbee commented Mar 11, 2025

Maybe it would be more intuitive to just use regular multi-table inheritance then? We'd spawn a bunch of per-container tables, but it'll probably be easier for folks to grok?

(Really, I'm just looking for excuses to use every ORM feature out there at this point. 😜

@bradenmacdonald
Copy link
Member Author

@ormsbee I tried sketching out a multi-table inheritance version, but it doesn't seem to play nicely with our publishing mixins :/ #17 If you have any ideas on how to get past those bugs or refactor to avoid them, I'm all ears. Otherwise I can't seem to get it working cleanly.

@bradenmacdonald
Copy link
Member Author

I actually like the approach in this PR, and think it would work well if we're willing to use a JSONField for each container's metadata.

@kdmccormick
Copy link

kdmccormick commented Mar 11, 2025

JSONField for each container's metadata

I have a gut reaction against JSONFields in general, but maybe I've been overindexing on that. Concretely, what do we lose with Proxy Models + JSONFields, as compared to what exists now in openedx#278 ? I can think of only two things:

  1. [proxy models] Ability to make a db-level FK to a Unit/Subsection/etc rather than a generic Container. Keep in mind, Django will still allow us to write unit = ForeignKey(Unit) and mypy will understand that to be a Unit rather than just a Container. What we're losing, it seems, is just the db-level validation (and perhaps a performance hit, though I'm not knowledgeable enough to say that for sure).
  2. [json fields] Ability to efficiently filter and sort on a Unit/Subsection/etc-specific metadata field. For example, Unit.objects.filter(discussions_enabled=True). This is why I had suggested <Container>Metadata tables rather than a JSON field. But, it's possible that these queries are rare enough that the extra complexity and the extra joins would not be worth the cost.

@bradenmacdonald if you've considered both these drawbacks and see them as no big deal, then that resolves my objections. Curious to hear @ormsbee 's opinion too.

Lastly, I imagine we're going to be in a very similar situation over in Component-land when if we ever want to start putting metadata about ProblemBlocks, VideoBlocks, LTIBlocks, etc. into the database. Would we apply this same solution there? If not, why?

@kdmccormick
Copy link

Oh, I just realized that MySQL has a proper JSON column type as of 5.7. I'd had it in my head that was a Postgres-only thing. So queries like this will run performantly?

Unit.objects.filter(metadata__discussions_enabled=True)

If so, then the drawback of JSONField becomes less about efficiency, and more about validation and type-safety of the metadata schema, which we could instead handle decently well at the Python API layer using attrs or dataclasses.

@bradenmacdonald
Copy link
Member Author

bradenmacdonald commented Mar 11, 2025

@kdmccormick Yes. Queries within JSON fields won't use an index, but they'll run at the database level, pretty efficiently. You can even create an index on some fields inside the JSON value using generated columns or functional indexes, but at that point it's probably better to just use a regular column (Edit: these type of indexes can't be managed easily using Django but MySQL does support it). I've used PostgeSQL JSON fields for years and found them to be great. I imagine MySQL's are similar.

So your second point doesn't apply. I would say the drawbacks of JSON fields for this use case are:

  1. No schema/migrations for the metadata fields. You can still use data migrations of course.
  2. No data type enforcement at the DB level, and no default values at the DB level. (discussions_enabled could be set to a string, for example).
  3. Metadata fields in JSON cannot be foreign keys to other models (well they can, but there's no FK enforcement).

And as you said, the drawback of the proxy models approach is also related to foreign keys: you can't declare any new fields (including foreign keys) on Unit if it's a proxy model, and the DB won't enforce that keys to unit are not made to other container types (but Django will at the python level, I believe).

Also, these approaches (JSON metadata, proxy models) don't have to be coupled. We can use either one without the other.

@ormsbee
Copy link

ormsbee commented Mar 11, 2025

@kdmccormick: It will not run efficiently unless you're running PostgreSQL. It "works" in MySQL insofar as it will ensure that it's valid JSON, but the indexing is not helpful. From the docs:

Indexing
Index and Field.db_index both create a B-tree index, which isn’t particularly helpful when querying JSONField. On PostgreSQL only, you can use GinIndex that is better suited.

@bradenmacdonald
Copy link
Member Author

bradenmacdonald commented Mar 11, 2025

@ormsbee In today's platform, we don't have any way to create indexes on metadata fields like discussion_enabled anyways, right?

I'm not sure our particular access patterns will ever warrant indexed metadata fields. If you're filtering on discussion_enabled, you're probably doing it in a learning context. A course is going to have on the order of 10s or at most low 100s of units within the learning package, and libraries may have more units but generally are searched/filtered using Meilisearch which can index anything we want. So my assumption is that if you can already efficiently filter the set of units down to the current published ones in a given learning context, you don't need indexes beyond that.

Maybe it would be useful when we're talking about components and not units, but even then I'm not sure.

But yeah a query like "of all units in all courses and libraries in the platform, what percent have discussion_enabled" will not be very efficient in a JSON field, but shouldn't be significantly slower than if it were a regular unindexed column.

@ormsbee
Copy link

ormsbee commented Mar 11, 2025

No, we can't efficiently query for discussion_enabled at the moment.

I do share @kdmccormick's misgivings about having a generic catch-all JSON field because it encourages a behavior where the natural pattern is to just lump more stuff in it under a namespace, like how CourseOverview just continues to grow new fields.

I do think that we'll want to use JSONField in areas, but I would like the extension pattern to encourage grouping by models. For instance, maybe there's an XBlockVersionData model that has a JSONField for XBlock field data (or maybe a couple of we want to differentiate settings vs. content scope). That could represent a sort of "compiled" version of the OLX that would allow us to include files, e.g. allow specifying an HTML file and reading that data in. But that's locked in its own model, and isn't going to be shared by a bunch of other apps that want to add their own data, where we're never quite sure who has written what and why.

@bradenmacdonald
Copy link
Member Author

bradenmacdonald commented Mar 11, 2025

I do agree with your concern, though I don't think it's only JSONField that enables the problem you're describing. After all, you use CourseOverview as an example, but it's a full-fledged django model / DB table. I think that fundamentally, we will soon need a systematic mechanism for extending models with additional data, and I personally don't think it matters too much if it's JSONField or several Django models backing it, as long as there is a good extension API surrounding it. People keep throwing fields into CourseOverview because it's way more work to create a new similar model and make it up to date - in fact it's almost impossible, because the extension points simply don't exist.

We also don't need to solve this "modelling data fields for extensions" problem now - we just need to settle on a Units model that will support a nice option in the future.

But from my perspective, such a system should make it easy to add new namespaced data fields (i.e. easy to register a new model like XBlockVersionData), and whether that means that there's a concrete model with foreign keys to ComponentVersion or a namespaced area of a JSONField, what's important is that there's a consistent mechanism, it's namespaced / organized by plugin, it automatically does the right thing regarding versioning / import-export / serialization / REST API access / python API access, and that Learning Core's "Core" actually standardizes a solid subset of common fields so that we can use them in other extensions.


What I like about JSONField is that it makes versioning / import-export / serialization / REST API access / python API access trivial to implement, and it works for all plugins. Implementing the equivalents with proper django models and plugin registration is also very feasible, but definitely more work (though with benefits like schema, indexing, foreign keys).

@bradenmacdonald
Copy link
Member Author

Closing in favor of #17 . I still like JSONFields though :p

@bradenmacdonald bradenmacdonald deleted the refactor-containers branch March 14, 2025 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants