
geonetwork datadir checker useless ressources #11

Closed
jeanmi151 wants to merge 23 commits into georchestra:master from jeanmi151:datadir_gn_checker

Conversation

jeanmi151 (Collaborator) commented May 27, 2025

The aim of this PR is to add a checker that spots files which are no longer needed because GeoNetwork forgot to delete them.

It checks the database records (the metadata table) and searches in /mnt/geonetwork_datadir/data/metadata_data/
(the value set here: https://github.com/georchestra/datadir/blob/docker-master/geonetwork/geonetwork.properties#L2)

@jeanmi151 jeanmi151 marked this pull request as draft May 27, 2025 14:37
@jeanmi151 jeanmi151 changed the title from "first tries for geonetwork datadir checker useless ressources" to "geonetwork datadir checker useless ressources" on May 27, 2025
landryb (Member) commented Jun 27, 2025

i know this is wip/draft, but in the current state of things, this is not at all a background task sent to celery, so if the gn datadir is huge the webpage won't be sent to the client until the size is calculated. and it'll get worse if we add more checks, like checksumming all files and reporting duplicates for example.

And if it's not a proper celery task like others there's no point in adding a card on the home template, which relies on fetching the result of the task from the redis backend.
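(Editorial sketch, to make the cost concrete: the expensive part is a full recursive stat of every file under the datadir, along the lines of the pathlib helper visible in the diff later in this thread. Everything below is illustrative, not the PR's final code.)

```python
from pathlib import Path

def get_folder_size(folder):
    # Recursively stat every file under folder and sum the sizes.
    # This is O(number of files), so on a huge datadir it must not run
    # inside the request handler; hand it to a Celery worker instead.
    return sum(f.stat().st_size for f in Path(folder).rglob("*") if f.is_file())
```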

landryb added a commit that referenced this pull request Jul 1, 2025
landryb (Member) commented Jul 1, 2025

I've added a README in https://github.com/georchestra/gaia/tree/master/geordash/checks trying to document how to add the async task via celery

@jeanmi151 jeanmi151 force-pushed the datadir_gn_checker branch 2 times, most recently from c61110d to 9bf6ea5 on August 6, 2025
jeanmi151 (Collaborator, Author) commented

I am getting somewhere: [screenshot]

jeanmi151 (Collaborator, Author) commented

Can't figure out why on the home page I am getting this message [screenshot] even though the job has run properly.

I am hitting this in script.js:

```js
 .catch(function(err) { // err: TypeError: mydata.value.forEach is not a function
      $(o["prefix"] + '-abstract').html("<span class='bg-danger text-white'>something went wrong</span>")
    });
  })
```
with this json result from http://localhost:8080/gaia/tasks/lastresultbytask/geordash.checks.gn_datadir.check_gn_meta?taskargs=

```js
{
    args: [],
    completed: null,
    finished: 1756897913.99515,
    ready: true,
    state: "SUCCESS",
    successful: true,
    task: "geordash.checks.gn_datadir.check_gn_meta",
    taskid: "7c53e842-b8b2-4f3a-8b66-200d38486a0f",
    value: { problems: [ {
                path: "/mnt/geonetwork_datadir/data/metadata_data/00100-00199/199",
                size: "28.83 MB",
                type: "UnusedFileRes"
            }, {
                path: "/mnt/geonetwork_datadir/data/metadata_data/00100-00199/154",
                size: "107.07 KB",
                type: "UnusedFileRes"
            }, {
                path: "/mnt/geonetwork_datadir/data/metadata_data/00100-00199/152",
                size: "107.07 KB",
                type: "UnusedFileRes"
            }, {
                path: "Total",
                size: "29.04 MB",
                total: "166.05 MB",
                type: "UnusedFileResTotal"
            }
        ] }}
```

@landryb any ideas ?

landryb (Member) commented Sep 4, 2025

> Can't figure out why on the home page I am getting this message even though the job has run properly. […] with this json result from http://localhost:8080/gaia/tasks/lastresultbytask/geordash.checks.gn_datadir.check_gn_meta?taskargs= […] @landryb any ideas ?

the calls on the homepage assume that the job results come from a grouptask (which is the case for all the other cards on the homepage), and thus loop over results to accumulate the problems count in https://github.com/georchestra/gaia/blob/master/geordash/static/js/script.js#L23.

In your case the task is a single task, so value is a dict and not an array of results, and the js blows up.
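(One low-touch way out, sketched here as an illustration rather than the PR's actual fix, is to normalize the payload server-side before the homepage JS consumes it; the helper name is hypothetical.)

```python
def normalize_task_value(value):
    # The homepage cards iterate with value.forEach(...), which assumes a
    # list of per-subtask results (the grouptask shape). A single task
    # returns one dict, so wrap it to keep the JS loop happy.
    if isinstance(value, dict):
        return [value]
    return value
```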

landryb (Member) commented Sep 4, 2025

This is starting to look good ! i still see some hardcoded bits for the GN database name/host that should come from the new geonetwork.properties parser you rightfully added :)

maybe another cosmetic nitpick, but the whole Bytesize python class in the task looks out of place here; in my view the jobs should just return values 'as raw as possible' (eg an amount of bytes), and it's jinja's job to format them for human consumption.. quickly looking at the jinja docs, filesizeformat is the filter that should do it for you, eg {{ raw_value_in_bytes | filesizeformat }} in the template would let you drop some code :)

jeanmi151 (Collaborator, Author) commented

> This is starting to look good ! […] filesizeformat is the filter that should do it for you, eg {{ raw_value_in_bytes | filesizeformat }} in the template would allow to drop some code :)

I can't make the filesizeformat filter work; I'm getting a "filesizeformat is not defined" error. Maybe I'm not using it in the right place.

landryb (Member) commented Sep 12, 2025

> > This is starting to look good ! […] filesizeformat is the filter that should do it for you […]
>
> can't make the filesizeformat function work, getting "filesizeformat is not defined" error, maybe I don't use it in the right place

it should work with jinja2 v3.1.2 in debian12:

```sh
$ python3 -c "import jinja2 ; print(jinja2.Template(\"{{ bytes|filesizeformat }}\").render(bytes=1572864))"
1.6 MB
```

and i don't think special imports are needed, eg if i put {{ 50000|filesizeformat }} somewhere in the home.html template, it is correctly rendered as 50.0 kB
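(Worth noting for the "filesizeformat is not defined" error above: filesizeformat only exists in Jinja's filter namespace, so it is not available as a bare name in Python code. From Python, assuming jinja2 3.x, the same conversion is reachable as jinja2.filters.do_filesizeformat:)

```python
from jinja2.filters import do_filesizeformat

# Same conversion the {{ value|filesizeformat }} template filter performs;
# the default uses decimal units (1 MB = 1,000,000 bytes).
print(do_filesizeformat(1572864))               # -> 1.6 MB
print(do_filesizeformat(1572864, binary=True))  # -> 1.5 MiB
```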

jeanmi151 (Collaborator, Author) commented

all good for me, ready for reviews

@jeanmi151 jeanmi151 marked this pull request as ready for review September 12, 2025 08:12
landryb (Member) left a comment


i havent tested the code at runtime yet...

```python
    "dashboard", __name__, url_prefix="/gaia", template_folder="templates/dashboard"
)

def debug_only(f):
```
landryb (Member) commented:

what is the use of this method ?

jeanmi151 (Collaborator, Author) commented:

It is a new wrapper applied to the function used on line 78 for the /debug route: when debug is enabled it serves the /debug content, otherwise it returns a 404.

jeanmi151 (Collaborator, Author) commented:

I would love to keep this debug route, it was super useful during development.

```python
        parser.read_file(lines)
        self.sections["geonetwork"] = parser["section"]

    def tostr(self):
```
landryb (Member) commented:

i tend to not commit debug-only methods...

jeanmi151 (Collaborator, Author) commented:

Super useful for developing :) (and debugging ;) ). I wish to keep it.

landryb (Member) commented Oct 17, 2025

finally looking at it, i think geordash.checks.gn_datadir is missing from the imports in celeryconfig.py.example, so that the celery worker finds the job
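(For reference, Celery discovers task modules through the imports setting, so the fix would presumably be a line like the following in celeryconfig.py.example; the mapstore entry is only an illustration of an existing check module, not a claim about the file's exact contents.)

```python
# celeryconfig.py (sketch): modules listed in `imports` are loaded by the
# worker at startup so the task functions they define get registered.
imports = (
    "geordash.checks.mapstore",    # existing check module (illustrative)
    "geordash.checks.gn_datadir",  # the new datadir check from this PR
)
```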

landryb (Member) commented Oct 17, 2025

will check monday, but it blew up at runtime; i guess it didn't find the database/schema:

```
Unable to load celery application.
While trying to load the module make_celery the following error occurred:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedTable: relation "geonetwork.metadata" does not exist
LINE 2: FROM geonetwork.metadata
```

in my case iirc, geonetwork has its own database so the tables are in the public schema

landryb (Member) commented Oct 20, 2025

@jeanmi151:

```python
m = MetaData(schema=conf.get("geonetworkSchema"))
```

afaict, this geonetworkSchema variable doesn't exist in the datadir, be it the docker-master or master branch. the docker-master branch has

```
jdbc.connectionProperties=currentSchema=geonetwork
```

which is another hack ? i don't have it on my instance (probably because i use the default public schema in its own db) so i dunno its usecase.. anyway the right var is jdbc.schema

also, it should first use the jdbc.* vars in geonetwork.properties... while i see you went for the vars from default.properties (which are only a fallback, and i'm not even 100% sure geonetwork falls back to it if it doesn't find the jdbc.* ones in geonetwork.properties)
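(Editorial aside: java .properties files are flat key=value pairs with no INI section header, which is presumably why the parser in this PR prepends a fake section before handing the lines to configparser. A self-contained sketch of that trick, with the "geonetwork.properties first, default.properties as fallback" lookup described above; all names here are hypothetical.)

```python
import configparser

def parse_properties(text):
    # .properties files have no [section] header, so prepend a dummy one
    # to make configparser accept them.
    parser = configparser.ConfigParser()
    parser.read_string("[props]\n" + text)
    return dict(parser["props"])

def lookup(gn_props, default_props, key, fallback=None):
    # Prefer geonetwork.properties, fall back to default.properties.
    return gn_props.get(key, default_props.get(key, fallback))

gn = parse_properties("jdbc.database=geonetwork\njdbc.schema=public\n")
defaults = parse_properties("jdbc.schema=geonetwork\njdbc.port=5432\n")
print(lookup(gn, defaults, "jdbc.schema"))  # public (geonetwork.properties wins)
print(lookup(gn, defaults, "jdbc.port"))    # 5432 (only in default.properties)
```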

oh and ... the definition of the Metadata class hardcodes __table_args__ = {"schema": "geonetwork"} ?

landryb (Member) commented Oct 20, 2025

with the following diff, the job can apparently connect to the public schema of my geonetwork database (haven't tried running the job yet), taking params from geonetwork.properties:

```diff
--- i/geordash/checks/gn_datadir.py
+++ w/geordash/checks/gn_datadir.py
@@ -33,16 +33,6 @@ import jinja2
 
 Base = declarative_base()
 
-# Define the Metadata model (example schema of a GeoNetwork metadata table)
-class Metadata(Base):
-    __tablename__ = "metadata"
-    __table_args__ = {"schema": "geonetwork"}
-    id = Column(Integer, primary_key=True)
-    uuid = Column(String, unique=True)
-    data = Column(Text)  # Metadata content (e.g., XML or JSON)
-    schemaid = Column(String)  # Metadata schema (e.g., ISO 19115)
-    isharvested = Column(Integer)
-
 def get_folder_size(folder):
     return sum(file.stat().st_size for file in Path(folder).rglob('*'))
 
@@ -61,11 +51,11 @@ class GeonetworkDatadirChecker:
     def __init__(self, conf):
         url = URL.create(
             drivername="postgresql",
-            username=conf.get("pgsqlUser"),
-            host=conf.get("pgsqlHost"),
-            port=conf.get("pgsqlPort"),
-            password=conf.get("pgsqlPassword"),
-            database=conf.get("pgsqlDatabase"),
+            username=conf.get("jdbc.username", "geonetwork"),
+            host=conf.get("jdbc.host", "geonetwork"),
+            port=conf.get("jdbc.port", "geonetwork"),
+            password=conf.get("jdbc.password", "geonetwork"),
+            database=conf.get("jdbc.database", "geonetwork"),
         )
 
         engine = create_engine(url)
@@ -73,12 +63,13 @@ class GeonetworkDatadirChecker:
         self.sessiono = self.sessionm()
 
         # Perform database reflection to analyze tables and relationships
-        m = MetaData(schema=conf.get("geonetworkSchema"))
+        m = MetaData(schema=conf.get("jdbc.schema", "geonetwork"))
         Base = automap_base(metadata=m)
         Base.prepare(
             autoload_with=engine,
-            name_for_collection_relationship=name_for_collection_relationship,
+#            name_for_collection_relationship=name_for_collection_relationship,
         )
+        Metadata = Base.classes.metadata
         self.allmetadatas = self.session().query(Metadata).all()
 
     def session(self):
```
i had to comment out the collection name thing, otherwise it blew up with conflicts, and i don't remember why i had to use it for mapstore.

jeanmi151 (Collaborator, Author) commented

Right, I took the wrong var for the db connection; when everything was in the same database it was working.
I did the change you asked for but can't test it right now in docker (because of the current outage).

jeanmi151 (Collaborator, Author) commented

> ```
> sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "geonetwork.metadata" does not exist
> LINE 2: FROM geonetwork.metadata
> ```
>
> that error is due to the hardcoding of the geonetwork schema in the non-required Metadata class.
>
> if i remove it (as in my diff above), then the next error is about the relationship naming mapper:
>
> ```
> sqlalchemy.exc.ArgumentError: Error creating backref 'guf_userfeedbacks' on relationship 'guf_userfeedbacks.guf_userfeedbacks': property of that name exists on mapper 'mapped class guf_userfeedbacks->guf_userfeedbacks'
> ```
>
> so that's why i had to comment out name_for_collection_relationship in Base.prepare

Seems I found a way to correctly query the database in my last commit. Could you try it ?

landryb (Member) commented Oct 22, 2025

> so that's why i had to comment out name_for_collection_relationship in Base.prepare
>
> Seems I found a way to correctly request the database in my last commit. Could you try it ?

yeah, it works here now on 4.2.8.

now that i've finally been able to test it at runtime, i have a few more remarks: some to fix, some which would be welcome improvements, and some for future work ? so here's a proper review:

  • the button is labeled check all geonetwork datadir configs now, obvious copypaste
  • the presentation of the results could be vastly improved by using a bootstrap table, for consistency with the other pages
  • i don't think there's a point in repeating Folder is useless & the full path to the datadir over and over; eg we could only store the internal md id, and display the full datadir path at the top of the page only once, with its total size ?

```
Folder is useless '/data/webapps/geonetwork/data/metadata_data/98119700-98119799/98119718' with size ' 8.2 kB '
Folder is useless '/data/webapps/geonetwork/data/metadata_data/185357700-185357799/185357767' with size ' 87.9 kB '
Folder is useless '/data/webapps/geonetwork/data/metadata_data/185357700-185357799/185357773' with size ' 91.2 kB '
Folder is useless '/data/webapps/geonetwork/data/metadata_data/184383400-184383499/184383480' with size ' 44.0 MB '
In total '275.3 MB' could be saved on '1.3 GB'
```

  • why the quotes/spaces around the sizes ?
  • the display of the card on the homepage works fine (eg 380 errors found in red) but it would be nice to have the amount of entries checked, like the others
  • another thing that bothers me: you fetch the md list from the database at GeonetworkDatadirChecker object creation, which afaict only happens once at gaia startup. If some md is added/deleted, i think that won't be taken into account, so maybe the self.allmetadatas = self.session().query(Metadata).all() line should be moved to the get_meta_list implem (and the class member removed, as now useless...)
  • at some point in the history of geonetwork, there used to be a files table that referenced all those files in the datadir, but now it seems it's empty, and that info is partially available in the metadatafileuploads table (filename & metadataid). afaict your check mostly checks that the currently iterated folder id exists in the metadata table, right ?
  • if so, i fear that it will only detect leftover folders from removed metadatas, but won't detect files that are present in the private/public subdirs but not actually referenced by the metadata..
  • to be fully complete, i think all files found in subfolders should be checked as present in the metadata xml, eg look for <FQDN>/geonetwork/srv/api/records/<uuid>/attachments/<file> in the gmd:graphicOverview & gmd:CI_OnlineResource xml tags ? and/or in the metadatafileuploads table ?
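(To make the semantics discussed in this review concrete: the folder-level check boils down to comparing per-record folder names in each bucket against the ids present in the metadata table. A self-contained sketch, with the datadir layout assumed from the paths shown above and all names hypothetical:)

```python
from pathlib import Path

def find_unused_md_folders(metadata_data_dir, known_md_ids):
    # metadata_data is laid out as <bucket>/<md_id>, e.g. 00100-00199/154.
    # A per-record folder whose id no longer exists in the metadata table
    # is a leftover that GeoNetwork forgot to delete.
    unused = []
    for bucket in Path(metadata_data_dir).iterdir():
        if not bucket.is_dir():
            continue
        for md_folder in bucket.iterdir():
            if md_folder.is_dir() and md_folder.name not in known_md_ids:
                unused.append(md_folder)
    return unused
```

As the review notes, this only catches leftover folders from removed metadatas; files inside the private/public subdirs of a still-existing record would need a separate check against the record's XML or the metadatafileuploads table.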

jeanmi151 (Collaborator, Author) commented

> • the button is labeled check all geonetwork datadir configs now, obvious copypaste

okay

> • the presentation of the results could be vastly improved by using a bootstrap table, for consistency with other pages

I will try to do such a thing but I am not super confident with front-end stuff

> • i don't think there's a point in repeating over and over Folder is useless & the full path to the datadir, eg we could only store the internal md id, and display the

okay

> • why the quotes/spaces around the sizes ?

I removed them

> • the display of the card on the homepage works fine (eg `380 errors found` in red) but it could be nice to have the amount of entries checked like others

Like the total amount of folders ?

> • another thing that bothers me, you fetch the md list from the database at `GeonetworkDatadirChecker` object creation, which afaict only happens once at gaia startup. If some md is added/deleted, i think that won't be taken into account, so maybe the `self.allmetadatas = self.session().query(Metadata).all()` line should be moved to the `get_meta_list` implem (and the class member removed, as now useless...)

Okay, I will correct that

> • at some point in the history of geonetwork, there used to be a `files` table that referenced all those files in the datadir, but now it seems its empty, and that info is partially available in the `metadatafileuploads` table (`filename` & `metadataid`). afaict your check mostly checks that the currently iterated folder id exists in the `metadata` table, right ?

The metadatafileuploads table is not complete enough; yes, we list the metadata and the folders and check for correspondence

> • if so, i fear that it will only detect _leftover folders from removed metadatas_, but wont detect files that are present in the `private`/`public` subdirs, but not actually referenced by the metadata..

"it will only detect leftover folders from removed metadatas" --> yes, that is what we are aiming for here
"the private/public subdirs, but not actually referenced by the metadata.." --> right, but this is much more complex; I was trying to start small on this one

> • to be fully complete, i think all files found in subfolders should be checked as present in the metadata xml , eg look for `<FQDN>/geonetwork/srv/api/records/<uuid>/attachments/<file>` in the `gmd:graphicOverview` & `gmd:CI_OnlineResource` xml tags ? and/or in the `metadatafileuploads` table ?

We will keep that for future improvements I think, but it is a good idea
