
geonetwork datadir checker useless ressources #11

Closed
jeanmi151 wants to merge 23 commits into georchestra:master from jeanmi151:datadir_gn_checker

Conversation

jeanmi151 (Collaborator) commented May 27, 2025

The aim of this PR is to add a checker that spots files which are no longer needed because GeoNetwork forgot to delete them.

It checks the database records (the metadata table) and searches in /mnt/geonetwork_datadir/data/metadata_data/
(the value set here: https://github.com/georchestra/datadir/blob/docker-master/geonetwork/geonetwork.properties#L2)

@jeanmi151 jeanmi151 marked this pull request as draft May 27, 2025 14:37
@jeanmi151 jeanmi151 changed the title from "first tries for geonetwork datadir checker useless ressources" to "geonetwork datadir checker useless ressources" on May 27, 2025
landryb (Member) commented Jun 27, 2025

i know this is wip/draft, but in the current state of things, this is not at all a background task sent to celery, so if the gn datadir is huge the webpage won't be sent to the client until the size is calculated. and it'll get worse if we add more checks, like checksumming all files and reporting duplicates for example.

And if it's not a proper celery task like others there's no point in adding a card on the home template, which relies on fetching the result of the task from the redis backend.
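(Editorial sketch, to make the cost concrete: the expensive part is a full recursive stat of every file under the datadir, along the lines of the pathlib helper visible in the diff later in this thread. Everything below is illustrative, not the PR's final code.)

```python
from pathlib import Path

def get_folder_size(folder):
    # Recursively stat every file under folder and sum the sizes.
    # This is O(number of files), so on a huge datadir it must not run
    # inside the request handler; hand it to a Celery worker instead.
    return sum(f.stat().st_size for f in Path(folder).rglob("*") if f.is_file())
```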

landryb added a commit that referenced this pull request Jul 1, 2025
landryb (Member) commented Jul 1, 2025

I've added a README in https://github.com/georchestra/gaia/tree/master/geordash/checks trying to document how to add the async task via celery

@jeanmi151 jeanmi151 force-pushed the datadir_gn_checker branch 2 times, most recently from c61110d to 9bf6ea5 on August 6, 2025
jeanmi151 (Collaborator, Author) commented

I am getting somewhere: [screenshot]

jeanmi151 (Collaborator, Author) commented

Can't figure out why on the home page I am getting this message [screenshot] even though the job has run properly.

I am hitting this in script.js:

```js
 .catch(function(err) { // err: TypeError: mydata.value.forEach is not a function
      $(o["prefix"] + '-abstract').html("<span class='bg-danger text-white'>something went wrong</span>")
    });
  })
```
with this json result from http://localhost:8080/gaia/tasks/lastresultbytask/geordash.checks.gn_datadir.check_gn_meta?taskargs=

```js
{
    args: [],
    completed: null,
    finished: 1756897913.99515,
    ready: true,
    state: "SUCCESS",
    successful: true,
    task: "geordash.checks.gn_datadir.check_gn_meta",
    taskid: "7c53e842-b8b2-4f3a-8b66-200d38486a0f",
    value: { problems: [ {
                path: "/mnt/geonetwork_datadir/data/metadata_data/00100-00199/199",
                size: "28.83 MB",
                type: "UnusedFileRes"
            }, {
                path: "/mnt/geonetwork_datadir/data/metadata_data/00100-00199/154",
                size: "107.07 KB",
                type: "UnusedFileRes"
            }, {
                path: "/mnt/geonetwork_datadir/data/metadata_data/00100-00199/152",
                size: "107.07 KB",
                type: "UnusedFileRes"
            }, {
                path: "Total",
                size: "29.04 MB",
                total: "166.05 MB",
                type: "UnusedFileResTotal"
            }
        ] }}
```

@landryb any ideas ?

landryb (Member) commented Sep 4, 2025

> Can't figure out why on the home page I am getting this message even though the job has run properly. […] with this json result from http://localhost:8080/gaia/tasks/lastresultbytask/geordash.checks.gn_datadir.check_gn_meta?taskargs= […] @landryb any ideas ?

the calls on the homepage assume that the job results come from a grouptask (which is the case for all the other cards on the homepage), and thus loop over results to accumulate the problems count in https://github.com/georchestra/gaia/blob/master/geordash/static/js/script.js#L23.

In your case the task is a single task, so value is a dict and not an array of results, and the js blows up.
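(One low-touch way out, sketched here as an illustration rather than the PR's actual fix, is to normalize the payload server-side before the homepage JS consumes it; the helper name is hypothetical.)

```python
def normalize_task_value(value):
    # The homepage cards iterate with value.forEach(...), which assumes a
    # list of per-subtask results (the grouptask shape). A single task
    # returns one dict, so wrap it to keep the JS loop happy.
    if isinstance(value, dict):
        return [value]
    return value
```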

landryb (Member) commented Sep 4, 2025

This is starting to look good ! i still see some hardcoded bits for the GN database name/host that should come from the new geonetwork.properties parser you rightfully added :)

maybe another cosmetic nitpick, but the whole Bytesize python class in the task looks out of place here; in my view the jobs should just return values 'as raw as possible' (eg an amount of bytes), and it's jinja's job to format them for human consumption.. quickly looking at the jinja docs, filesizeformat is the filter that should do it for you, eg {{ raw_value_in_bytes | filesizeformat }} in the template would let you drop some code :)

jeanmi151 (Collaborator, Author) commented

> This is starting to look good ! […] filesizeformat is the filter that should do it for you, eg {{ raw_value_in_bytes | filesizeformat }} in the template would allow to drop some code :)

I can't make the filesizeformat filter work; I'm getting a "filesizeformat is not defined" error. Maybe I'm not using it in the right place.

landryb (Member) commented Sep 12, 2025

> > This is starting to look good ! […] filesizeformat is the filter that should do it for you […]
>
> can't make the filesizeformat function work, getting "filesizeformat is not defined" error, maybe I don't use it in the right place

it should work with jinja2 v3.1.2 in debian12:

```sh
$ python3 -c "import jinja2 ; print(jinja2.Template(\"{{ bytes|filesizeformat }}\").render(bytes=1572864))"
1.6 MB
```

and i don't think special imports are needed, eg if i put {{ 50000|filesizeformat }} somewhere in the home.html template, it is correctly rendered as 50.0 kB
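(Worth noting for the "filesizeformat is not defined" error above: filesizeformat only exists in Jinja's filter namespace, so it is not available as a bare name in Python code. From Python, assuming jinja2 3.x, the same conversion is reachable as jinja2.filters.do_filesizeformat:)

```python
from jinja2.filters import do_filesizeformat

# Same conversion the {{ value|filesizeformat }} template filter performs;
# the default uses decimal units (1 MB = 1,000,000 bytes).
print(do_filesizeformat(1572864))               # -> 1.6 MB
print(do_filesizeformat(1572864, binary=True))  # -> 1.5 MiB
```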

jeanmi151 (Collaborator, Author) commented

all good for me, ready for reviews

@jeanmi151 jeanmi151 marked this pull request as ready for review September 12, 2025 08:12
landryb (Member) left a comment


i havent tested the code at runtime yet...

```python
    "dashboard", __name__, url_prefix="/gaia", template_folder="templates/dashboard"
)

def debug_only(f):
```
landryb (Member) commented:

what is the use of this method ?

jeanmi151 (Collaborator, Author) commented:

It is a new wrapper applied to the function used on line 78 for the /debug route: when debug is enabled it serves the /debug content, otherwise it returns a 404.

jeanmi151 (Collaborator, Author) commented:

I would love to keep this debug route, it was super useful during development.

```python
        parser.read_file(lines)
        self.sections["geonetwork"] = parser["section"]

    def tostr(self):
```
landryb (Member) commented:

i tend to not commit debug-only methods...

jeanmi151 (Collaborator, Author) commented:

Super useful for developing :) (and debugging ;) ). I wish to keep it.

landryb (Member) commented Oct 17, 2025

finally looking at it, i think geordash.checks.gn_datadir is missing from the imports in celeryconfig.py.example, so that the celery worker finds the job
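(For reference, Celery discovers task modules through the imports setting, so the fix would presumably be a line like the following in celeryconfig.py.example; the mapstore entry is only an illustration of an existing check module, not a claim about the file's exact contents.)

```python
# celeryconfig.py (sketch): modules listed in `imports` are loaded by the
# worker at startup so the task functions they define get registered.
imports = (
    "geordash.checks.mapstore",    # existing check module (illustrative)
    "geordash.checks.gn_datadir",  # the new datadir check from this PR
)
```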

landryb (Member) commented Oct 17, 2025

will check monday, but it blew up at runtime; i guess it didn't find the database/schema:

```
Unable to load celery application.
While trying to load the module make_celery the following error occurred:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedTable: relation "geonetwork.metadata" does not exist
LINE 2: FROM geonetwork.metadata
```

in my case iirc, geonetwork has its own database so the tables are in the public schema

landryb (Member) commented Oct 20, 2025

@jeanmi151:

```python
m = MetaData(schema=conf.get("geonetworkSchema"))
```

afaict, this geonetworkSchema variable doesn't exist in the datadir, be it the docker-master or master branch. the docker-master branch has

```
jdbc.connectionProperties=currentSchema=geonetwork
```

which is another hack ? i don't have it on my instance (probably because i use the default public schema in its own db) so i dunno its usecase.. anyway the right var is jdbc.schema

also, it should first use the jdbc.* vars in geonetwork.properties... while i see you went for the vars from default.properties (which are only a fallback, and i'm not even 100% sure geonetwork falls back to it if it doesn't find the jdbc.* ones in geonetwork.properties)
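(Editorial aside: java .properties files are flat key=value pairs with no INI section header, which is presumably why the parser in this PR prepends a fake section before handing the lines to configparser. A self-contained sketch of that trick, with the "geonetwork.properties first, default.properties as fallback" lookup described above; all names here are hypothetical.)

```python
import configparser

def parse_properties(text):
    # .properties files have no [section] header, so prepend a dummy one
    # to make configparser accept them.
    parser = configparser.ConfigParser()
    parser.read_string("[props]\n" + text)
    return dict(parser["props"])

def lookup(gn_props, default_props, key, fallback=None):
    # Prefer geonetwork.properties, fall back to default.properties.
    return gn_props.get(key, default_props.get(key, fallback))

gn = parse_properties("jdbc.database=geonetwork\njdbc.schema=public\n")
defaults = parse_properties("jdbc.schema=geonetwork\njdbc.port=5432\n")
print(lookup(gn, defaults, "jdbc.schema"))  # public (geonetwork.properties wins)
print(lookup(gn, defaults, "jdbc.port"))    # 5432 (only in default.properties)
```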

oh and ... the definition of the Metadata class hardcodes __table_args__ = {"schema": "geonetwork"} ?

landryb (Member) commented Oct 20, 2025

with the following diff, the job can apparently connect to the public schema of my geonetwork database (haven't tried running the job yet), taking params from geonetwork.properties:

```diff
--- i/geordash/checks/gn_datadir.py
+++ w/geordash/checks/gn_datadir.py
@@ -33,16 +33,6 @@ import jinja2
 
 Base = declarative_base()
 
-# Define the Metadata model (example schema of a GeoNetwork metadata table)
-class Metadata(Base):
-    __tablename__ = "metadata"
-    __table_args__ = {"schema": "geonetwork"}
-    id = Column(Integer, primary_key=True)
-    uuid = Column(String, unique=True)
-    data = Column(Text)  # Metadata content (e.g., XML or JSON)
-    schemaid = Column(String)  # Metadata schema (e.g., ISO 19115)
-    isharvested = Column(Integer)
-
 def get_folder_size(folder):
     return sum(file.stat().st_size for file in Path(folder).rglob('*'))
 
@@ -61,11 +51,11 @@ class GeonetworkDatadirChecker:
     def __init__(self, conf):
         url = URL.create(
             drivername="postgresql",
-            username=conf.get("pgsqlUser"),
-            host=conf.get("pgsqlHost"),
-            port=conf.get("pgsqlPort"),
-            password=conf.get("pgsqlPassword"),
-            database=conf.get("pgsqlDatabase"),
+            username=conf.get("jdbc.username", "geonetwork"),
+            host=conf.get("jdbc.host", "geonetwork"),
+            port=conf.get("jdbc.port", "geonetwork"),
+            password=conf.get("jdbc.password", "geonetwork"),
+            database=conf.get("jdbc.database", "geonetwork"),
         )
 
         engine = create_engine(url)
@@ -73,12 +63,13 @@ class GeonetworkDatadirChecker:
         self.sessiono = self.sessionm()
 
         # Perform database reflection to analyze tables and relationships
-        m = MetaData(schema=conf.get("geonetworkSchema"))
+        m = MetaData(schema=conf.get("jdbc.schema", "geonetwork"))
         Base = automap_base(metadata=m)
         Base.prepare(
             autoload_with=engine,
-            name_for_collection_relationship=name_for_collection_relationship,
+#            name_for_collection_relationship=name_for_collection_relationship,
         )
+        Metadata = Base.classes.metadata
         self.allmetadatas = self.session().query(Metadata).all()
 
     def session(self):
```
i had to comment out the collection name thing, otherwise it blew up with conflicts, and i don't remember why i had to use it for mapstore.

jeanmi151 (Collaborator, Author) commented

Right, I took the wrong var for the db connection; when everything was in the same database it was working.
I did the change you asked for but can't test it right now in docker (because of the current outage).

jeanmi151 (Collaborator, Author) commented

> ```
> sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "geonetwork.metadata" does not exist
> LINE 2: FROM geonetwork.metadata
> ```
>
> that error is due to the hardcoding of the geonetwork schema in the non-required Metadata class.
>
> if i remove it (as in my diff above), then the next error is about the relationship naming mapper:
>
> ```
> sqlalchemy.exc.ArgumentError: Error creating backref 'guf_userfeedbacks' on relationship 'guf_userfeedbacks.guf_userfeedbacks': property of that name exists on mapper 'mapped class guf_userfeedbacks->guf_userfeedbacks'
> ```
>
> so that's why i had to comment out name_for_collection_relationship in Base.prepare

Seems I found a way to correctly query the database in my last commit. Could you try it ?

landryb (Member) commented Oct 22, 2025

> so that's why i had to comment out name_for_collection_relationship in Base.prepare
>
> Seems I found a way to correctly request the database in my last commit. Could you try it ?

yeah, it works here now on 4.2.8.

now that i've finally been able to test it at runtime, i have a few more remarks: some to fix, some which would be welcome improvements, and some for future work ? so here's a proper review:

  • the button is labeled check all geonetwork datadir configs now, obvious copypaste
  • the presentation of the results could be vastly improved by using a bootstrap table, for consistency with the other pages
  • i don't think there's a point in repeating Folder is useless & the full path to the datadir over and over; eg we could only store the internal md id, and display the full datadir path at the top of the page only once, with its total size ?

```
Folder is useless '/data/webapps/geonetwork/data/metadata_data/98119700-98119799/98119718' with size ' 8.2 kB '
Folder is useless '/data/webapps/geonetwork/data/metadata_data/185357700-185357799/185357767' with size ' 87.9 kB '
Folder is useless '/data/webapps/geonetwork/data/metadata_data/185357700-185357799/185357773' with size ' 91.2 kB '
Folder is useless '/data/webapps/geonetwork/data/metadata_data/184383400-184383499/184383480' with size ' 44.0 MB '
In total '275.3 MB' could be saved on '1.3 GB'
```

  • why the quotes/spaces around the sizes ?
  • the display of the card on the homepage works fine (eg 380 errors found in red) but it would be nice to have the amount of entries checked, like the others
  • another thing that bothers me: you fetch the md list from the database at GeonetworkDatadirChecker object creation, which afaict only happens once at gaia startup. If some md is added/deleted, i think that won't be taken into account, so maybe the self.allmetadatas = self.session().query(Metadata).all() line should be moved to the get_meta_list implem (and the class member removed, as now useless...)
  • at some point in the history of geonetwork, there used to be a files table that referenced all those files in the datadir, but now it seems it's empty, and that info is partially available in the metadatafileuploads table (filename & metadataid). afaict your check mostly checks that the currently iterated folder id exists in the metadata table, right ?
  • if so, i fear that it will only detect leftover folders from removed metadatas, but won't detect files that are present in the private/public subdirs but not actually referenced by the metadata..
  • to be fully complete, i think all files found in subfolders should be checked as present in the metadata xml, eg look for <FQDN>/geonetwork/srv/api/records/<uuid>/attachments/<file> in the gmd:graphicOverview & gmd:CI_OnlineResource xml tags ? and/or in the metadatafileuploads table ?
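(To make the semantics discussed in this review concrete: the folder-level check boils down to comparing per-record folder names in each bucket against the ids present in the metadata table. A self-contained sketch, with the datadir layout assumed from the paths shown above and all names hypothetical:)

```python
from pathlib import Path

def find_unused_md_folders(metadata_data_dir, known_md_ids):
    # metadata_data is laid out as <bucket>/<md_id>, e.g. 00100-00199/154.
    # A per-record folder whose id no longer exists in the metadata table
    # is a leftover that GeoNetwork forgot to delete.
    unused = []
    for bucket in Path(metadata_data_dir).iterdir():
        if not bucket.is_dir():
            continue
        for md_folder in bucket.iterdir():
            if md_folder.is_dir() and md_folder.name not in known_md_ids:
                unused.append(md_folder)
    return unused
```

As the review notes, this only catches leftover folders from removed metadatas; files inside the private/public subdirs of a still-existing record would need a separate check against the record's XML or the metadatafileuploads table.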

jeanmi151 (Collaborator, Author) commented

> • the button is labeled check all geonetwork datadir configs now, obvious copypaste

okay

> • the presentation of the results could be vastly improved by using a bootstrap table, for consistency with other pages

I will try to do such a thing but I am not super confident with front-end stuff

> • i don't think there's a point in repeating over and over Folder is useless & the full path to the datadir, eg we could only store the internal md id, and display the

okay

> • why the quotes/spaces around the sizes ?

I removed them

> • the display of the card on the homepage works fine (eg `380 errors found` in red) but it could be nice to have the amount of entries checked like others

Like the total amount of folders ?

> • another thing that bothers me, you fetch the md list from the database at `GeonetworkDatadirChecker` object creation, which afaict only happens once at gaia startup. If some md is added/deleted, i think that won't be taken into account, so maybe the `self.allmetadatas = self.session().query(Metadata).all()` line should be moved to the `get_meta_list` implem (and the class member removed, as now useless...)

Okay, I will correct that

> • at some point in the history of geonetwork, there used to be a `files` table that referenced all those files in the datadir, but now it seems its empty, and that info is partially available in the `metadatafileuploads` table (`filename` & `metadataid`). afaict your check mostly checks that the currently iterated folder id exists in the `metadata` table, right ?

The metadatafileuploads table is not complete enough; yes, we list the metadata and the folders and check for correspondence

> • if so, i fear that it will only detect _leftover folders from removed metadatas_, but wont detect files that are present in the `private`/`public` subdirs, but not actually referenced by the metadata..

"it will only detect leftover folders from removed metadatas" --> yes, that is what we are aiming for here
"the private/public subdirs, but not actually referenced by the metadata.." --> right, but this is much more complex; I was trying to start small on this one

> • to be fully complete, i think all files found in subfolders should be checked as present in the metadata xml , eg look for `<FQDN>/geonetwork/srv/api/records/<uuid>/attachments/<file>` in the `gmd:graphicOverview` & `gmd:CI_OnlineResource` xml tags ? and/or in the `metadatafileuploads` table ?

We will keep that for future improvements I think, but it is a good idea
