@lfoppiano lfoppiano commented May 30, 2025

@kermitt2 I've started working on the new figure and table extraction. Since I rebased your initial branch on master, I created a new branch so that the original is preserved in case I break something.

From now on I'll work in incremental branches. This first PR implements the SVG parsing and extraction.

There may be problems with images that extend beyond their actual zone, which I have not found a way to exclude (checking for transparency, etc. did not help in these edge cases; any idea is welcome 👍):

image

(For reference, this is the SVG of Figure 1):

image
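
One idea that might be worth trying for the overflowing images, as a minimal sketch only: clip each image bounding box to the zone it is expected to stay in, and drop images that do not overlap the zone at all. `ImageZoneClipper`, `clipToZone`, and the coordinates below are hypothetical and not part of GROBID. The sketch also assumes that both boxes are already expressed in the same coordinate space, which may not hold if the SVG applies transforms or clip paths.

```java
import java.awt.geom.Rectangle2D;
import java.util.Optional;

// Hypothetical helper, not a GROBID class.
final class ImageZoneClipper {

    private ImageZoneClipper() {}

    /**
     * Returns the part of imageBox that lies inside zone,
     * or an empty Optional when the two rectangles do not overlap.
     */
    static Optional<Rectangle2D> clipToZone(Rectangle2D imageBox, Rectangle2D zone) {
        // intersects() is false for empty rectangles and for boxes that only touch.
        if (!imageBox.intersects(zone)) {
            return Optional.empty();
        }
        return Optional.of(imageBox.createIntersection(zone));
    }

    public static void main(String[] args) {
        // Assumed values: a figure zone and an image that spills over its left and bottom edges.
        Rectangle2D zone = new Rectangle2D.Double(50, 100, 400, 300);
        Rectangle2D image = new Rectangle2D.Double(30, 150, 200, 400);
        System.out.println(clipToZone(image, zone));
    }
}
```

This only trims the geometry; whether the trimmed region still contains meaningful pixels would need a separate check.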

I'll try to post a few benchmarks for each new implementation so that we can track the progress.

Here are some other images (of correctly identified figures) 😄:

image image image

@lfoppiano lfoppiano changed the base branch from master to new-figure-table-models May 30, 2025 14:52
@lfoppiano lfoppiano changed the base branch from new-figure-table-models to master May 30, 2025 14:52
@lfoppiano lfoppiano marked this pull request as draft May 30, 2025 14:53
@lfoppiano lfoppiano force-pushed the new-figure-table-models2 branch from 45da2f2 to d23c92f Compare May 31, 2025 05:25
# Conflicts:
#	grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java
#	grobid-home/models/fulltext/model.wapiti
#	grobid-service/src/main/java/org/grobid/service/GrobidPaths.java
#	grobid-service/src/main/java/org/grobid/service/GrobidRestService.java
#	grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessFiles.java

Check warning

Code scanning / CodeQL

Information exposure through an error message (Medium)

Error information can be exposed to an external user.

Copilot Autofix

AI 3 months ago

To fix the information exposure issue, exceptions thrown from restProcessFiles.getFigures(inputStream) must not result in sensitive error messages reaching the REST client. The suggested approach is to wrap this call in a try-catch block, log the exception details on the server (using the SLF4J logger, already imported as logger), and return only a generic error message to the client with an appropriate HTTP status code (such as 500 Internal Server Error). Only the relevant method in GrobidRestService.java is edited; no changes to the rest of the code are necessary.


Suggested changeset 1
grobid-service/src/main/java/org/grobid/service/GrobidRestService.java

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/grobid-service/src/main/java/org/grobid/service/GrobidRestService.java b/grobid-service/src/main/java/org/grobid/service/GrobidRestService.java
--- a/grobid-service/src/main/java/org/grobid/service/GrobidRestService.java
+++ b/grobid-service/src/main/java/org/grobid/service/GrobidRestService.java
@@ -860,8 +860,16 @@
     @Produces(MediaType.APPLICATION_XML)
     @POST
     public Response getFiguresAndTables(
-        @FormDataParam(INPUT) InputStream inputStream) throws Exception {
-        return restProcessFiles.getFigures(inputStream);
+        @FormDataParam(INPUT) InputStream inputStream) {
+        try {
+            return restProcessFiles.getFigures(inputStream);
+        } catch (Exception ex) {
+            logger.error("Exception occurred while extracting figures and tables.", ex);
+            return Response.status(Response.Status.INTERNAL_SERVER_ERROR)
+                           .entity("An internal server error occurred while processing the request.")
+                           .type(MediaType.TEXT_PLAIN)
+                           .build();
+        }
     }
 
     @Path(PATH_CREATE_TRAINING)
EOF
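
The pattern used by the patch can also be tried in isolation. The following is a minimal, framework-free sketch, not the GROBID code: `SafeErrorResponse`, `Result`, and `handle` are hypothetical names, `Result` is a stand-in for a JAX-RS `Response`, and `java.util.logging` replaces SLF4J so the example has no dependencies.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of "log details server-side, return a generic message".
final class SafeErrorResponse {

    private static final Logger LOGGER = Logger.getLogger(SafeErrorResponse.class.getName());

    static final String GENERIC_MESSAGE =
            "An internal server error occurred while processing the request.";

    /** Minimal stand-in for an HTTP response: a status code and a body. */
    static final class Result {
        final int status;
        final String body;

        Result(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    private SafeErrorResponse() {}

    /** Runs the action; on failure, logs the exception and hides its details from the caller. */
    static Result handle(Callable<String> action) {
        try {
            return new Result(200, action.call());
        } catch (Exception ex) {
            // Full details (message and stack trace) stay in the server log only.
            LOGGER.log(Level.SEVERE, "Exception occurred while extracting figures and tables.", ex);
            return new Result(500, GENERIC_MESSAGE);
        }
    }

    public static void main(String[] args) {
        Result failed = handle(() -> {
            throw new IllegalStateException("could not read /internal/path/model");
        });
        System.out.println(failed.status + " " + failed.body);
    }
}
```

The key property is that the response body is a fixed string, so nothing from the exception message or stack trace can leak to the client.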