Performance issues on small dataset: any recommended configuration or tuning options? #3600

ddvlanck · 2025-11-18T15:26:47Z


Hi! We are currently evaluating Apache Jena Fuseki by loading a dataset and running a set of SPARQL queries against it.
However, I’m running into performance issues that seem unexpected for the dataset size.

⚠️ Issue

We are testing with a dataset containing ~16 million triples, and we’ve configured a 60-second timeout for query responses in our experiment setup.
All SELECT queries—ranging from those expected to return around 10 rows up to those expected to return ~1M rows—hit the timeout limit. Even the queries with very small expected result sets fail to return within 60 seconds.
The system running Apache Jena Fuseki has 128 GB RAM available, so hardware limitations don’t appear to be the cause.

🧪 Setup

Fuseki version: created a local Docker build using Apache Jena 5.6.0
Deployment: Docker
Dataset size: 16,127,232 triples
Query type: SELECT
Operating system: Ubuntu 24.04.3 LTS

We updated the JAVA_OPTIONS to -Xms90g -Xmx90g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -XX:+AlwaysPreTouch -XX:+UseStringDeduplication

💡 What we’re looking for

We’d like to understand whether there are configuration options, tuning parameters, or indexing strategies we should apply to improve query performance.
Any guidance on:

Recommended settings for larger datasets
Known performance limitations in Fuseki
Configuration flags that influence query planning or execution

…would be greatly appreciated.

Thanks in advance!

rvesse (Collaborator) · 2025-11-19T10:21:20Z

> We updated the JAVA_OPTIONS to -Xms90g -Xmx90g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -XX:+AlwaysPreTouch -XX:+UseStringDeduplication

Firstly, don't set the heap size that high. Most of the memory usage for TDB2 (assuming that is your database backend) is off-heap via memory-mapped files, so you want to leave plenty of memory for the OS to use for that.

I wouldn't typically go higher than 8 GB of heap at your scale, and even that is likely overkill unless your queries are very memory intensive.

https://jena.apache.org/documentation/tdb/faqs.html#java-heap
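
For illustration, a reduced-heap setting along those lines might look like the sketch below. It assumes the container reads the JAVA_OPTIONS environment variable, as in the original post. The 8 GB value is the upper bound mentioned above rather than a measured recommendation, and the choice of which other flags to keep is an assumption.

```shell
# Sketch only: cap the JVM heap at 8 GB so that most of the machine's RAM
# stays available to the OS page cache, which backs TDB2's memory-mapped files.
# The exact value is an assumption; tune it against your own workload.
export JAVA_OPTIONS="-Xmx8g -XX:+UseG1GC"
```

How the variable reaches the JVM depends on the Docker build, for example via an environment entry in the container configuration.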

> Dataset size: 16,127,232 triples

Performance is often data driven, so if you can share your dataset, that's great. If not (e.g. it's commercially sensitive), can you generate and share a sample dataset that shows the same performance problems with your queries?

> Query type: SELECT

That's insufficient information for us to help you. SELECT is the main query type in SPARQL and has many possible clauses, so without seeing some sample queries we can't tell whether your queries are the problem.

> Configuration flags that influence query planning or execution

There's some documentation on TDB optimisation at https://jena.apache.org/documentation/tdb/optimizer.html

The default query optimiser code is at https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/OptimizerStd.java, where you can see the various ARQ configuration symbols that turn specific optimisations on and off.

You may also find it helpful to use the qparse command from ARQ to compare the unoptimised and optimised algebras, to see whether ARQ is actually able to optimise your queries at all.


rvesse (Collaborator) · 2025-11-19T10:24:28Z

You can also view the various algebra forms of your queries via our online SPARQL Query Validator.


ddvlanck (Author) · 2025-11-20T19:52:50Z

Hi! Thanks a lot for the detailed explanation.

That’s very helpful — we’ll definitely lower the heap size accordingly to leave more room for off-heap memory. As for the queries, they all follow the same general structure:

PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX tm:   <https://movias.eu/ns/trafficMeasurements#>
PREFIX time: <https://movias.eu/ns/time#>

SELECT ?ts ?val 
FROM {{GRAPH}}
WHERE {
  ?obs a tm:TrafficMeasurement ;
       tm:TrafficMeasurement.measureLocation ?measureLocationId ;
       tm:TrafficMeasurement.result ?val ;
       tm:TrafficMeasurement.featureOfInterest ?featureOfInterest ;
       tm:TrafficMeasurement.vehicleType ?vehicleType ;
       tm:TrafficMeasurement.time ?timeNode .

  ?timeNode time:Time.dateTime ?ts .

  FILTER (xsd:dateTime(?ts) >= "{{FROM}}"^^xsd:dateTime &&
          xsd:dateTime(?ts) <= "{{TO}}"^^xsd:dateTime)

  FILTER (?featureOfInterest = <https://data.vlaanderen.be/id/concept/VkmVerkeersKenmerkType/aantal>)
  FILTER (?vehicleType       = <https://data.vlaanderen.be/id/concept/VkmVoertuigType/fiets>)
}
ORDER BY ?ts

During testing, the query pattern stays the same — only the temporal range changes between runs, starting at 15 minutes and gradually expanding up to 10 years. I’ve also made a small sample dataset publicly available so you can reproduce the issue: https://github.com/ddvlanck/traffic-measurements-synthetic-data. These are some of the ranges for which the query failed:

2020-05-28T13:47:07 - 2020-05-28T14:02:07
2015-05-12T02:55:38 - 2015-05-12T04:55:38
2020-01-02T08:21:29 - 2021-01-02T08:21:29
2015-01-01T00:14:35 - 2025-01-01T00:14:35

I’ll take a closer look at the TDB optimiser documentation and use qparse to inspect the optimised algebra, as you suggested.
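
As a rough sketch of that qparse comparison, the script below fills the placeholders in the query for the first failing range and prints both algebra forms. The file names and the graph IRI `https://example.org/graph/traffic` are hypothetical. It assumes the Jena command-line tools are on the PATH and that `--query` and `--print=op` / `--print=opt` behave as described in the ARQ documentation, so check `qparse --help` for your Jena version.

```shell
# Write the query template exactly as posted, placeholders included.
cat > query.tmpl.rq <<'EOF'
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX tm:   <https://movias.eu/ns/trafficMeasurements#>
PREFIX time: <https://movias.eu/ns/time#>

SELECT ?ts ?val
FROM {{GRAPH}}
WHERE {
  ?obs a tm:TrafficMeasurement ;
       tm:TrafficMeasurement.measureLocation ?measureLocationId ;
       tm:TrafficMeasurement.result ?val ;
       tm:TrafficMeasurement.featureOfInterest ?featureOfInterest ;
       tm:TrafficMeasurement.vehicleType ?vehicleType ;
       tm:TrafficMeasurement.time ?timeNode .

  ?timeNode time:Time.dateTime ?ts .

  FILTER (xsd:dateTime(?ts) >= "{{FROM}}"^^xsd:dateTime &&
          xsd:dateTime(?ts) <= "{{TO}}"^^xsd:dateTime)

  FILTER (?featureOfInterest = <https://data.vlaanderen.be/id/concept/VkmVerkeersKenmerkType/aantal>)
  FILTER (?vehicleType       = <https://data.vlaanderen.be/id/concept/VkmVoertuigType/fiets>)
}
ORDER BY ?ts
EOF

# Substitute the placeholders. The graph IRI is a made-up example;
# the time range is the 15-minute window listed above.
sed -e 's|{{GRAPH}}|<https://example.org/graph/traffic>|' \
    -e 's|{{FROM}}|2020-05-28T13:47:07|' \
    -e 's|{{TO}}|2020-05-28T14:02:07|' \
    query.tmpl.rq > query.rq

# Print the algebra before and after optimisation, if qparse is installed.
if command -v qparse >/dev/null 2>&1; then
  qparse --query=query.rq --print=op  > algebra-unoptimised.txt
  qparse --query=query.rq --print=opt > algebra-optimised.txt
  # diff exits non-zero when the files differ, which is the interesting case.
  diff algebra-unoptimised.txt algebra-optimised.txt || true
else
  echo "qparse not found on PATH; only query.rq was generated" >&2
fi
```

If the two algebra files turn out identical, that would suggest the optimiser is not rewriting the query, which is the situation described in the reply above.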
