[!!!][TASK] Refactor indexing stack to unified TYPO3 core sub-requests #4559
Open
dkd-kaehm wants to merge 3 commits into TYPO3-Solr:main from
e8dbd61 to 0cb49b9
3a4a895 to 0f16557
sfroemkenjw
reviewed
Mar 11, 2026
7abcfa2 to a2a5d14
Replace the split indexing architecture (HTTP-based PageIndexer for pages,
direct DB Indexer with FrontendAwareEnvironment for records) with a unified
sub-request pipeline where both pages and records go through
Application::handle() with in-process TYPO3 frontend sub-requests.
New architecture:
- IndexingInstructions: immutable value object replacing PageIndexerRequest
- IndexingResultCollector: singleton bridge between middleware and caller
- IndexingService: orchestrates sub-requests via Application::handle()
- SolrIndexingMiddleware: unified middleware handling indexRecords,
indexPage, and findUserGroups actions
- RecordFieldMapper: exposes AbstractIndexer field mapping publicly
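As an illustration of the "immutable value object" idea behind IndexingInstructions, here is a minimal sketch. The class name, enum name, and the properties `itemUids` and `userGroup` are assumptions made for this example; only the three action names (indexRecords, indexPage, findUserGroups) come from the description above.

```php
<?php
declare(strict_types=1);

// Hypothetical enum listing the three actions named in the description.
enum IndexingActionSketch: string
{
    case IndexRecords = 'indexRecords';
    case IndexPage = 'indexPage';
    case FindUserGroups = 'findUserGroups';
}

// Hypothetical value object: all state is readonly, changes create a new instance.
final class IndexingInstructionsSketch
{
    /** @param list<int> $itemUids */
    public function __construct(
        public readonly IndexingActionSketch $action,
        public readonly array $itemUids = [],
        public readonly int $userGroup = 0,
    ) {}

    // "Wither": returns a copy with a different user group, leaving $this untouched.
    public function withUserGroup(int $userGroup): self
    {
        return new self($this->action, $this->itemUids, $userGroup);
    }
}

$base = new IndexingInstructionsSketch(IndexingActionSketch::IndexPage, [42]);
$forGroup = $base->withUserGroup(2);
```

Because the object cannot change after construction, it can be attached to a request and passed through the middleware stack without the risk of one sub-request altering the instructions of another.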
Database change:
- Added item_pid column to tx_solr_indexqueue_item for grouping records
by page (pages: item_pid=uid, records: item_pid=pid)
IndexService now groups items by item_pid and delegates to IndexingService.
UserGroupDetector supports dual activation (legacy + new request attribute).
All existing events are preserved in their correct firing order.
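The grouping by item_pid can be pictured with the following sketch. It assumes queue items are plain arrays with `uid`, `type` and `item_pid` keys; the function name and the array shape are hypothetical and not taken from the extension, which works with its own item objects.

```php
<?php
declare(strict_types=1);

/**
 * Groups index queue rows by their item_pid value (hypothetical helper).
 *
 * @param list<array{uid: int, type: string, item_pid: int}> $items
 * @return array<int, list<array{uid: int, type: string, item_pid: int}>>
 */
function groupItemsByItemPid(array $items): array
{
    $groups = [];
    foreach ($items as $item) {
        // Pages carry item_pid = uid, records carry item_pid = pid,
        // so a page and the records stored on it end up in the same group.
        $groups[$item['item_pid']][] = $item;
    }
    return $groups;
}

$grouped = groupItemsByItemPid([
    ['uid' => 10, 'type' => 'pages', 'item_pid' => 10],
    ['uid' => 77, 'type' => 'tx_news_domain_model_news', 'item_pid' => 10],
    ['uid' => 11, 'type' => 'pages', 'item_pid' => 11],
]);
```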
Additional fixes:
- PHP 8.5 compatibility: sub-request URI must have a host (idn_to_ascii('')
throws ValueError since PHP 8.5). Throws SolrIndexRuntimeException if
site has no fully qualified base URL.
- Removed deprecated PageIndexer::isPageIndexable() (marked for v14 removal)
- Extracted filterSolrConnectionsByPageVisibility() into PagesRepository
to eliminate duplicate code between PageIndexer and IndexingService
- Test-only IndexingServiceForTesting subclass handles testing-framework's
FrontendUserHandler middleware context requirement
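The host requirement from the PHP 8.5 compatibility fix could be checked along these lines. This is a sketch only: the function name and the exception stand-in are hypothetical, and the extension's actual check may look different.

```php
<?php
declare(strict_types=1);

// Stand-in for the extension's SolrIndexRuntimeException (assumption for this sketch).
final class SolrIndexRuntimeExceptionSketch extends \RuntimeException {}

/**
 * Returns the host of a site base URL or throws if the base has no host
 * (e.g. a site configured with base "/").
 */
function requireSiteHost(string $siteBase): string
{
    $host = parse_url($siteBase, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        throw new SolrIndexRuntimeExceptionSketch(
            'Site base "' . $siteBase . '" is not a fully qualified URL.'
        );
    }
    return $host;
}

$host = requireSiteHost('https://example.org/');
```

Failing early with a dedicated exception keeps the error message about the site configuration instead of surfacing a ValueError from deeper inside URI handling.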
## Sequence Diagram
```mermaid
sequenceDiagram
participant S as Scheduler
participant IS as IndexService
participant IX as IndexingService
participant App as FrontendApplication
participant MW as SolrIndexingMiddleware
participant UG as UserGroupDetector
participant Solr
S->>IS: indexItems(limit)
IS->>IS: getItemsToIndex()
IS->>IS: groupItemsByPid()
loop For each item
IS->>IS: dispatch(BeforeItemIsIndexedEvent)
IS->>IX: indexItems([item])
alt Page item
IX->>IX: getPageSolrConnections()
loop For each language
Note over IX,UG: Step 1: findUserGroups
IX->>App: handle(request + findUserGroups)
App->>MW: process(request, handler)
MW->>MW: handler.handle() — render page
UG-->>UG: bypass access, collect fe_groups
MW->>IX: JsonResponse {userGroups: [0,2]}
loop For each user group
Note over IX,Solr: Step 2: indexPage
IX->>App: handle(request + indexPage)
App->>MW: process(request, handler)
MW->>MW: handler.handle() — render page HTML
MW->>MW: Builder::fromPage() → Document
MW->>MW: dispatch page document events
MW->>MW: processDocuments()
MW->>Solr: addDocuments(docs)
Solr-->>MW: 200 OK
MW->>IX: JsonResponse {success: true}
end
end
else Record item
IX->>IX: getRecordSolrConnections()
IX->>IX: resolvePageUid() → item_pid or rootPageUid
loop For each language
IX->>App: handle(request + indexRecords)
App->>MW: process(request, handler)
Note over MW: SHORT-CIRCUIT — no page render
loop For each item in batch
MW->>MW: getFullItemRecord() + overlay
MW->>MW: Builder::fromRecord() → Document
MW->>MW: addDocumentFieldsFromTyposcript()
MW->>MW: dispatch record document events
end
MW->>MW: processDocuments()
MW->>Solr: addDocuments(docs)
Solr-->>MW: 200 OK
MW->>IX: JsonResponse {success: true}
end
end
IX-->>IS: true
IS->>IS: updateIndexTimeByItem()
IS->>IS: dispatch(AfterItemHasBeenIndexedEvent)
end
IS->>IS: dispatch(AfterItemsHaveBeenIndexedEvent)
IS->>Solr: commit()
IS-->>S: true
```
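The page branch of the diagram boils down to two nested loops. The sketch below mirrors that control flow with callables standing in for the two sub-requests; the function name and signatures are hypothetical and only illustrate the order of calls.

```php
<?php
declare(strict_types=1);

/**
 * @param list<int> $languageUids
 * @param callable(int): list<int> $findUserGroups stands in for the findUserGroups sub-request
 * @param callable(int, int): bool $indexPage      stands in for the indexPage sub-request
 */
function indexPageForAllLanguages(array $languageUids, callable $findUserGroups, callable $indexPage): bool
{
    $success = true;
    foreach ($languageUids as $languageUid) {
        // Step 1: one sub-request discovers which fe_groups appear on the page.
        foreach ($findUserGroups($languageUid) as $userGroup) {
            // Step 2: one sub-request per user group renders and indexes the page.
            $success = $indexPage($languageUid, $userGroup) && $success;
        }
    }
    return $success;
}

$calls = [];
$ok = indexPageForAllLanguages(
    [0, 1],
    static fn (int $languageUid): array => [0, 2],
    static function (int $languageUid, int $userGroup) use (&$calls): bool {
        $calls[] = [$languageUid, $userGroup];
        return true;
    }
);
```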
---
### Follow-up to rebasing on "Speed-Up Integration tests"
This commit required changes after rebasing onto:
* 867b19a (parallel Solr worker cores for paratest)
* 8c7a1a8 (run integration tests without processIsolation)

The following changes were required:
- Classes/Domain/Index/IndexService.php
  - Remove unused $httpHosts property
  - Replace incorrect @throws ConnectionException with accurate
    declarations: ContainerExceptionInterface, NotFoundExceptionInterface,
    InvalidConnectionException
  - Update use statements accordingly
- Classes/Exception/SolrIndexRuntimeException.php
  - Rename from RuntimeException to SolrIndexRuntimeException to prevent
    side-effects with PHP's built-in \RuntimeException
- Classes/IndexQueue/IndexingService.php
  - Update use statement and references to SolrIndexRuntimeException
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
a2a5d14 to 0454767
…onment

Sub-requests via Application::handle() run in CLI context where the working
directory is the project root, not the document root (public/). Third-party
code relying on relative paths (e.g. BootstrapPackage SCSS cache checks using
file_exists() with relative paths) fails silently, causing full SCSS
recompilation on every sub-request.

xHProf profiling revealed the impact:
- Before fix: 331s per indexing run (94% spent in SCSS recompilation)
- After fix: 10s per indexing run (95% faster)
- SCSS was compiled 4x per run (once per sub-request) instead of 0x

The fix adds chdir(Environment::getPublicPath()) around Application::handle()
in executeSubRequest(), restoring the original CWD afterwards via try/finally.
This ensures all sub-request code behaves identically to a real web request,
regardless of third-party extension implementation details.

With this generic fix, the legacy CliEnvironment class and the forcedWebRoot
scheduler task option become obsolete and are removed:
- Classes/System/Environment/CliEnvironment.php
- Classes/System/Environment/WebRootAllReadyDefinedException.php
- IndexQueueWorkerTask: CliEnvironment usage, forcedWebRoot property and
  related methods (getWebRoot, replaceWebRootMarkers)
- IndexQueueWorkerTaskAdditionalFieldProvider: forcedWebRoot form field
- locallang.xlf: forcedWebRoot label

An integration test verifies that CWD is correctly restored after sub-request
indexing and that indexing succeeds even when CWD does not match the public
directory.

Related: benjaminkott/bootstrap_package#1621
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xingInstructions

Remove the legacy HTTP-based PageIndexer system that has been fully replaced
by the unified sub-request pipeline (IndexingService + SolrIndexingMiddleware).
xHProf profiling confirmed zero active code paths.

Deleted legacy classes:
- IndexQueue/PageIndexer, PageIndexerRequest, PageIndexerResponse
- IndexQueue/PageIndexerRequestHandler, PageIndexerDataUrlModifier
- IndexQueue/FrontendHelper/Manager, FrontendHelper (interface)
- IndexQueue/FrontendHelper/PageIndexer (event listener)
- Middleware/PageIndexerInitialization

Redesigned UserGroupDetector:
- NEW: UserGroupDetectionMiddleware scopes findUserGroups activation via
  IndexingResultCollector (no Singleton, no $GLOBALS['TYPO3_REQUEST'])
- Removed SingletonInterface, TCA manipulation, manual state management
- Event listeners check ResultCollector.isUserGroupDetectionActive()
- Uses TcaSchemaFactory for fe_group field lookup (no $GLOBALS['TCA'] hack)
- IndexingResultCollector.finalizeUserGroups() deduplicates/sorts groups

Migrated to IndexingInstructions:
- FrontendGroupsModifier: solr.indexingInstructions instead of legacy
- AuthorizationService: solr.indexingInstructions attribute check
- SolrRoutingMiddleware: attribute check instead of X-Tx-Solr-Iq header
- DebugWriter: attribute check instead of header
- IntegrationTestBase: executePageIndexer() sets IndexingInstructions on
  InternalRequest, flowing through testing framework's Application::handle(),
  the same path as production code

Test migration:
- NEW: AccessProtectedContentTest (7 scenarios via IndexService)
- Migrated: FrontendHelper/PageIndexerTest (12 tests via executePageIndexer)

TODOs:
- Remove dead code from IndexingResultCollector (pageContent, itemResults,
  success, originalTca)
- Move UserGroupDetector + AuthorizationService out of FrontendHelper/
- Add unit tests for finalizeUserGroups()
- Profile new stack with xHProf

Fixes: TYPO3-Solr#4598
Relates: TYPO3-Solr#4046, TYPO3-Solr#4347, TYPO3-Solr#2724, TYPO3-Solr#2161, TYPO3-Solr#3541, TYPO3-Solr#4350, TYPO3-Solr#4321
Relates: TYPO3-Solr#3909, TYPO3-Solr#4007, TYPO3-Solr#2617, TYPO3-Solr#2493, TYPO3-Solr#2578, TYPO3-Solr#2696
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
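The commit message says finalizeUserGroups() deduplicates and sorts the collected groups. A minimal sketch of what that step could look like follows; the function name and the assumption that groups arrive as a mixed list of ints and numeric strings are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Normalizes collected fe_group values: cast to int, deduplicate, sort ascending.
 * Hypothetical stand-in for IndexingResultCollector::finalizeUserGroups().
 *
 * @param list<int|string> $collectedGroups
 * @return list<int>
 */
function finalizeUserGroupsSketch(array $collectedGroups): array
{
    $groups = array_map('intval', $collectedGroups);
    // array_unique keeps the first occurrence's key, so re-index afterwards.
    $groups = array_values(array_unique($groups));
    sort($groups, SORT_NUMERIC);
    return $groups;
}

$userGroups = finalizeUserGroupsSketch([2, 0, '2', 0]);
```

A stable, duplicate-free list means each user group triggers exactly one indexPage sub-request.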
c3f6163 to 98e596e
Don't merge! WIP removed to run actions.
## Profiling after "Sub-request CWD fix and CliEnvironment removal"
Profiling was done for 57 pages from solr-ddev-site with Introduction package data.
### Context
xHProf profiling of the indexing process revealed that 94% of indexing time was spent on SCSS recompilation by the BootstrapPackage.
The root cause: in CLI context, the working directory (CWD) is the project root (`/var/www/html/`), not the document root (`public/`).
Third-party code using `file_exists()` with relative paths (e.g. `typo3temp/assets/...`) failed silently, causing repeated recompilation.
The fix adds `chdir(Environment::getPublicPath())` around `Application::handle()` in `executeSubRequest()`, ensuring all sub-request code behaves identically to a real web request.
This makes the legacy `CliEnvironment` class and `forcedWebRoot` scheduler option obsolete.
Related: benjaminkott/bootstrap_package#1621
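The CWD handling described above can be sketched without TYPO3 as a small helper that switches the directory, runs a callable, and restores the previous directory in a `finally` block. The helper name `runInDirectory()` is hypothetical; in the PR the directory would be `Environment::getPublicPath()` and the callable would wrap `Application::handle()`.

```php
<?php
declare(strict_types=1);

/**
 * Runs $subRequest with $directory as CWD and restores the previous CWD
 * afterwards, even if the callable throws. Hypothetical helper for illustration.
 */
function runInDirectory(string $directory, callable $subRequest): mixed
{
    $originalCwd = getcwd();
    if ($originalCwd === false || !chdir($directory)) {
        throw new \RuntimeException('Could not switch working directory to ' . $directory);
    }
    try {
        return $subRequest();
    } finally {
        // Always restore the caller's CWD so later code is unaffected.
        chdir($originalCwd);
    }
}

$before = getcwd();
$seenInside = runInDirectory(sys_get_temp_dir(), static fn (): string => (string)getcwd());
```

Restoring in `finally` matters for long-running scheduler processes: an exception inside one sub-request must not leave the rest of the run with a changed working directory.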
### Single-page profiling: before vs after
`clearPageCaches`
### Full run profiling: 57 pages indexed
Overall: 493.9s (8.2 min) for 228 sub-requests (4 per page)
### Wall time breakdown (self time)
### Top self-time functions
- `mysqli_stmt::execute`
- `mysqli::prepare`
- `ListenerProvider::getListenersForEvent`
- `unserialize`
- `Doctrine\DBAL\SQL\Parser::parse`
- `GeneralUtility::makeInstance`
- `Fluid\StandardVariableProvider::getByPath`
### Memory analysis
Memory allocation by category (self, positive only):
Note: High allocation numbers (e.g. Database 3.6 GB) reflect total bytes allocated across all calls, not retained memory.
These are balanced by corresponding deallocations (`mysqli_stmt::close` frees 1.1 GB, `ContentObjectRenderer::setRequest` frees 3.1 GB).
Solr-specific memory is minimal (15 MB self-allocated); the indexing overhead is dominated by TYPO3 frontend rendering, which is expected since each page is fully rendered via sub-request.
### Key observations
- `getListenersForEvent` (~4,100 per sub-request) is significant; this is TYPO3 core behavior.
### Garbage collection between pages
Global reclaim ratio: 99.3%. Of 14,576 MB allocated (self, positive), 14,479 MB is freed.
However, there is a 1.36 MB net retention per page that accumulates across the indexing run:
`indexPageItem` net retained (57x)
Main cause: `FrontendTypoScriptFactory::createSetupConfigOrFullSetup` retains ~17 MB/call (3,810 MB inclusive over 228 calls).
The TypoScript AST is loaded from cache on each sub-request, but PHP-internal references accumulate and are not fully released.
Additional contributors: `SimpleFileBackend::require` (opcode cache holds references), `unserialize` (cache data deserialization).
This is TYPO3 core behavior, not Solr-specific. With the default `documentsToIndexLimit: 50`, this is not critical.
For very large sites (>500 pages per scheduler run), the limit should be kept in a reasonable range.