
[!!!][TASK] Refactor indexing stack to unified TYPO3 core sub-requests #4559

Open
dkd-kaehm wants to merge 3 commits into TYPO3-Solr:main from dkd-kaehm:task/refactor_indexing_stack

Collaborator

@dkd-kaehm dkd-kaehm commented Mar 5, 2026

Do not merge yet! The WIP marker was removed from the title only so that the CI actions run.

Replace the split indexing architecture (HTTP-based PageIndexer for pages, direct DB Indexer with FrontendAwareEnvironment for records) with a unified sub-request pipeline where both pages and records go through Application::handle() with in-process TYPO3 frontend sub-requests.

New architecture:

  • IndexingInstructions: immutable value object replacing PageIndexerRequest
  • IndexingResultCollector: singleton bridge between middleware and caller
  • IndexingService: orchestrates sub-requests via Application::handle()
  • SolrIndexingMiddleware: unified middleware handling indexRecords, indexPage, and findUserGroups actions
  • RecordFieldMapper: exposes AbstractIndexer field mapping publicly

Database change:

  • Added item_pid column to tx_solr_indexqueue_item for grouping records by page (pages: item_pid=uid, records: item_pid=pid)

IndexService now groups items by item_pid and delegates to IndexingService. UserGroupDetector supports dual activation (legacy + new request attribute). All existing events are preserved in their correct firing order.
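
The grouping rule can be illustrated with a minimal, language-neutral sketch (written in Python, not the extension's PHP). `QueueItem`, `item_pid()` and `group_items_by_pid()` are hypothetical names, and the news table is only an example record type. The one thing taken from the description is the rule itself: pages use item_pid=uid, records use item_pid=pid.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueItem:
    """Hypothetical stand-in for a row of tx_solr_indexqueue_item."""
    item_type: str  # table name, e.g. "pages" or a record table
    uid: int        # uid of the page or record
    pid: int        # page the page/record is stored on


def item_pid(item: QueueItem) -> int:
    # Rule from the PR description: pages use their own uid, records use their pid.
    return item.uid if item.item_type == "pages" else item.pid


def group_items_by_pid(items: list[QueueItem]) -> dict[int, list[QueueItem]]:
    # Collect all queue items that share the same item_pid, keeping queue order.
    groups: dict[int, list[QueueItem]] = defaultdict(list)
    for item in items:
        groups[item_pid(item)].append(item)
    return dict(groups)


items = [
    QueueItem("pages", uid=10, pid=1),
    QueueItem("tx_news_domain_model_news", uid=7, pid=10),
    QueueItem("tx_news_domain_model_news", uid=8, pid=10),
    QueueItem("pages", uid=11, pid=1),
]
groups = group_items_by_pid(items)
print({pid: [i.uid for i in grouped] for pid, grouped in groups.items()})
# prints {10: [10, 7, 8], 11: [11]}
```

In this sketch, page 10 and the two records stored on it end up in the same group, so they could share one frontend context for that page.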


```mermaid
sequenceDiagram
    participant S as Scheduler
    participant IS as IndexService
    participant IX as IndexingService
    participant App as FrontendApplication
    participant MW as SolrIndexingMiddleware
    participant UG as UserGroupDetector
    participant Solr

    S->>IS: indexItems(limit)
    IS->>IS: getItemsToIndex()
    IS->>IS: groupItemsByPid()

    loop For each item
        IS->>IS: dispatch(BeforeItemIsIndexedEvent)
        IS->>IX: indexItems([item])

        alt Page item
            IX->>IX: getPageSolrConnections()
            loop For each language
                Note over IX,UG: Step 1: findUserGroups
                IX->>App: handle(request + findUserGroups)
                App->>MW: process(request, handler)
                MW->>MW: handler.handle() — render page
                UG-->>UG: bypass access, collect fe_groups
                MW->>IX: JsonResponse {userGroups: [0,2]}

                loop For each user group
                    Note over IX,Solr: Step 2: indexPage
                    IX->>App: handle(request + indexPage)
                    App->>MW: process(request, handler)
                    MW->>MW: handler.handle() — render page HTML
                    MW->>MW: Builder::fromPage() → Document
                    MW->>MW: dispatch page document events
                    MW->>MW: processDocuments()
                    MW->>Solr: addDocuments(docs)
                    Solr-->>MW: 200 OK
                    MW->>IX: JsonResponse {success: true}
                end
            end

        else Record item
            IX->>IX: getRecordSolrConnections()
            IX->>IX: resolvePageUid() → item_pid or rootPageUid
            loop For each language
                IX->>App: handle(request + indexRecords)
                App->>MW: process(request, handler)
                Note over MW: SHORT-CIRCUIT — no page render
                loop For each item in batch
                    MW->>MW: getFullItemRecord() + overlay
                    MW->>MW: Builder::fromRecord() → Document
                    MW->>MW: addDocumentFieldsFromTyposcript()
                    MW->>MW: dispatch record document events
                end
                MW->>MW: processDocuments()
                MW->>Solr: addDocuments(docs)
                Solr-->>MW: 200 OK
                MW->>IX: JsonResponse {success: true}
            end
        end

        IX-->>IS: true
        IS->>IS: updateIndexTimeByItem()
        IS->>IS: dispatch(AfterItemHasBeenIndexedEvent)
    end

    IS->>IS: dispatch(AfterItemsHaveBeenIndexedEvent)
    IS->>Solr: commit()
    IS-->>S: true
```

Profiling after "Sub-request CWD fix and CliEnvironment removal"

The profiling run covered 57 pages from solr-ddev-site, using the Introduction package data.

Context

xHProf profiling of the indexing process revealed that 94% of indexing time was spent on SCSS recompilation by the BootstrapPackage.
The root cause: in CLI context, the working directory (CWD) is the project root (/var/www/html/), not the document root (public/).
Third-party code using file_exists() with relative paths (e.g. typo3temp/assets/...) failed silently, causing repeated recompilation.

The fix adds chdir(Environment::getPublicPath()) around Application::handle() in executeSubRequest(),
ensuring all sub-request code behaves identically to a real web request.
This makes the legacy CliEnvironment class and forcedWebRoot scheduler option obsolete.
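
The pattern behind the fix can be sketched in a language-neutral way (Python here, not the PHP of the actual change). `working_directory()` and `execute_sub_request()` are hypothetical helpers, and `handle` merely stands in for Application::handle(). The sketch shows how a relative-path existence check fails from the project root but succeeds once the working directory is the public directory, and that the previous directory is restored afterwards.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path: str):
    """Temporarily switch the CWD and always restore the previous one."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Restore the CWD even if the wrapped call raises.
        os.chdir(previous)


def execute_sub_request(public_path: str, handle):
    # `handle` is a hypothetical callable standing in for Application::handle().
    with working_directory(public_path):
        return handle()


# Demo layout: <project_root>/public/typo3temp/assets
project_root = tempfile.mkdtemp()
public_path = os.path.join(project_root, "public")
os.makedirs(os.path.join(public_path, "typo3temp", "assets"))

os.chdir(project_root)  # CLI-like situation: CWD is the project root
cwd_before = os.getcwd()
seen_outside = os.path.exists("typo3temp/assets")
seen_inside = execute_sub_request(
    public_path, lambda: os.path.exists("typo3temp/assets")
)
cwd_after = os.getcwd()
print(seen_outside, seen_inside, cwd_before == cwd_after)
# prints False True True
```

Wrapping the call instead of changing the directory once for the whole process keeps the effect local to the sub-request, so the rest of the CLI run sees its original working directory.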

Related: benjaminkott/bootstrap_package#1621

Single-page profiling: before vs after

| Metric | Before fix | After fix | Improvement |
|---|---|---|---|
| Total wall time | 331.5s | 10.3s | 97% faster |
| SCSS compilation | 312.4s (4x compiled) | 0.0s (cached) | eliminated |
| clearPageCaches | 2.6s (4 flushes) | 0.0s | eliminated |
| Sub-request total | 330.3s | 12.7s | -317.6s |

Full run profiling: 57 pages indexed

Overall: 493.9s (8.2 min) for 228 sub-requests (4 per page)

Wall time breakdown (self time)

| Category | Time | Share |
|---|---|---|
| Fluid/Template rendering | 104.5s | 21% |
| Database (MySQL) | 79.9s | 16% |
| TypoScript processing | 31.8s | 6% |
| Solr HTTP requests | 1.3s | <1% |
| SCSS compilation | 0.0s | 0% |
| Other (PHP builtins, EventDispatcher, DI) | ~277s | 56% |

Top self-time functions

| Function | Self time | Calls |
|---|---|---|
| mysqli_stmt::execute | 17.6s | 45,177 |
| mysqli::prepare | 13.2s | 45,114 |
| ListenerProvider::getListenersForEvent | 9.5s | 935,410 |
| unserialize | 8.7s | 7,139 |
| Doctrine\DBAL\SQL\Parser::parse | 8.6s | 42,081 |
| GeneralUtility::makeInstance | 6.9s | 809,575 |
| Fluid\StandardVariableProvider::getByPath | 6.6s | 546,788 |

Memory analysis

| Metric | Value |
|---|---|
| Total allocated (inclusive) | 97.4 MB |
| Peak memory | 209.9 MB |

Memory allocation by category (self, positive only):

| Category | Allocated |
|---|---|
| Database/Doctrine | 3654 MB (churned, not retained) |
| Fluid/Rendering | 851 MB |
| TypoScript | 763 MB |
| DI/Container | 41 MB |
| Cache | 32 MB |
| Solr | 15 MB |
| SCSS | 0 MB |

Note: High allocation numbers (e.g. Database 3.6 GB) reflect total bytes allocated across all calls, not retained memory.
These are balanced by corresponding deallocations (mysqli_stmt::close frees 1.1 GB, ContentObjectRenderer::setRequest frees 3.1 GB).

Solr-specific memory is minimal (15 MB self-allocated) — the indexing overhead is dominated by TYPO3 frontend rendering,
which is expected since each page is fully rendered via sub-request.

Key observations

  • Solr indexing logic is efficient: Solr HTTP requests take only 1.3s for 123 calls. Document creation and queue management are negligible.
  • EventDispatcher overhead: 935,410 calls to getListenersForEvent (~4,100 per sub-request) is significant — this is TYPO3 core behavior.
  • Database queries: ~45,000 prepared statements for 228 sub-requests (~198 queries per sub-request) — again TYPO3 core frontend rendering.
  • Sub-request average: 2.2s per sub-request, dominated by full page rendering.
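
The per-sub-request averages quoted above follow directly from the profiling totals. A short check of the arithmetic, using only the figures reported in this description (the variable names are illustrative):

```python
sub_requests = 57 * 4          # 57 pages, 4 sub-requests per page
total_wall_time_s = 493.9      # overall wall time of the full run
listener_lookups = 935_410     # ListenerProvider::getListenersForEvent calls
statements_executed = 45_177   # mysqli_stmt::execute calls

avg_wall_time = total_wall_time_s / sub_requests
avg_lookups = listener_lookups / sub_requests
avg_statements = statements_executed / sub_requests

print(sub_requests, round(avg_wall_time, 1), round(avg_lookups), round(avg_statements))
# prints 228 2.2 4103 198
```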

Garbage collection between pages

Global reclaim ratio: 99.3% — of 14,576 MB allocated (self, positive), 14,479 MB is freed.

However, there is a 1.36 MB net retention per page that accumulates across the indexing run:

| Metric | Value |
|---|---|
| indexPageItem net retained (57x) | 77.8 MB |
| Per page | 1.36 MB |
| Projection for 100 pages | ~136 MB |
| Projection for 500 pages | ~682 MB |
| Projection for 1000 pages | ~1,365 MB |
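
The projections are a linear extrapolation of the measured retention. A minimal sketch of that calculation follows, assuming retention keeps growing linearly with the number of pages, which the 57-page run suggests but does not prove. `projected_retention_mb()` is a hypothetical helper.

```python
net_retained_mb = 77.8   # indexPageItem net retained over the profiled run
pages_profiled = 57

per_page_mb = net_retained_mb / pages_profiled


def projected_retention_mb(pages: int) -> float:
    # Assumption: retention scales linearly with the number of indexed pages.
    return per_page_mb * pages


for pages in (50, 100, 500, 1000):
    print(pages, round(projected_retention_mb(pages)))
# prints:
# 50 68
# 100 136
# 500 682
# 1000 1365
```

The 50-page row corresponds to one scheduler run with the default documentsToIndexLimit.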

Main cause: FrontendTypoScriptFactory::createSetupConfigOrFullSetup retains ~17 MB/call (3,810 MB inclusive over 228 calls).
The TypoScript AST is loaded from cache on each sub-request, but PHP-internal references accumulate and are not fully released.
Additional contributors: SimpleFileBackend::require (opcode cache holds references), unserialize (cache data deserialization).

This is TYPO3 core behavior, not Solr-specific. With the default documentsToIndexLimit of 50, it is not critical.
For very large sites (more than 500 pages per scheduler run), keep the limit in a reasonable range.

@dkd-kaehm dkd-kaehm changed the title from "WIP !!! [TASK] Refactor indexing stack to unified TYPO3 core sub-requests" to "!!! [TASK] Refactor indexing stack to unified TYPO3 core sub-requests" Mar 5, 2026
@dkd-kaehm dkd-kaehm force-pushed the task/refactor_indexing_stack branch 2 times, most recently from e8dbd61 to 0cb49b9 Compare March 5, 2026 16:48
@dkd-kaehm dkd-kaehm changed the title from "!!! [TASK] Refactor indexing stack to unified TYPO3 core sub-requests" to "[!!!][TASK] Refactor indexing stack to unified TYPO3 core sub-requests" Mar 5, 2026
@dkd-kaehm dkd-kaehm force-pushed the task/refactor_indexing_stack branch 6 times, most recently from 3a4a895 to 0f16557 Compare March 9, 2026 11:58
@dkd-kaehm dkd-kaehm force-pushed the task/refactor_indexing_stack branch 4 times, most recently from 7abcfa2 to a2a5d14 Compare March 22, 2026 19:46
Replace the split indexing architecture (HTTP-based PageIndexer for pages,
direct DB Indexer with FrontendAwareEnvironment for records) with a unified
sub-request pipeline where both pages and records go through
Application::handle() with in-process TYPO3 frontend sub-requests.

New architecture:
- IndexingInstructions: immutable value object replacing PageIndexerRequest
- IndexingResultCollector: singleton bridge between middleware and caller
- IndexingService: orchestrates sub-requests via Application::handle()
- SolrIndexingMiddleware: unified middleware handling indexRecords,
  indexPage, and findUserGroups actions
- RecordFieldMapper: exposes AbstractIndexer field mapping publicly

Database change:
- Added item_pid column to tx_solr_indexqueue_item for grouping records
  by page (pages: item_pid=uid, records: item_pid=pid)

IndexService now groups items by item_pid and delegates to IndexingService.
UserGroupDetector supports dual activation (legacy + new request attribute).
All existing events are preserved in their correct firing order.

Additional fixes:
- PHP 8.5 compatibility: sub-request URI must have a host (idn_to_ascii('')
  throws ValueError since PHP 8.5). Throws SolrIndexRuntimeException if
  site has no fully qualified base URL.
- Removed deprecated PageIndexer::isPageIndexable() (marked for v14 removal)
- Extracted filterSolrConnectionsByPageVisibility() into PagesRepository
  to eliminate duplicate code between PageIndexer and IndexingService
- Test-only IndexingServiceForTesting subclass handles testing-framework's
  FrontendUserHandler middleware context requirement

## Sequence Diagram

```mermaid
sequenceDiagram
    participant S as Scheduler
    participant IS as IndexService
    participant IX as IndexingService
    participant App as FrontendApplication
    participant MW as SolrIndexingMiddleware
    participant UG as UserGroupDetector
    participant Solr

    S->>IS: indexItems(limit)
    IS->>IS: getItemsToIndex()
    IS->>IS: groupItemsByPid()

    loop For each item
        IS->>IS: dispatch(BeforeItemIsIndexedEvent)
        IS->>IX: indexItems([item])

        alt Page item
            IX->>IX: getPageSolrConnections()
            loop For each language
                Note over IX,UG: Step 1: findUserGroups
                IX->>App: handle(request + findUserGroups)
                App->>MW: process(request, handler)
                MW->>MW: handler.handle() — render page
                UG-->>UG: bypass access, collect fe_groups
                MW->>IX: JsonResponse {userGroups: [0,2]}

                loop For each user group
                    Note over IX,Solr: Step 2: indexPage
                    IX->>App: handle(request + indexPage)
                    App->>MW: process(request, handler)
                    MW->>MW: handler.handle() — render page HTML
                    MW->>MW: Builder::fromPage() → Document
                    MW->>MW: dispatch page document events
                    MW->>MW: processDocuments()
                    MW->>Solr: addDocuments(docs)
                    Solr-->>MW: 200 OK
                    MW->>IX: JsonResponse {success: true}
                end
            end

        else Record item
            IX->>IX: getRecordSolrConnections()
            IX->>IX: resolvePageUid() → item_pid or rootPageUid
            loop For each language
                IX->>App: handle(request + indexRecords)
                App->>MW: process(request, handler)
                Note over MW: SHORT-CIRCUIT — no page render
                loop For each item in batch
                    MW->>MW: getFullItemRecord() + overlay
                    MW->>MW: Builder::fromRecord() → Document
                    MW->>MW: addDocumentFieldsFromTyposcript()
                    MW->>MW: dispatch record document events
                end
                MW->>MW: processDocuments()
                MW->>Solr: addDocuments(docs)
                Solr-->>MW: 200 OK
                MW->>IX: JsonResponse {success: true}
            end
        end

        IX-->>IS: true
        IS->>IS: updateIndexTimeByItem()
        IS->>IS: dispatch(AfterItemHasBeenIndexedEvent)
    end

    IS->>IS: dispatch(AfterItemsHaveBeenIndexedEvent)
    IS->>Solr: commit()
    IS-->>S: true
```

---

### Follow-up to rebasing on "Speed-Up Integration tests"

This commit required changes after rebasing onto the following state:
* 867b19a (parallel Solr worker cores for paratest)
* 8c7a1a8 (run integration tests without processIsolation).

The following changes were required:
- Classes/Domain/Index/IndexService.php
  - Remove unused $httpHosts property
  - Replace incorrect @throws ConnectionException with accurate
    declarations: ContainerExceptionInterface, NotFoundExceptionInterface,
    InvalidConnectionException
  - Update use statements accordingly
- Classes/Exception/SolrIndexRuntimeException.php
  - Rename from RuntimeException to SolrIndexRuntimeException to prevent
    side-effects with PHP's built-in \RuntimeException
- Classes/IndexQueue/IndexingService.php
  - Update use statement and references to SolrIndexRuntimeException

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dkd-kaehm dkd-kaehm force-pushed the task/refactor_indexing_stack branch from a2a5d14 to 0454767 Compare March 26, 2026 11:52
dkd-kaehm and others added 2 commits March 26, 2026 18:46
…onment

Sub-requests via Application::handle() run in CLI context where the
working directory is the project root, not the document root (public/).
Third-party code relying on relative paths (e.g. BootstrapPackage SCSS
cache checks using file_exists() with relative paths) fails silently,
causing full SCSS recompilation on every sub-request.

xHProf profiling revealed the impact:
- Before fix: 331s per indexing run (94% spent in SCSS recompilation)
- After fix: 10s per indexing run (97% faster)
- SCSS was compiled 4x per run (once per sub-request) instead of 0x

The fix adds chdir(Environment::getPublicPath()) around
Application::handle() in executeSubRequest(), restoring the original
CWD afterwards via try/finally. This ensures all sub-request code
behaves identically to a real web request, regardless of third-party
extension implementation details.

With this generic fix, the legacy CliEnvironment class and the
forcedWebRoot scheduler task option become obsolete and are removed:
- Classes/System/Environment/CliEnvironment.php
- Classes/System/Environment/WebRootAllReadyDefinedException.php
- IndexQueueWorkerTask: CliEnvironment usage, forcedWebRoot property
  and related methods (getWebRoot, replaceWebRootMarkers)
- IndexQueueWorkerTaskAdditionalFieldProvider: forcedWebRoot form field
- locallang.xlf: forcedWebRoot label

An integration test verifies that CWD is correctly restored after
sub-request indexing and that indexing succeeds even when CWD does
not match the public directory.

Related: benjaminkott/bootstrap_package#1621

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xingInstructions

Remove the legacy HTTP-based PageIndexer system that has been fully
replaced by the unified sub-request pipeline (IndexingService +
SolrIndexingMiddleware). xHProf profiling confirmed zero active code paths.

Deleted legacy classes:
- IndexQueue/PageIndexer, PageIndexerRequest, PageIndexerResponse
- IndexQueue/PageIndexerRequestHandler, PageIndexerDataUrlModifier
- IndexQueue/FrontendHelper/Manager, FrontendHelper (interface)
- IndexQueue/FrontendHelper/PageIndexer (event listener)
- Middleware/PageIndexerInitialization

Redesigned UserGroupDetector:
- NEW: UserGroupDetectionMiddleware scopes findUserGroups activation
  via IndexingResultCollector (no Singleton, no $GLOBALS['TYPO3_REQUEST'])
- Removed SingletonInterface, TCA manipulation, manual state management
- Event listeners check ResultCollector.isUserGroupDetectionActive()
- Uses TcaSchemaFactory for fe_group field lookup (no $GLOBALS['TCA'] hack)
- IndexingResultCollector.finalizeUserGroups() deduplicates/sorts groups

Migrated to IndexingInstructions:
- FrontendGroupsModifier: solr.indexingInstructions instead of legacy
- AuthorizationService: solr.indexingInstructions attribute check
- SolrRoutingMiddleware: attribute check instead of X-Tx-Solr-Iq header
- DebugWriter: attribute check instead of header
- IntegrationTestBase: executePageIndexer() sets IndexingInstructions
  on InternalRequest, flowing through testing framework's
  Application::handle() — same path as production code

Test migration:
- NEW: AccessProtectedContentTest (7 scenarios via IndexService)
- Migrated: FrontendHelper/PageIndexerTest (12 tests via executePageIndexer)

TODOs:
- Remove dead code from IndexingResultCollector (pageContent, itemResults,
  success, originalTca)
- Move UserGroupDetector + AuthorizationService out of FrontendHelper/
- Add unit tests for finalizeUserGroups()
- Profile new stack with xHProf

Fixes: TYPO3-Solr#4598
Relates: TYPO3-Solr#4046, TYPO3-Solr#4347, TYPO3-Solr#2724, TYPO3-Solr#2161, TYPO3-Solr#3541, TYPO3-Solr#4350, TYPO3-Solr#4321
Relates: TYPO3-Solr#3909, TYPO3-Solr#4007, TYPO3-Solr#2617, TYPO3-Solr#2493, TYPO3-Solr#2578, TYPO3-Solr#2696

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dkd-kaehm dkd-kaehm force-pushed the task/refactor_indexing_stack branch from c3f6163 to 98e596e Compare March 27, 2026 15:00