BETA CUDA interface: support for approximate mode and time-based APIs #917
| static int CUDAAPI | ||
| pfnDisplayPictureCallback(void* pUserData, CUVIDPARSERDISPINFO* dispInfo) { | ||
| BetaCudaDeviceInterface* decoder = |
Nit: I prefer auto when the expression on the right has the literal type we're getting on the left.
| parserParams.pfnSequenceCallback = pfnSequenceCallback; | ||
| parserParams.pfnDecodePicture = pfnDecodePictureCallback; | ||
| parserParams.pfnDisplayPicture = nullptr; | ||
| parserParams.pfnDisplayPicture = pfnDisplayPictureCallback; |
This is the key difference, correct? That is, by registering this callback, we get the new behavior and can delete all of the relevant code?
yes that's correct
| int BetaCudaDeviceInterface::frameReadyInDisplayOrder( | ||
| CUVIDPARSERDISPINFO* dispInfo) { | ||
| readyFrames_.push(*dispInfo); | ||
| return 1; // success |
To clarify for my understanding, when the frameReadyInDisplayOrder callback is triggered, the parser has written the one frame that is next according to PTS in CUVIDPARSERDISPINFO?
Are the function signatures for this and other callbacks defined somewhere in documentation?
Yes, your understanding is correct! The CUVIDPARSERDISPINFO struct contains two key fields:
torchcodec/src/torchcodec/_core/nvcuvid_include/nvcuvid.h
Lines 501 to 509 in 6377dfc
- timestamp, which is the pts of the frame
- the picture_index field, which uniquely identifies the frame. That's not the "frame index" as we have it in torchcodec, it's just an index internal to nvdec. That's what we then use here to "map" the frame:
torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp
Lines 512 to 513 in 6377dfc
Are the function signatures for this and other callbacks defined somewhere in documentation?
Not in the docs, but in the headers:
torchcodec/src/torchcodec/_core/nvcuvid_include/nvcuvid.h
Lines 529 to 533 in 6377dfc
Strictly speaking, this is the callback we're defining:
torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp
Lines 49 to 53 in 6377dfc
It's a pure C function that calls the corresponding method on the Interface object. We have to go through these gymnastics because the pure C callbacks have no notion of the Interface object.
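To illustrate the trampoline pattern described above, here is a minimal sketch. It does not use the real NVCUVID headers: `DispInfo` is a hypothetical stand-in for CUVIDPARSERDISPINFO that keeps only the two fields discussed, and `Interface` and `displayPictureTrampoline` are made-up names, not the ones in BetaCudaDeviceInterface.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Hypothetical stand-in for CUVIDPARSERDISPINFO (assumption: only the two
// fields discussed in this thread are modeled).
struct DispInfo {
  int picture_index;    // decoder-internal index, not a torchcodec frame index
  long long timestamp;  // pts of the frame
};

// Hypothetical stand-in for the device interface object.
class Interface {
 public:
  // Called once per frame, in display order.
  int frameReadyInDisplayOrder(DispInfo* dispInfo) {
    // Copy the struct: the pointer is only assumed valid during the callback.
    readyFrames_.push(*dispInfo);
    return 1;  // success
  }

  std::size_t numReadyFrames() const {
    return readyFrames_.size();
  }

 private:
  std::queue<DispInfo> readyFrames_;
};

// A C-style callback has no `this`. The object travels through the opaque
// user-data pointer that was handed over when the callback was registered.
int displayPictureTrampoline(void* userData, DispInfo* dispInfo) {
  assert(userData != nullptr);
  auto* self = static_cast<Interface*>(userData);
  return self->frameReadyInDisplayOrder(dispInfo);
}
```

The free function is the only thing the C API ever sees; all state stays in the object, which keeps the callback itself trivial.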
This PR:
If we weren't relying on the NVCUVID callback, then we would have to solve both problems above ourselves, with codec-specific solutions. As a result, this PR also drastically simplifies future support for additional codecs - spoiler, I already added #919 and #920 for HEVC and AV1.
In #910, I described this design alternative and at the time, I thought it wasn't compatible enough with our sendPacket() / receiveFrame() architecture. With #910 now merged as a minimal clean-ish skeleton of the interface, I can reason about this more clearly. And after spending a few days trying (and failing) to solve the frame-reordering problem for H264 only, I came to the conclusion that this solution, in this PR, is well worth it.
This new simplified design does come with a minor trade-off. I explain it in a note, in the code.
Why are approximate mode and time-based APIs now supported? Let's first answer: why weren't they supported before? It was because receiveFrame(avFrame, desiredPts) was only able to return a frame if we were able to find one with the exact desiredPts. In approximate mode, we can't guarantee that desiredPts corresponds to a frame's pts, so there generally can't be a match. Same with time-based APIs: desiredPts may not correspond to where a frame starts.
In this PR, we don't need that exact desiredPts matching logic anymore. But we can still guarantee that receiveFrame returns frames in display order, so we got approximate mode and time-based support for free.
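For readers who want to see the difference concretely, here is a rough sketch contrasting the two lookups. Everything in it is hypothetical (`ReadyFrame`, `receiveExact`, `receiveNext`); it models the idea only and is not the code in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical record for a frame the parser has handed over.
struct ReadyFrame {
  int pictureIndex;
  int64_t pts;
};

// Old idea (sketch): only succeed when some ready frame has exactly
// desiredPts. With approximate or time-based seeks, desiredPts usually
// falls between two frames, so this finds nothing.
bool receiveExact(std::deque<ReadyFrame>& ready, int64_t desiredPts,
                  ReadyFrame& out) {
  for (std::size_t i = 0; i < ready.size(); ++i) {
    if (ready[i].pts == desiredPts) {
      out = ready[i];
      ready.erase(ready.begin() + static_cast<std::ptrdiff_t>(i));
      return true;
    }
  }
  return false;
}

// New idea (sketch): frames arrive already in display order, so simply hand
// back the oldest one. No pts comparison is needed.
bool receiveNext(std::deque<ReadyFrame>& ready, ReadyFrame& out) {
  if (ready.empty()) {
    return false;  // nothing ready yet: the caller should send more packets
  }
  out = ready.front();
  ready.pop_front();
  return true;
}
```

With frames at pts 0, 40 and 80 in the queue, a request for pts 50 fails in `receiveExact` but `receiveNext` still returns the frames one by one in display order, which is all the caller needs.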