The phonetic sequence (from Speech2IPA) is only one aspect of pronunciation. We also have tones and intonation, as well as different ways of accenting words using stress and/or pitch (see this blog). One attempt at a standardized notation for these is ToBI, although, just like IPA, there are variants (e.g., for Korean).
There are many datasets and models for transcribing different aspects of this, e.g., English Lexical Stress (CNN, Transformer), English Intonation Mispronunciation, Pitch Accent Detection, Mandarin Pitch Accent, Prosodic Boundaries, and Wav2ToBI. An open question is whether these should be combined with phoneme detection or kept separate.
It would be great to list, compare, and evaluate a number of these approaches to assess where improvement is needed.
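One possible starting point for such a comparison is to score each system's per-word labels against a reference annotation and rank the systems by macro-averaged F1. The sketch below is only an illustration under stated assumptions: it assumes every system outputs exactly one label per word, already aligned with the reference, and the label set, the system names (`system_a`, `system_b`), and all the data are hypothetical, not taken from any of the models or datasets mentioned above.

```python
from collections import Counter


def per_label_f1(reference, predicted):
    """Return {label: (precision, recall, f1)} for two aligned label sequences."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must align one-to-one")
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(reference, predicted):
        if ref == hyp:
            tp[ref] += 1
        else:
            fp[hyp] += 1  # the system emitted a label that was not there
            fn[ref] += 1  # the system missed the reference label
    scores = {}
    for label in set(reference) | set(predicted):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(reference, predicted):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    scores = per_label_f1(reference, predicted)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical per-word pitch accent labels (ToBI-style), for illustration only.
reference = ["H*", "none", "L+H*", "none", "H*", "none"]
systems = {
    "system_a": ["H*", "none", "H*", "none", "H*", "none"],
    "system_b": ["H*", "H*", "L+H*", "none", "none", "none"],
}

# Rank systems from best to worst macro F1.
ranking = sorted(
    systems, key=lambda name: macro_f1(reference, systems[name]), reverse=True
)
for name in ranking:
    print(name, round(macro_f1(reference, systems[name]), 3))
```

Macro averaging is used here because rare categories (such as a bitonal accent) would otherwise be swamped by the frequent "no accent" class. The per-label scores from `per_label_f1` are probably the more useful output for deciding where improvement is needed, since they show which categories each system confuses.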