@AbeJellinek
Member

AbeJellinek commented Jun 17, 2025

With a new type of persistent, non-modal, draggable popup.

This is a mostly-complete prototype; there are a few kinks we still need to work out.

  • Right now it reads your current selection if you have text selected when the popup opens, or the whole document if no text is selected. That's obviously not ideal, and I'd like to come up with a better set of interactions to control the reading position. Reading should probably start at the current position in the document. How should switching the reading position work? Do we want to add a "Read Aloud" context menu item that moves the reading position to the target element?
  • Apple doesn't expose the high-quality "Siri" voices to third-party software. The built-in macOS voices we can access are fairly antique, and the much better "Enhanced" and "Premium" voices have to be downloaded from the Spoken Content System Settings pane, which Apple totally buried in the System Settings redesign. We should think about how to point people there - I don't think a page on the wiki is good enough in this case, because people who try this feature and get one of the bad built-in voices are going to think, "This sounds terrible," and give up.
  • Web Speech implementations have a lot of bugs:
    • Most macOS voices are borderline unintelligible on >1x speed in Firefox. They sound fine in Chrome and Safari. I'd think they'd be using the exact same system APIs.
    • We're not using pause()/resume() because Firefox helpfully unpauses when your computer wakes from sleep, and Chrome doesn't honor pause() before any utterances are queued, along with other bugs. Instead, we have to cancel speaking on pause, which means restarting from the beginning of the line on unpause. I'm not happy about that!
    • Chrome has its own cross-platform "Google" voice options, which sound great... except they take a really long time to start playing, if they ever play at all, and don't fire the right events. I'd like to use them, because they really are good, but we might need to filter them out if we can't find workarounds for the major bugs.
    • Unclear if anyone has tested the WebKit implementation of this API. Safari claims that every single voice is the default voice for its language, and it returns duplicates. It also doesn't return the "Enhanced" or "Premium" macOS voices.
  • Voice selection can be persisted per language with a little glue code in the client.
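For illustration, here is a minimal sketch of the per-language persistence from the last bullet, combined with dropping the duplicate entries Safari returns. `VoiceInfo` mirrors a few fields of the Web Speech `SpeechSynthesisVoice` object; `dedupeVoices`, `VoicePrefs`, and the in-memory store are hypothetical names rather than code from this PR, and treating "same `voiceURI` and `lang`" as a duplicate is an assumption.

```typescript
// Stand-in for the SpeechSynthesisVoice fields used in this sketch.
interface VoiceInfo {
  voiceURI: string;
  name: string;
  lang: string; // BCP 47 tag, e.g. "en-US"
}

// Drop repeated entries (assumed key: voiceURI + lang), keeping the first one.
function dedupeVoices(voices: VoiceInfo[]): VoiceInfo[] {
  const seen: Record<string, boolean> = {};
  return voices.filter((voice) => {
    const key = voice.voiceURI + "|" + voice.lang.toLowerCase();
    if (seen[key]) return false;
    seen[key] = true;
    return true;
  });
}

// Hypothetical preference store: one remembered voiceURI per language tag.
class VoicePrefs {
  private byLang: Record<string, string> = {};

  remember(lang: string, voiceURI: string): void {
    this.byLang[lang.toLowerCase()] = voiceURI;
  }

  // Return the remembered voice if it is still installed,
  // otherwise fall back to the first voice for that language.
  pick(lang: string, voices: VoiceInfo[]): VoiceInfo | undefined {
    const wanted = lang.toLowerCase();
    const candidates = voices.filter((v) => v.lang.toLowerCase() === wanted);
    const savedURI = this.byLang[wanted];
    const saved = candidates.filter((v) => v.voiceURI === savedURI)[0];
    return saved ?? candidates[0];
  }
}
```

In the client, `remember` would presumably write to a pref instead of an in-memory object.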

This only implements Read Aloud for the EPUB and snapshot views, but it should be straightforward to implement for PDFs once we stabilize the API.

Closes zotero/zotero#5327, see discussion in zotero/zotero#5326

@yexingsha

> How should switching the reading position work?

My intuition is that every time you click "play" in the popup, it should check if you have text selected. If yes, then read the selected text; if no, then start from the beginning, or resume the read aloud in progress. But this raises the question of whether to switch reading position if the read aloud in progress was paused. I'm leaning toward yes, since if it doesn't, then the only way to move reading position would be via context menu, which kinda goes against why we have a popup in the first place.

> Do we want to add a "Read Aloud" context menu item that moves the reading position to the target element?

I think yes, because it's still useful, and this functionality does need to be accessible via menu. I'm thinking you should be able to right click both with and without selection, and choose "Read Aloud" or "Read Selected Text", which also opens the popup if it's not open yet.

> We should think about how to point people there - I don't think a page on the wiki is good enough in this case, because people who try this feature and get one of the bad built-in voices are going to think, "This sounds terrible," and give up.

I tried installing one of those premium voices and... yeah it's bad. Can we add an option in the dropdown that says "More Voices..." which directly opens that "Voice" system settings popup? It would also help if we add a help button that opens the wiki page, with a tooltip explaining Zotero can only access the non-Siri voices installed on your system.

> Voice selection can be persisted per language with a little glue code in the client.

That's really good, and makes me think we should separate language out as an option, to not only shorten and simplify the list of voices, but also make it a bit clearer that voice selection is persisted per language.

Other issues:

  • I'm wondering if it makes more sense for the read aloud to start as soon as you click the button in the toolbar, which saves you an extra click on the Play button.
  • I think we should set the speed increment to 0.1 rather than 0.25. Since even the enhanced/premium voices on macOS don't sound all that natural, a jump from 1.00 to 1.25 or 0.75 feels very jarring.
  • Switching voices doesn't seem to work for me.
  • The paragraph highlight doesn't appear consistently at the moment, and conflicts with the actual cursor highlight. We can do what Firefox does, which is give the entire active paragraph a background color, like accent-blue10.
  • Appearance issues:
    • Set the popup to padding: 8px; gap: 8px.
    • It's a small thing, but "1.00×" looks better with the multiplication symbol "×".
    • The chevron in the dropdown should use fill-secondary.
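As a small sketch of the 0.1 speed increment and the "1.00×" label suggested above: the step is computed in whole tenths so repeated clicks don't accumulate floating-point error. The function names and the 0.5 to 2.0 range are assumptions for illustration only.

```typescript
const SPEED_STEP = 0.1;
const MIN_SPEED = 0.5; // assumed lower bound
const MAX_SPEED = 2.0; // assumed upper bound

// Move the speed one step up (direction = 1) or down (direction = -1).
function stepSpeed(current: number, direction: 1 | -1): number {
  // Count in whole steps, then convert back, to avoid drift like 1.2000000000000002.
  const steps = Math.round(current / SPEED_STEP) + direction;
  const clamped = Math.min(Math.max(steps * SPEED_STEP, MIN_SPEED), MAX_SPEED);
  return Math.round(clamped * 100) / 100;
}

// Format with two decimals and the multiplication sign (U+00D7), e.g. "1.00×".
function formatSpeed(speed: number): string {
  return speed.toFixed(2) + "\u00D7";
}
```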

Updated popup layout:

[image: read-aloud-popup]

@AbeJellinek
Member Author

> My intuition is that every time you click "play" in the popup, it should check if you have text selected. If yes, then read the selected text; if no, then start from the beginning, or resume the read aloud in progress.

That works. Selected segments will probably be pretty short, so starting over each time isn't a big deal.

(It also would start over at the beginning of the paragraph anyway!)
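The "start over at the beginning of the paragraph" behavior follows from the cancel-on-pause workaround in the PR description. A rough sketch of that pattern, with a hypothetical `Speaker` interface standing in for the speech engine and `SegmentPlayer` as an invented name:

```typescript
// Stand-in for the parts of the speech engine this sketch needs.
interface Speaker {
  speak(text: string, onEnd: () => void): void;
  cancel(): void;
}

class SegmentPlayer {
  private index = 0;
  private playing = false;
  // Bumped on every speak and cancel so stale onEnd callbacks are ignored.
  private generation = 0;

  constructor(private segments: string[], private speaker: Speaker) {}

  play(): void {
    if (this.playing || this.index >= this.segments.length) return;
    this.playing = true;
    this.speakCurrent();
  }

  // "Pause" by cancelling. The next play() re-reads the current segment from its start.
  pause(): void {
    if (!this.playing) return;
    this.playing = false;
    this.generation++;
    this.speaker.cancel();
  }

  currentIndex(): number {
    return this.index;
  }

  private speakCurrent(): void {
    const generation = ++this.generation;
    this.speaker.speak(this.segments[this.index], () => {
      if (generation !== this.generation) return; // cancelled or superseded
      this.index++;
      if (this.index < this.segments.length) this.speakCurrent();
      else this.playing = false;
    });
  }
}
```

The generation counter is one way to guard against an engine firing an end event for an utterance that was already cancelled.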

> Can we add an option in the dropdown that says "More Voices..." which directly opens that "Voice" system settings popup?

Sure. We can link to Spoken Content (GitHub won't link to this, but it's x-apple.systempreferences:com.apple.preference.universalaccess?SpokenContent), which is probably close enough.

> makes me think we should separate language out as an option

In snapshots, definitely, since the page doesn't necessarily accurately set its lang attribute. We should be able to trust what EPUBs say their language is, although we'd want to at least allow variants (reading an en-US book in en-GB).
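To make "allow variants" concrete, here is a small sketch that matches voices on the primary language subtag and lists exact-region matches first. `LangVoice`, `primaryLanguage`, and `voicesForDocument` are hypothetical names, and the ordering is just one possible choice.

```typescript
interface LangVoice {
  name: string;
  lang: string; // BCP 47 tag, e.g. "en-GB"
}

// Lowercase a tag and tolerate underscore separators ("en_US" -> "en-us").
function normalizeTag(tag: string): string {
  return tag.trim().toLowerCase().replace(/_/g, "-");
}

// Primary language subtag: "en-GB" -> "en".
function primaryLanguage(tag: string): string {
  return normalizeTag(tag).split("-")[0];
}

// All voices that share the document's primary language,
// with voices for the document's exact tag listed first.
function voicesForDocument(docLang: string, voices: LangVoice[]): LangVoice[] {
  const wanted = normalizeTag(docLang);
  const base = primaryLanguage(docLang);
  const sameLanguage = voices.filter((v) => primaryLanguage(v.lang) === base);
  const exact = sameLanguage.filter((v) => normalizeTag(v.lang) === wanted);
  const variants = sameLanguage.filter((v) => normalizeTag(v.lang) !== wanted);
  return exact.concat(variants);
}
```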

> Switching voices doesn't seem to work for me.

Did you pull the latest version after the force-push? (d57f0f2, not 08e3bae.) I broke something in the initial commit but I haven't seen any issue with switching voices in the latest.

> The paragraph highlight doesn't appear consistently at the moment

Do you have an example of a snapshot (I'm assuming) where this happens? Did it happen in focus mode? We can change the styling of the highlight, but that sounds like an issue with the node mapping, not the display.

> I think yes, because it's still useful, and this functionality does need to be accessible via menu. I'm thinking you should be able to right click both with and without selection, and choose "Read Aloud" or "Read Selected Text", which also opens the popup if it's not open yet.

> I'm wondering if it makes more sense for the read aloud to start as soon as you click the button in the toolbar

> I think we should set the speed increment to 0.1 rather than 0.25

> Appearance issues

All sound good, on it.

@AbeJellinek force-pushed the read-aloud branch 2 times, most recently from e7bbe28 to 9cddb4c (June 20, 2025 19:28)
@AbeJellinek
Member Author

I've been thinking about how the reading "scope" (where Read Aloud should start and stop reading) should work. Here's one possibility:

  • If you open the popup with no selection, it starts reading from the first visible block of text on the page. So if you're scrolled down (snapshot) or turn the page (EPUB), it won't bring you back to the start of the document.

    Doing anything else would be very annoying IMO; if I open the Read Aloud popup in an EPUB that I'm halfway done with, I obviously don't want it to start reading from the front cover.

    • Same behavior if you right-click and choose Read Aloud without a selection, but it starts reading from the block you right-clicked rather than the top of the page.
  • If you open the popup with a selection, it reads your selection, then pauses. If you then unpause, it keeps reading further on in the document rather than starting from the beginning of your selection again. Skipping back/ahead similarly lets you exit the bounds of your selection if you want to.

    • Same behavior if you right-click and choose Read Aloud (Read Selection / Read Selected Text?) with a selection.
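The proposal above could be sketched roughly as follows. This is a simplification that works on whole blocks (a real selection would cover sub-block ranges), and every name here (`Block`, `StartRequest`, `ReadingScope`, `resolveScope`, `resumeAfterPause`) is hypothetical.

```typescript
interface Block {
  id: string;
  visible: boolean; // true if any part of the block is in the viewport
}

type StartRequest =
  | { kind: "popup" }                            // toolbar button, no selection
  | { kind: "contextMenu"; blockIndex: number }  // right-click on a block, no selection
  | { kind: "selection"; startBlock: number; endBlock: number };

interface ReadingScope {
  startBlock: number;
  // Block after which playback pauses; null means keep reading to the end.
  pauseAfterBlock: number | null;
}

function resolveScope(blocks: Block[], request: StartRequest): ReadingScope {
  switch (request.kind) {
    case "selection":
      // Read the selection, then pause.
      return { startBlock: request.startBlock, pauseAfterBlock: request.endBlock };
    case "contextMenu":
      // Start from the block that was right-clicked.
      return { startBlock: request.blockIndex, pauseAfterBlock: null };
    case "popup": {
      // Start from the first visible block; fall back to the first block if none is visible.
      let firstVisible = 0;
      for (let i = 0; i < blocks.length; i++) {
        if (blocks[i].visible) {
          firstVisible = i;
          break;
        }
      }
      return { startBlock: firstVisible, pauseAfterBlock: null };
    }
  }
}

// Unpausing after a selection has been read continues past it instead of re-reading it.
function resumeAfterPause(scope: ReadingScope): ReadingScope {
  if (scope.pauseAfterBlock === null) return scope;
  return { startBlock: scope.pauseAfterBlock + 1, pauseAfterBlock: null };
}
```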

@yexingsha, thoughts?

@yexingsha

This all sounds reasonable to me. Just to clarify, we can only start at the beginning of a block, right? So if the first visible block only has one line visible, we probably don't want to scroll back to where it begins and start from there. So does "the first visible block of text on the page" mean the first block with a visible starting point? (Though if the visible block on the page starts before and ends after the current page, then I guess we have no choice but to scroll back and start from its beginning.)

Also, I'm still having trouble switching voices on the latest version. The read aloud also seems to start with a different voice, then after pause or restart, it switches over to the first voice in the list and gets stuck there. Switching language works. I get this error when I try to switch voices:

zotero(1)(+0045342): TypeError: this._onSetReadAloudVoice is not a function

    _handleReadAloudStateChange@resource://zotero/reader/reader.js:57898:12
    handleVoiceChange@resource://zotero/reader/reader.js:35279:13
    kj@resource://zotero/react-dom.js:223:217
    jj@resource://zotero/react-dom.js:34:117
    mj@resource://zotero/react-dom.js:34:171
    gh@resource://zotero/react-dom.js:62:95
    Xg@resource://zotero/react-dom.js:63:3
    Ce/<@resource://zotero/react-dom.js:72:334
    Tf@resource://zotero/react-dom.js:189:448
    wg@resource://zotero/react-dom.js:32:481
    Ce@resource://zotero/react-dom.js:65:220
    Be@resource://zotero/react-dom.js:47:64
    zj@resource://zotero/react-dom.js:46:351

JavaScript error: resource://zotero/reader/reader.js, line 57898: TypeError: this._onSetReadAloudVoice is not a function

@AbeJellinek
Member Author

Oh, you’re testing in the client! I’ll open a PR that adds the necessary code there. I’ve been testing in dev mode (in the browser) so far.

@AbeJellinek
Member Author

zotero/zotero#5355 will prevent the error in the client.

@AbeJellinek
Member Author

@dstillman, @yexingsha: This is ready to test. The remote voice is obviously a mock - we need to settle on a provider first.

The big caveat is that it's frankly pretty buggy in Chrome, but the bugs are all on the browser's end. Speech keeps playing after the tab is reloaded, sometimes speech doesn't work at all on any site until the browser is restarted, and the voice list takes a while to populate. That's in addition to the Google server-side voice issue I mentioned before. My focus is on making Read Aloud work well in the client, though. If Chrome is just too buggy, we don't have to enable it in the web library for now.

@AbeJellinek
Member Author

(The mock remote voice won't work in the client because we explicitly block all remote resources there. But I think it would make sense to proxy requests through a zotero:// protocol handler extension anyway so the reader doesn't need to hold API keys.)

@yexingsha

This is great! Issues I've encountered so far:

  • The background color of highlighted paragraphs feels a bit too strong; let's try blue-30 instead.
  • Voice and speed don't seem to persist across different documents.
  • When the only paragraph visible on the page starts from a previous page, read aloud will start at the beginning of the entire document, instead of the start of the paragraph.
  • Reading selected text works very well on its own. But after the selected text finishes reading, there doesn't seem to be a way to get the read aloud to "continue on", or skip to the previous or next paragraph, except for exiting read aloud and restarting it, which always starts on the first paragraph on the page. I'm not sure how common this use case is, but it feels quite awkward.
  • When read aloud is ongoing or paused, using context menu can get it to switch to read selected text, but pausing and restarting in the popup can't. Is this intended?

@AbeJellinek
Member Author

Should all be fixed.

> When read aloud is ongoing or paused, using context menu can get it to switch to read selected text, but pausing and restarting in the popup can't. Is this intended?

Yeah. I removed the pause behavior for the time being because it was just too unexpected IMO, and also caused a performance hit (because every selection change recalculated Read Aloud segments - though that could probably be worked around). You should be able to test it by checking out c470551.

I think it makes sense in concept, but in practice, it seems like it'll be fairly common to pause the audio, select a passage and annotate or copy it somewhere, then start playing again. I definitely would not want reading to restart from my selection position (and, worse, stop at the end of it!) if I did that.

What about a button in the popup that moves the reading position to the current selection instead?

@yexingsha

Looks like there is something wrong with b86bd42. With it the read aloud won't start, and clicking on settings will bug out the entire reader interface and throw this error:

JavaScript error: resource://zotero/reader/reader.js, line 35568: TypeError: params.speed is undefined

Without this commit, the "starting incorrectly at beginning of document" and "no way to continue on after reading selected text" don't seem fixed either, but maybe there's something wrong with my build?

> I think it makes sense in concept, but in practice, it seems like it'll be fairly common to pause the audio, select a passage and annotate or copy it somewhere, then start playing again. I definitely would not want reading to restart from my selection position (and, worse, stop at the end of it!) if I did that.

I tried the commit and see what you mean. Though I feel like we can avoid that by calculating the starting position when the play button is clicked, instead of when text selection changes. If I select something when the read aloud is paused, it shouldn't change the restart position immediately. If I then unselect the text and restart the read aloud, it should just continue from where it was paused. Only if I restart the read aloud when there is selected text should it switch position. Does that make sense?
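In other words, the selection would be read once, at the moment Play is clicked, rather than tracked on every selection change. A minimal sketch of that rule, with invented names (`PlaybackPosition`, `resolveOnPlay`) and segment indices standing in for real document positions:

```typescript
class PlaybackPosition {
  private pausedAt: number | null = null;

  constructor(
    // Segment containing the start of the current selection, or null if nothing is selected.
    private getSelectedSegment: () => number | null,
    // First segment visible on the page, used when there is nothing to resume.
    private getFirstVisibleSegment: () => number,
  ) {}

  pause(atSegment: number): void {
    this.pausedAt = atSegment;
  }

  // Called only when Play is clicked. Selections made and cleared while paused
  // are never observed, so they cannot move the resume position.
  resolveOnPlay(): number {
    const selected = this.getSelectedSegment();
    if (selected !== null) return selected;
    if (this.pausedAt !== null) return this.pausedAt;
    return this.getFirstVisibleSegment();
  }
}
```

This also sidesteps the performance concern, since nothing is recalculated on selection change.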

But perhaps we do need some way to start read aloud at a specific position, not just for selected text, but also in the case where you want to skip ahead a lot of pages, or start on a paragraph that's very low on a page. I've been testing other text-to-speech programs, and Speechify (with their Chrome plugin) handles this by adding a little play button to the left of each paragraph on hover. Do you think this is something that's worth exploring?

@AbeJellinek
Member Author

Did you update zotero/zotero as well? To zotero/zotero@23042f1 (#5355)

@yexingsha

I did, and it works fine if I drop b86bd42.

@AbeJellinek
Member Author

> Only if I restart the read aloud when there is selected text should it switch position. Does that make sense?

Oh, yeah, I can see that. Makes sense to me. What would we do about the active segment highlight, though? Should we hide it while reading is paused and there's a selection, to indicate that the selection will become the new reading target?

> Speechify (with their Chrome plugin) handles this by adding a little play button to the left of each paragraph on hover. Do you think this is something that's worth exploring?

Possibly! I think it would get annoying if we showed it all the time, though, so maybe just when Read Aloud is open?

@AbeJellinek
Member Author

Oh, go into Zotero Settings -> Advanced -> Config Editor, search for reader.readAloudVoices and delete it. Should hopefully work after that.

@yexingsha

It works now! Speech and speed are persisting as expected, and read aloud can now continue on smoothly after reading selected text. But:

  • Starting read aloud on a page where its only paragraph starts on a previous page still sends me to the beginning of the entire document, in both EPUB and snapshot.
  • Starting read aloud on selected text now doesn't work in EPUB, though snapshots are fine.

> What would we do about the active segment highlight, though? Should we hide it while reading is paused and there's a selection, to indicate that the selection will become the new reading target?

I think the opposite: when someone pauses read aloud to annotate, the segment highlight should remain visible to indicate where the read aloud will continue from. Only if they restart the read aloud when there is a selection should the segment highlight change to that.

> I think it would get annoying if we showed it all the time, though, so maybe just when Read Aloud is open?

Yeah, I agree.

@AbeJellinek
Member Author

> its only paragraph starts on a previous page

Ah, thanks, that's what I was missing when I was trying to reproduce this. What do we want to do in that case? We'd like to avoid changing the page when Read Aloud starts, but the best option here might just be to navigate back to the start of the half-cut-off paragraph if there's no fully visible paragraph to read from.

@yexingsha

An alternative I can think of is to "pseudo select" the visible text on the page, so that the read aloud can start at the beginning of the page, and then continue on normally from there. Do you think that's possible, and would that be a better experience than navigating back to the start of the paragraph?

@AbeJellinek
Member Author

I'll see how feasible it would be to find the first sentence boundary (if there is one) on the current page, and start reading from that. If it doesn't find one, it could navigate back one page as a last resort. I think that's relatively OK behavior.
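On plain text, the idea looks roughly like the sketch below. It assumes the paragraph's text and the offset of its first visible character are already known, which is the hard part in a real DOM, and it uses a naive punctuation-based boundary. `firstSentenceStartOnPage` is a hypothetical name.

```typescript
// Offset of the first sentence that starts at or after `visibleFrom`,
// or null if the visible part of the paragraph contains no sentence start.
function firstSentenceStartOnPage(text: string, visibleFrom: number): number | null {
  // The paragraph itself starts on this page.
  if (visibleFrom <= 0) return 0;
  // Naive boundary: ., ! or ? plus optional closing quotes/brackets, then whitespace.
  // Abbreviations such as "e.g." will produce false boundaries.
  const boundary = /[.!?]["')\]]*\s+/g;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(text)) !== null) {
    const sentenceStart = match.index + match[0].length;
    if (sentenceStart >= visibleFrom && sentenceStart < text.length) {
      return sentenceStart;
    }
  }
  return null;
}
```

A `null` result corresponds to the last-resort case of navigating back one page.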

@AbeJellinek
Member Author

It was not too feasible (lots and lots of complicated and slow code for a relatively uncommon edge case), but I added a bunch of fallbacks to at least prevent it from navigating to the start of the document. I think it more or less works OK now? I'll revisit if we end up needing to calculate the first visible bit of text in snapshots for some other reason.

@yexingsha

I tested it and it works pretty well. Is it feasible to scroll to the beginning of the paragraph when read aloud starts?

Reading selection in EPUB is also fixed, but I found another issue: when the selected text is at the end of a paragraph, read aloud will automatically continue on instead of stopping at the end of selection. This happens in both snapshot and EPUB.

@AbeJellinek
Member Author

> Is it feasible to scroll to the beginning of the paragraph when read aloud starts?

Yeah, definitely.

> when the selected text is at the end of a paragraph, read aloud will automatically continue on instead of stopping at the end of selection

I was just working on that (and a related issue, where it can't figure out how to split on a segment at the end of a block of text).

@AbeJellinek
Member Author

OK, how's it looking now?

I'm wondering if we might actually want to highlight individual lines, like we do for annotations. The full-paragraph highlight looked OK when it was usually a full paragraph being read, but now it looks kind of weird:

[image]

We could do something like the annotation style but with slightly rounded corners:

[image]

And maybe expanded by a pixel or two?

@yexingsha

Works very well now!

The annotation style with rounded corners looks pretty good, but yes we should expand it horizontally by 2px, and set its height to line height so that multiple lines appear as an entire block, to make it more distinguishable from selection and annotation.
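As a geometry sketch of that suggestion: each per-line rect is widened by 2px on both sides and stretched to the line height, centered on the original rect, so consecutive lines touch and read as one block. `Rect` and `highlightRects` are hypothetical names, and centering on the original rect is an assumption about how the text sits within the line box.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

const HORIZONTAL_PADDING = 2; // px, per the suggestion above

// Turn tight per-line text rects into line-height-tall highlight rects.
function highlightRects(lineRects: Rect[], lineHeight: number): Rect[] {
  return lineRects.map((rect) => {
    const centerY = rect.y + rect.height / 2;
    return {
      x: rect.x - HORIZONTAL_PADDING,
      y: centerY - lineHeight / 2,
      width: rect.width + 2 * HORIZONTAL_PADDING,
      height: lineHeight,
    };
  });
}
```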

@abaevbog
Contributor

I think I understand the problem with read aloud shortcuts. It would be nice to use simple shortcuts to pause/play/skip read aloud but all simple shortcuts are already taken.

It looks like space is already responsible for playing/stopping read aloud whenever the panel is open. Right/left arrows are natural candidates for skip ahead/back as well. One issue now is that if I focus any button in the reader and press Space, that button will not be clicked but read-aloud will pause/continue. Enter on buttons still works as expected but this conditional Space behavior doesn't feel quite right. The same would apply to left/right arrows if they were to be added.

A few ideas:

  1. Space should probably not always pause/un-pause read-aloud. If I focus the “X” or “Skip Ahead” buttons in the read-aloud panel and press Space, it's almost certain I mean to click the button. Maybe read aloud shortcuts should only apply when the actual reader content is focused?
  2. I agree with the concern that we should be careful with changing shortcuts in the reader depending on the context. It is, though, something that we already do - for instance in the "Find in" popup. Enter/Shift-Enter navigate between search results but only when the focus is in the opened popup. With this, as long as it's clear what's happening, I think different arrow and space handling can be ok. Space to start/stop read aloud (from the document) makes sense to me. Then, arrow right/left could skip ahead/back. I don't know if it's necessary to differentiate between whether read aloud is active or not for arrow handling. Maybe we skip ahead/back (and start reading, if paused) as long as the panel is visible? It's easier to remember and it also resembles screen readers' behavior.
  3. In the scenario above, we could use a simpler way to get to the popup to close it, if you want to go back to default non-read-aloud arrow right/left shortcuts. Should Escape stop read aloud? Or maybe a new shortcut like Cmd-Option-R? It would also allow you to quickly stop read-aloud before navigating to another page and starting read-aloud from a different location without having to tab to the 'X' in the panel.
  4. If we wanted to make it less likely for arrow and space shortcuts to overlap, we could also apply read-aloud shortcuts only as long as the panel is open. Currently, the panel is always there. What if you could collapse the panel without stopping read-aloud? In that case, we could override space/arrow handling of the document only as long as the panel is open.
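Ideas 1, 2, and 4 could be combined into a single routing rule, sketched below: Read Aloud only claims Space and the arrow keys while the panel is open and focus is in the document content, and everything else falls through to the default handling. The `FocusTarget` categories and the function name are assumptions for illustration; the key strings are standard `KeyboardEvent.key` values.

```typescript
type FocusTarget = "document" | "button" | "input" | "other";
type ReadAloudAction = "togglePlay" | "skipBack" | "skipAhead";

// Returns the Read Aloud action for a key press,
// or null to let the reader's default handling run.
function readAloudActionForKey(
  key: string,
  focus: FocusTarget,
  panelOpen: boolean,
): ReadAloudAction | null {
  // Idea 4: shortcuts only apply while the panel is open.
  if (!panelOpen) return null;
  // Idea 1: a focused button or input keeps its normal Space/arrow behavior.
  if (focus !== "document") return null;
  // Idea 2: Space toggles playback, arrows skip.
  switch (key) {
    case " ":
      return "togglePlay";
    case "ArrowLeft":
      return "skipBack";
    case "ArrowRight":
      return "skipAhead";
    default:
      return null;
  }
}
```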

As a separate issue, I don't think it's currently possible to tab to the voices config dropdowns and read-aloud speed slider from the skip back/play/skip ahead section of the popup.
