@facchinm
Member

Fixes #145

@iabdalkader

Does this actually fix the issue? The lock just prevents other callers from accessing the device, and HOLD_ON_CS just makes it not release CS and not disable the SPI device after the transfer is complete.
I scanned the linked issue and the issue it references, quickly, and as far as I can tell there is no solution other than DMA.
If somehow this does help, I think hold-cs should probably be an option passed via settings (added to API etc..).
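
To make the distinction above concrete, here is a minimal host-side model (not Zephyr code) of what a hold-CS flag changes: only what happens to CS when a transfer ends. `FakeBus`, `kHoldOnCs` and `release()` are hypothetical names invented for this sketch; in Zephyr the corresponding pieces are the `SPI_HOLD_ON_CS` flag and `spi_release()`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical flag for this model only; not the Zephyr bit value.
constexpr uint32_t kHoldOnCs = 1u << 0;

// Toy bus that only tracks chip-select state and byte counts.
struct FakeBus {
    bool cs_asserted = false;
    int cs_assert_count = 0;  // how many times CS went active
    size_t bytes_sent = 0;

    // One blocking transfer: assert CS if needed, "clock" the bytes, then
    // release CS unless the hold flag is set.
    void transfer(const uint8_t *data, size_t len, uint32_t flags) {
        (void)data;
        if (!cs_asserted) {
            cs_asserted = true;
            ++cs_assert_count;
        }
        bytes_sent += len;
        if (!(flags & kHoldOnCs)) {
            cs_asserted = false;
        }
    }

    // Explicit release, needed once CS has been held.
    void release() { cs_asserted = false; }
};
```

In this model, two back-to-back transfers toggle CS twice without the flag and once with it, but the bytes inside each transfer are clocked the same way, which is why gaps within a transfer would not be expected to change.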

@mjs513

mjs513 commented Dec 10, 2025

As @KurtE's issue pointed out, this seems to be an open issue in Zephyr: zephyrproject-rtos/zephyr#73930

Has anyone looked at that or at @KurtE's solution? Not sure that is really going to resolve the issue.

@KurtE

KurtE commented Dec 10, 2025

Thanks @facchinm @iabdalkader @mjs513.

@facchinm - which boards did you try this on?

As I mentioned in issue #145, I have tried this approach in the past: turning on the HOLD_ON_CS flag (or its
equivalent) across the board. It helps in some cases, in that on at least some boards it avoids the delay after CS is
asserted and the delay before CS is released. However, on some boards/devices I ran into issues where it may have hung
SPI, especially with multiple devices and/or when using the hardware CS pin(s). Some of these experiences may have been
IMXRT-specific, but I know that on the Teensy several of our display drivers would not function properly with it set
globally, so our output code had different helper functions: one to output with continue and one to output the last byte.

As @mjs513 mentioned, there are some open Zephyr issues on this that appear to cover several interrelated problems and/or things to try, like:

FIFO - is it enabled? Does it work?

Is DMA enabled?

Async Transfer - appears to speed things up, but ran into limits, like there may be a MAX number of bytes you can do
in a transfer. Also on the callback, can you reissue it with updated source pointer? ...
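
The async question above (a maximum transfer size, and reissuing from the completion callback with an advanced source pointer) can be sketched as a small state machine. This is a host-side model only: `kMaxChunk` is a made-up limit and `startAsync` is a hypothetical hook standing in for whatever call starts an asynchronous transfer; neither is taken from a Zephyr driver.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr size_t kMaxChunk = 256;  // assumed per-transfer limit, illustrative only

struct ChunkedSender {
    const uint8_t *next = nullptr;  // start of the chunk currently in flight
    size_t remaining = 0;           // bytes not yet confirmed complete
    size_t inflight = 0;            // size of the chunk currently in flight
    // Hypothetical hook: "start an async transfer of n bytes from p".
    std::function<void(const uint8_t *, size_t)> startAsync;

    void begin(const uint8_t *src, size_t len) {
        next = src;
        remaining = len;
        inflight = 0;
        kick();
    }

    // Call this from the completion callback of the previous chunk.
    void onComplete() {
        next += inflight;       // advance the source pointer
        remaining -= inflight;
        inflight = 0;
        kick();                 // reissue with the updated pointer, if any is left
    }

    bool done() const { return remaining == 0; }

private:
    void kick() {
        if (remaining == 0) return;
        inflight = std::min(remaining, kMaxChunk);
        startAsync(next, inflight);
    }
};
```

Whether a real driver allows a new transfer to be started from inside its callback is exactly the open question raised above; the model only shows the bookkeeping that would be needed if it does.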

And my guess is, it may be different for each different processor.

But keeping my fingers crossed that this at least improves it.

Thanks

@KurtE

KurtE commented Dec 13, 2025

Quick update: I tried out my fairly simple ST7796 driver on the Q. It uses a hack to get hold of the underlying SPI
object:

  uint32_t *p = (uint32_t *)_pspi;
  _spi_dev = (const struct device *)p[1];

It then does several fillRects with different colors and prints out how long they took.
With the current released code, it takes about 9962 ms to output:

  tft.fillScreen(ST77XX_BLACK);
  tft.fillScreen(ST77XX_NAVY);
  tft.fillScreen(ST77XX_DARKGREEN);
  tft.fillScreen(ST77XX_DARKCYAN);
  tft.fillScreen(ST77XX_MAROON);
  tft.fillScreen(ST77XX_PURPLE);
  tft.fillScreen(ST77XX_OLIVE);
  tft.fillScreen(ST77XX_LIGHTGREY);
  tft.fillScreen(ST77XX_DARKGREY);
  tft.fillScreen(ST77XX_BLUE);
  tft.fillScreen(ST77XX_GREEN);
  tft.fillScreen(ST77XX_CYAN);
  tft.fillScreen(ST77XX_RED);
  tft.fillScreen(ST77XX_MAGENTA);
  tft.fillScreen(ST77XX_YELLOW);
  tft.fillScreen(ST77XX_WHITE);
  tft.fillScreen(ST77XX_ORANGE);
  tft.fillScreen(ST77XX_GREENYELLOW);
  tft.fillScreen(ST77XX_PINK);

Note: the fill code uses a temporary buffer that it fills with the color, and then iterates calling spi_transceive:

    beginSPITransaction();
    setAddr(x, y, x + w - 1, y + h - 1);
    writecommand_cont(ST77XX_RAMWR);
    setDataMode();
#if 1
    uint32_t count_pixels = w * h;
    uint16_t array_fill_count = min(count_pixels, sizeof(s_row_buff)/sizeof(s_row_buff[0]));
    struct spi_buf tx_buf = { .buf = (void*)s_row_buff, .len = (size_t)(array_fill_count * 2 )};
    const struct spi_buf_set tx_buf_set = { .buffers = &tx_buf, .count = 1 };
    for (uint16_t i = 0; i < array_fill_count; i++) s_row_buff[i] = color;
    while (count_pixels) {
      spi_transceive(_spi_dev, &_config16, &tx_buf_set, nullptr);
      count_pixels -= array_fill_count;
      if (count_pixels < array_fill_count) {
        array_fill_count = count_pixels;
        tx_buf.len = (size_t)(array_fill_count * 2 );
      }
    }

#else  
    uint16_t color_swapped = (color >> 8) | ((color & 0xff) << 8);
    for (y = h; y > 0; y--) {
      #if 1
      for (uint16_t i = 0; i < w; i++) s_row_buff[i] = color_swapped;
        _pspi->transfer(s_row_buff, w * 2);
      #else
      for (x = w; x > 1; x--) {
        writedata16_cont(color);
      }
      writedata16_cont(color);  // was last
      #endif
    }
#endif
    endSPITransaction();
  }
  // printf("\tfillRect - end\n");
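
The chunking logic in the `#if 1` branch above can be checked on a host in isolation. In this sketch `sendChunk` is a hypothetical stand-in for the `spi_transceive` call and the buffer size is a parameter rather than the size of `s_row_buff`; only the loop structure mirrors the code above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Fill `total_pixels` pixels with `color`, sending at most `buf_pixels`
// pixels per call. Returns the number of send calls made.
inline size_t fillChunked(
    uint32_t total_pixels, uint16_t color, size_t buf_pixels,
    const std::function<void(const uint16_t *, size_t)> &sendChunk) {
    if (total_pixels == 0 || buf_pixels == 0) return 0;

    // The buffer is filled once and reused for every chunk.
    const size_t chunk = std::min<size_t>(total_pixels, buf_pixels);
    std::vector<uint16_t> row(chunk, color);

    size_t calls = 0;
    uint32_t remaining = total_pixels;
    while (remaining > 0) {
        // The last chunk may be shorter than the buffer.
        const size_t n = std::min<size_t>(remaining, chunk);
        sendChunk(row.data(), n);
        remaining -= static_cast<uint32_t>(n);
        ++calls;
    }
    return calls;
}
```

For a 320 x 480 fill with a 480-pixel buffer this makes 320 calls, so any fixed per-call overhead in the real transfer path is paid 320 times per screen.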

With the beginTransaction code change, this code now appears to hang.
If I change the first #if 1 above to #if 0 so that it uses the transfer function, it does not hang.
However, the time goes up to 32842 ms, or about 3.3 times slower.

And using SPI->transfer16... (I could try again, but that was potentially even slower), and it may have other issues.

EDIT: In case anyone wishes to play along.
The ST77xx library is up at: https://github.com/KurtE/Arduino_GIGA-stuff/tree/main/libraries/ST77XX_zephyr
Test sketch: https://github.com/KurtE/Arduino_UNO_Q/tree/main/Test%20Sketches/fillrect_test

@iabdalkader

iabdalkader commented Dec 13, 2025

@KurtE Thanks for taking the time to test this. Could you please provide a brief summary? Does the change in this PR work or not? If it works, does it improve the performance?

About your custom code, assuming you're using a dev core, not a release, you could just add another overload to the SPI library, maybe something like:

void transfer(void *buf, size_t count, size_t datasize);

void arduino::ZephyrSPI::transfer(void *buf, size_t count, size_t datasize) {
	int ret = transfer(buf, count, datasize==8 ? &config : &config16);
	(void)ret;
}

This should let you use our library to rule out any issues with your code. We could probably commit that, or later add data size to settings in the API.

@KurtE

KurtE commented Dec 13, 2025

> @KurtE Thanks for taking the time to test this. Could you please provide a brief summary? Does the change in this PR work or not? If it works, does it improve the performance?

Currently it hangs with my own code. If I use only the SPI.transfer call, it does not hang. The version that runs
uses SPI.transfer(buffer, size): I copy the pixels into a temporary buffer, swapping the bytes, and output that.
If I run that WITHOUT the changes in this PR, the time is about 32853 ms, versus 32842 ms with them, so not much of
an improvement in this case: maybe about 0.034%.

May try some other simpler sketch to get more details

> About your custom code, assuming you're using a dev core, not a release,

Yes, for this test I used a build of the current source code (as of yesterday's sync). The only other change I had was
my deferred camera startup code (a different PR), which simply marks the camera objects as init-deferred. The camera
startup code then starts the PWM clock only IF the user uses the camera library, and then calls the init of DCMI
followed by the init of the actual camera. So it should not impact anything else.

But I could also do this with the released code, as the changes are only in the SPI library.

> you could just add another overload to the SPI library, maybe something like:
>
> void transfer(void *buf, size_t count, size_t datasize);
>
> void arduino::ZephyrSPI::transfer(void *buf, size_t count, size_t datasize) {
> 	int ret = transfer(buf, count, datasize==8 ? &config : &config16);
> 	(void)ret;
> }
>
> This should let you use our library to rule out any issues with your code. We could probably commit that, or later add data size to settings in the API.

I could. Personally, I really, really wish there was another API that allows write-only transfers and/or two buffers, write and read (can be NULL), so that you can, for example, output directly from a buffer without having its contents overwritten.
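
As a sketch of what such an API might look like, here is a hypothetical two-buffer interface with a loopback fake for host-side testing. Neither `TwoBufferSpi` nor `LoopbackSpi` exists in the Arduino SPI library; they are invented here, and the filler byte sent when there is no write buffer is an arbitrary choice.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical interface; not part of the current Arduino SPI API.
class TwoBufferSpi {
public:
    virtual ~TwoBufferSpi() = default;
    // tx may be nullptr (send filler bytes); rx may be nullptr (discard input).
    // The tx buffer is const, so the caller's data is never overwritten.
    virtual void transfer(const void *tx, void *rx, size_t count) = 0;
};

// Loopback fake: every byte sent is "received" back unchanged.
class LoopbackSpi : public TwoBufferSpi {
public:
    size_t bytes_clocked = 0;

    void transfer(const void *tx, void *rx, size_t count) override {
        const uint8_t *out = static_cast<const uint8_t *>(tx);
        uint8_t *in = static_cast<uint8_t *>(rx);
        for (size_t i = 0; i < count; ++i) {
            const uint8_t b = out ? out[i] : 0xFF;  // filler when tx is null
            if (in) in[i] = b;                      // only write rx when provided
        }
        bytes_clocked += count;
    }
};
```

A write-only call is then `spi.transfer(pixels, nullptr, n)`, which is the same shape as the `spi_transceive(..., &tx_buf_set, nullptr)` call used in the fill code earlier in this thread.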

Also, I still wonder about the hardware. FIFO?

@iabdalkader

> If I run that WITHOUT the changes in this PR, the times appear to be about 32853 versus with it 32842, so not much of an improvement in this case: maybe about .034%

@KurtE Thanks for confirming. This was expected, since the problem, as you and others noted, seems to be with inter-byte gaps, not with whole transfers. This change might still be helpful if you're transferring a single byte at a time, though I'm not sure if it would make sense to hold CS in that case.

> personally I really really wish there was another API, that allows for write only and/or two buffers, write and read (can be NULL), that allows you to for example output directly out of a buffer without having its contents overwritten.

I agree, the API is lacking, to say the least, but we can try to improve it. We could start by adding a data size arg somewhere.

@KurtE

KurtE commented Dec 13, 2025

@iabdalkader @mjs513 @facchinm @pillo79 - Wondering if we should continue this here, or go back to our original
issue, #145, and/or create several new issues here and/or in Zephyr.

Note: I am mainly throwing darts here, and it has been a while since I really played with trying to improve the SPI speed.

There are probably several different things that should be looked at and steps to resolve them, including:

Are we properly configuring the SPI ports at the hardware level? I believe that changes made in Zephyr over
the last several months allow some of the timing to be configured.

For those processors with a FIFO on SPI, is it properly configured, and does the software work with it?

I wish the SPI object had an API that returns Zephyr-specific information, like a handle to the underlying Zephyr SPI object. That would make it easier for apps/libraries to call the Zephyr methods directly.
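
A minimal sketch of that wish, as opposed to reading the pointer out of the object's memory layout: a wrapper that exposes its device handle through an accessor. `SpiWithHandle` and `nativeDevice()` are hypothetical names and do not exist in the current library; `device` is left as an opaque forward declaration so the sketch compiles without Zephyr headers.

```cpp
// Opaque here; in Zephyr this would be `struct device`.
struct device;

// Hypothetical wrapper; the real ZephyrSPI class has no such accessor today.
class SpiWithHandle {
public:
    explicit SpiWithHandle(const device *dev) : dev_(dev) {}

    // Returns the underlying device handle so callers can use the native
    // driver API directly instead of depending on the class layout.
    const device *nativeDevice() const { return dev_; }

private:
    const device *dev_;
};
```

With something like this, a driver would not need to know where the pointer sits inside the SPI object, so it would keep working if members were added or reordered.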

Transfer APIs: allow you to set word size, write only transfers, maybe Async...

Fine tuning of SPI and drivers. At least earlier with these drivers, I was finding that for things like
setting up the rectangle to draw in, the code has:

  void setAddr(uint16_t x0, uint16_t y0, uint16_t x1, uint16_t y1)
      __attribute__((always_inline)) {
    //if ((x0 != _x0_last) || (x1 != _x1_last)) 
    {
      writecommand_cont(ST77XX_CASET); // Column addr set
      writedata16_cont(x0+_xstart);             // XSTART
      writedata16_cont(x1+_xstart);             // XEND
      _x0_last = x0;
      _x1_last = x1;
    }
    //if ((y0 != _y0_last) || (y1 != _y1_last)) 
    {
      writecommand_cont(ST77XX_RASET); // Row addr set
      writedata16_cont(y0+_ystart);             // YSTART
      writedata16_cont(y1+_ystart);             // YEND
      _y0_last = y0;
      _y1_last = y1;
    }
  }

Earlier I found that in the case here, where we output <8 bit> <16 bit> <16 bit> <8 bit> <16 bit> <16 bit> <8 bit>
items, there is an overhead in switching the underlying SPI from 8-bit mode to 16-bit mode and then switching
back. At the time, it was faster to output the two 16-bit values as four 8-bit values. But if you are going to output
several 16-bit values, then it is faster to switch. (How many outputs does it take to tip it? I don't know.)
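
Both halves of that trade-off can be sketched in a few lines. `packCoords` splits two 16-bit values into four bytes, most significant byte first, so they can go out without leaving 8-bit mode. `breakEvenWords` answers the tipping-point question under a simple linear cost model; the model and its three parameters are assumptions for illustration, not measurements from any board.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Split two 16-bit coordinates into four bytes, high byte first, so the
// payload of a CASET/RASET command can be sent as 8-bit transfers.
inline std::array<uint8_t, 4> packCoords(uint16_t start, uint16_t end) {
    return {{static_cast<uint8_t>(start >> 8), static_cast<uint8_t>(start & 0xFF),
             static_cast<uint8_t>(end >> 8), static_cast<uint8_t>(end & 0xFF)}};
}

// Assumed model: staying in 8-bit mode costs 2 * n * t_byte8 for n words;
// switching costs switch_cost (there and back) plus n * t_word16.
// Returns the smallest n for which switching is strictly faster,
// or 0 if switching never wins under this model.
inline uint32_t breakEvenWords(double switch_cost, double t_word16, double t_byte8) {
    const double saving_per_word = 2.0 * t_byte8 - t_word16;
    if (saving_per_word <= 0.0) return 0;
    return static_cast<uint32_t>(std::floor(switch_cost / saving_per_word)) + 1;
}
```

For example, `packCoords(0x0102, 0x01DF)` yields the bytes 0x01, 0x02, 0x01, 0xDF. The real tipping point would have to be measured per processor, since the switch cost is the unknown here.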



Development

Successfully merging this pull request may close these issues.

SPI Transfer speed - and gaps...

5 participants