Piziheng Embedded: Capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral (full acceleration)


Hello everyone, I am Pi Ziheng, a serious ruffian who works in technology. Today Pi Ziheng will walk you through capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral.

In the previous article, "Capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral (with prefetch)", Pi Ziheng captured the Flash-side timing waveforms of AHB read accesses with Cache turned off but Prefetch turned on. We saw that FlexSPI's Prefetch function does improve Flash access efficiency to a certain extent, but the AHB RX Buffer is at most 1KB (on i.MXRT1050) and, for a single AHB master, cannot be split into finer-grained buffers that cache data at different Flash addresses. As a result, the Prefetch mechanism does not significantly improve access efficiency when the code repeatedly accesses several small data blocks in different Flash regions.

To address the inefficiency of frequently accessing such discontinuous Flash address ranges, the ARM Cortex-M7 core offers a solution: L1 Cache. Today Pi Ziheng continues the tests and looks at Flash AHB read access with L1 Cache turned on (this article focuses mainly on D-Cache):

1. Cache function of Cortex-M7

Within the Cortex-M family (M0+/M3/M4/M7/M23/M33/M35P/M55), L1 Cache exists only on the Cortex-M7 and Cortex-M55 cores. Put bluntly, L1 Cache is provided for the high-performance cores, and the current i.MXRT1xxx series microcontrollers are all based on the Cortex-M7 core.

The following is the core system block diagram of i.MXRT1050. It shows that the chip integrates a 32KB D-Cache. The Cache connects to the SIM_M7 and SIM_EMS modules over a 64-bit AXI bus, which is finally converted to an AHB bus connected to the FlexSPI module, so AHB read accesses to Flash can be accelerated by D-Cache.

A detailed explanation of how D-Cache works can be found in the ARM Cortex-M7 Processor Technical Reference Manual. In short, the 32KB D-Cache is divided into 1024 Cache Lines of 32 bytes each. Every four Cache Lines form a set (the so-called 4-way set associative organization, which gives 256 sets), and each Cache Line carries an address tag that records which target address the cached data came from.
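The geometry described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the TRM: it assumes the textbook set-associative layout in which the low address bits select the byte within a line, the next bits select the set, and the rest form the tag. All helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry taken from the text: 32KB D-Cache, 32-byte lines, 4 ways. */
#define DCACHE_SIZE_BYTES (32u * 1024u)
#define DCACHE_LINE_BYTES (32u)
#define DCACHE_WAYS       (4u)
#define DCACHE_NUM_LINES  (DCACHE_SIZE_BYTES / DCACHE_LINE_BYTES)
#define DCACHE_NUM_SETS   (DCACHE_NUM_LINES / DCACHE_WAYS)

/* Sanity check of the numbers quoted in the text. */
void dcache_check_geometry(void)
{
    assert(DCACHE_NUM_LINES == 1024u);
    assert(DCACHE_NUM_SETS == 256u);
}

/* Hypothetical helper: byte offset of an address inside its Cache Line. */
uint32_t dcache_offset(uint32_t addr)
{
    return addr % DCACHE_LINE_BYTES;
}

/* Hypothetical helper: index of the set the address maps to (ASSUMED layout). */
uint32_t dcache_set_index(uint32_t addr)
{
    return (addr / DCACHE_LINE_BYTES) % DCACHE_NUM_SETS;
}

/* Hypothetical helper: remaining upper address bits, kept as the tag. */
uint32_t dcache_tag(uint32_t addr)
{
    return addr / (DCACHE_LINE_BYTES * DCACHE_NUM_SETS);
}
```

Under this assumed layout, two Flash addresses compete for the same four ways only when they are a multiple of 8KB (256 sets x 32 bytes) apart.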

When L1 D-Cache is enabled, an AHB read access to the target memory is either a Hit (the data to be accessed is in the Cache) or a Miss (the data to be accessed is not in the Cache). A Hit needs little explanation: the data is simply fetched from the Cache. On a Miss, the data is first read from the target memory into the Cache and then read from the Cache. This is called Read-Allocate. The corresponding term Read-Through means reading the data directly from the target memory, which is generally the behavior when the Cache is not enabled.
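The Hit/Miss behavior with Read-Allocate can be sketched with a toy model. The code below is only an illustration under strong simplifications: it models a single 32-byte line in front of a byte array that stands in for Flash, and the type and function names are hypothetical. The `fills` counter plays the role of the CS-low periods seen on the Flash side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u

/* Hypothetical one-line cache in front of a byte array standing in for Flash. */
typedef struct {
    bool     valid;
    uint32_t base;              /* 32-byte aligned address held in the line */
    uint8_t  data[LINE_BYTES];
    unsigned fills;             /* how many times the target memory was read */
} one_line_cache_t;

/* Read one byte with Read-Allocate behavior: on a Miss, fill the whole line
 * from the target memory first; the byte is always returned from the line. */
uint8_t cached_read(one_line_cache_t *c, const uint8_t *flash, uint32_t addr)
{
    assert(c != NULL && flash != NULL);
    uint32_t base = addr - (addr % LINE_BYTES);
    if (!c->valid || c->base != base) {     /* Miss: allocate the line */
        memcpy(c->data, &flash[base], LINE_BYTES);
        c->base  = base;
        c->valid = true;
        c->fills++;
    }
    return c->data[addr - base];            /* served from the Cache */
}
```

Reading two bytes from the same 32-byte block increments `fills` once; reading a byte from a different block increments it again.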

The Cache policy of a target address space is controlled mainly through attribute configuration (in the core's MPU module) and on/off control (in the core's SCB module). The BOARD_ConfigMPU() function below is a typical Cache attribute configuration for the Flash region within the FlexSPI address-mapped space. In this code, the 64MB space starting at 0x60000000 is configured as Normal Memory, non-shareable, with Cache enabled and a Write-Back write policy (the alternative write policy is Write-Through). The read behavior needs no configuration (it is fixed as Read-Allocate).

/* MPU configuration. */
void BOARD_ConfigMPU(void)
{
    /* Disable I cache and D cache */
    SCB_DisableICache();
    SCB_DisableDCache();

    /* Disable MPU */
    ARM_MPU_Disable();

    /* Region 0 setting: Instruction access disabled, No data access permission. */
    MPU->RBAR = ARM_MPU_RBAR(0, 0x00000000U);
    MPU->RASR = ARM_MPU_RASR(1, ARM_MPU_AP_NONE, 2, 0, 0, 0, 0, ARM_MPU_REGION_SIZE_4GB);

    /* Region 2 setting: Memory with Device type, not shareable, non-cacheable. */
    MPU->RBAR = ARM_MPU_RBAR(2, 0x60000000U);
    MPU->RASR = ARM_MPU_RASR(0, ARM_MPU_AP_FULL, 2, 0, 0, 0, 0, ARM_MPU_REGION_SIZE_512MB);

#if defined(XIP_EXTERNAL_FLASH) && (XIP_EXTERNAL_FLASH == 1)
    /* Region 3 setting: Memory with Normal type, not shareable, cacheable, outer/inner write back. */
    MPU->RBAR = ARM_MPU_RBAR(3, 0x60000000U);
    MPU->RASR = ARM_MPU_RASR(0, ARM_MPU_AP_RO, 0, 0, 1, 1, 0, ARM_MPU_REGION_SIZE_64MB);
#endif

    /* Enable MPU */
    ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);

    /* Enable I cache and D cache */
    SCB_EnableDCache();
    SCB_EnableICache();
}
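To see what the Region 3 line actually programs, the sketch below packs the same eight arguments into an MPU_RASR value using the ARMv7-M field positions. `mpu_rasr()` is a hypothetical stand-in written for this illustration, not the CMSIS macro itself; the numeric values 6 (read-only access), 3 (full access), 0x19 (64MB) and 0x1C (512MB) are the encodings assumed here for the corresponding CMSIS constants.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions of the ARMv7-M MPU_RASR register. */
#define RASR_XN_POS   28u
#define RASR_AP_POS   24u
#define RASR_TEX_POS  19u
#define RASR_S_POS    18u
#define RASR_C_POS    17u
#define RASR_B_POS    16u
#define RASR_SRD_POS   8u
#define RASR_SIZE_POS  1u
#define RASR_ENABLE    1u

/* Hypothetical stand-in for ARM_MPU_RASR(), same argument order:
 * (DisableExec, AccessPermission, TypeExtField, IsShareable,
 *  IsCacheable, IsBufferable, SubRegionDisable, Size). */
uint32_t mpu_rasr(uint32_t xn, uint32_t ap, uint32_t tex, uint32_t s,
                  uint32_t c, uint32_t b, uint32_t srd, uint32_t size)
{
    assert(xn <= 1u && ap <= 7u && tex <= 7u);
    assert(s <= 1u && c <= 1u && b <= 1u);
    assert(srd <= 0xFFu && size <= 0x1Fu);
    return (xn << RASR_XN_POS) | (ap << RASR_AP_POS) | (tex << RASR_TEX_POS) |
           (s << RASR_S_POS) | (c << RASR_C_POS) | (b << RASR_B_POS) |
           (srd << RASR_SRD_POS) | (size << RASR_SIZE_POS) | RASR_ENABLE;
}
```

For Region 3, `mpu_rasr(0, 6, 0, 0, 1, 1, 0, 0x19)` sets TEX=0, C=1, B=1, which is the Normal Memory, Write-Back encoding that makes the Flash region cacheable.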

Finally, a brief note on the write-access policies when Cache is enabled, which are not relevant to the main subject of this article:

  • (On a Hit) Write-Through mode: the data is written directly to the target memory and updated in the Cache at the same time (no data consistency problems when multiple masters access the memory, but no improvement in write performance).
  • (On a Hit) Write-Back mode: the Cache Line is marked dirty, and the actual write to the target memory happens only when the line is evicted or explicitly cleaned. This improves write performance but carries a hidden risk: after a Hit only the Cache is updated, the target memory is not, so other masters reading the target memory get stale data.
  • (On a Miss) Write-Allocate: the line containing the target address is first loaded into the Cache, and the write is then handled as in the Hit case.
  • (On a Miss) no-Write-Allocate: the data is written directly to the target memory.

2. D-Cache experiment preparation

Refer to section 1, experiment preparation, of the article "Capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral (no cache)"; this experiment requires the same preparation.

3. D-Cache experimental code

Refer to section 2, experiment code, of the article "Capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral (no cache)". The project and linker file settings for this experiment are the same, but the test function is replaced by the following ramfunc function test_cacheable_read(). There are many different D-Cache tests this time: the system configuration before the while(1) statement stays unchanged, while the statements inside while(1) are adjusted for each test:

#if (defined(__ICCARM__))
#pragma optimize = none
__ramfunc
#endif
void test_cacheable_read(void)
{
    // System configuration
    /* Disable L1 I-Cache */
    SCB_DisableICache();
    /* Enable L1 D-Cache */
    SCB_EnableDCache();
    SCB_CleanInvalidateDCache();
    // Turn the FlexSPI Prefetch feature on or off according to the test requirements

    while (1)
    {
        // Test case code, adjusted for each test
    }
}

To make the data on IO[1:0] distinguishable and help analyze the results of this series of test cases, we need to expand the dedicated const data areas. The ahbRdBuffer sections are set up as follows:

const uint8_t ahbRdBlock1[1024] @ ".ahbRdBuffer1" = {
    // Forward order
    0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x20, 0x21, 0x22, 0x23, 0x30, 0x31, 0x32, 0x33,
    // Reverse order
    0x33, 0x32, 0x31, 0x30, 0x23, 0x22, 0x21, 0x20, 0x13, 0x12, 0x11, 0x10, 0x03, 0x02, 0x01, 0x00,
    // Forward order with each byte pair swapped
    0x01, 0x00, 0x03, 0x02, 0x11, 0x10, 0x13, 0x12, 0x21, 0x20, 0x23, 0x22, 0x31, 0x30, 0x33, 0x32,
    // Reverse order with each byte pair swapped
    0x32, 0x33, 0x30, 0x31, 0x22, 0x23, 0x20, 0x21, 0x12, 0x13, 0x10, 0x11, 0x02, 0x03, 0x00, 0x01,
};

const uint8_t ahbRdBlock2[1024] @ ".ahbRdBuffer2" = {
    // Reverse order with each byte pair swapped
    0x32, 0x33, 0x30, 0x31, 0x22, 0x23, 0x20, 0x21, 0x12, 0x13, 0x10, 0x11, 0x02, 0x03, 0x00, 0x01,
    // Forward order with each byte pair swapped
    0x01, 0x00, 0x03, 0x02, 0x11, 0x10, 0x13, 0x12, 0x21, 0x20, 0x23, 0x22, 0x31, 0x30, 0x33, 0x32,
    // Reverse order
    0x33, 0x32, 0x31, 0x30, 0x23, 0x22, 0x21, 0x20, 0x13, 0x12, 0x11, 0x10, 0x03, 0x02, 0x01, 0x00,
    // Forward order
    0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x20, 0x21, 0x22, 0x23, 0x30, 0x31, 0x32, 0x33,
};

// In the project linker file
keep { section .ahbRdBuffer1, section .ahbRdBuffer2 };
place at address mem:0x60002400 { readonly section .ahbRdBuffer1 };
place at address mem:0x60002800 { readonly section .ahbRdBuffer2 };

4. D-Cache experiment results

4.1 Redoing the experiments from the no-cache article

Now let us redo, with D-Cache enabled, all the experiments from the article "Capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral (no cache)":

#define AHB_ADDR_START (0x60002400)

void test_cacheable_read(void)
{
    // System configuration omitted (I-Cache and Prefetch off, D-Cache on)

    while (1)
    {
        SDK_DelayAtLeastUs(10, SystemCoreClock);
        for (uint32_t i = 1; i <= 8; i++)
        {
            SDK_DelayAtLeastUs(2, SystemCoreClock);
            memcpy((void *)0x20200000, (void *)AHB_ADDR_START, i);
        }
    }
}

4.1.1 AHB_ADDR_START value [0x60002400-0x60002418]

When AHB_ADDR_START lies in [0x60002400-0x60002418], the Flash-side timing waveforms are all the same as the one below. Because of D-Cache, we no longer see a periodic CS signal. In other words, only the first access to a new Flash address has to read Flash through the FlexSPI peripheral; all subsequent accesses to the same Flash address are served directly from D-Cache.

In addition, the start address cached by D-Cache is always 32-byte aligned, and 32 bytes of data are cached at a time (because a D-Cache Line is 32 bytes). In the waveform, therefore, the start address is always 0x60002400 and 32 bytes are read in one go (stored in one D-Cache Line). As a result, the waveform differences between differently aligned addresses that the AHB Burst Read strategy caused when neither D-Cache nor Prefetch was enabled no longer appear here.

4.1.2 AHB_ADDR_START = 0x60002419

When the Flash data read by the code straddles two adjacent 32-byte aligned blocks (0x60002400-0x6000241f and 0x60002420-0x6000243f), two CS-active periods appear on the Flash side, each transferring 32 bytes of data. D-Cache keeps working as before; this time two D-Cache Lines are used (D-Cache is 32KB in total, with 1024 Cache Lines), so we still cannot see a periodic CS signal on the Flash side.

4.1.3 Additional experiment: read 1KB from 0x60002400

When the code reads 1KB of data in a loop, 32 CS-active periods can be seen in the waveform. Each CS-active period transfers 32 bytes, for a total of 1KB. D-Cache uses 32 Cache Lines this time, and on the Flash side we still cannot see a periodic CS signal.
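The three results in section 4.1 follow from counting how many 32-byte Cache Lines a copy touches, since each newly touched line costs one CS-active period of 32 bytes. A minimal sketch of that count (the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u

/* Hypothetical helper: number of 32-byte Cache Lines covered by the
 * byte range [start, start + len). */
uint32_t lines_touched(uint32_t start, uint32_t len)
{
    assert(len > 0u);
    uint32_t first = start / LINE_BYTES;              /* line of first byte */
    uint32_t last  = (start + len - 1u) / LINE_BYTES; /* line of last byte  */
    return last - first + 1u;
}
```

With the longest copy in the loop being 8 bytes, any start address up to 0x60002418 stays within one line, 0x60002419 spills into the next line, and a 1KB copy from 0x60002400 covers 32 lines.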

4.2 Redoing the experiments from the with-prefetch article

Now let us redo, with D-Cache enabled, all the experiments from the article "Capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral (with prefetch)":

4.2.1 Cyclically read a data block of any length within a 1KB space whose first address is 32-byte aligned, with the copy start address in the first 31 bytes

In this case, the actual Flash-side waveform is similar to the result of test 4.1 in the article "Capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral (with prefetch)". The Prefetch mechanism provides the first level of caching, and D-Cache takes its data from the Prefetch Buffer as a second level of caching. The only difference is that, because of D-Cache, the cache start address may change (from 8-byte alignment to 32-byte alignment):

#define PREFETCH_TEST_ALIGNMENT (7)  // Possible values: 0-31
#define PREFETCH_TEST_START     (0x60002400 + PREFETCH_TEST_ALIGNMENT)

uint32_t testLen = 0x1;  // Possible values: 1-(1KB-PREFETCH_TEST_ALIGNMENT)

void test_cacheable_read(void)
{
    // System configuration omitted (I-Cache off, Prefetch on, D-Cache on)

    while (1)
    {
        memcpy((void *)0x20200000, (void *)PREFETCH_TEST_START, testLen);
    }
}

4.2.2 Cyclically read a data block larger than 1KB, or a 1KB data block whose first address is not 32-byte aligned

In this case, two complete 1KB Prefetch operations appear on the Flash side. The first Prefetch operation reads 1KB at 0x60002400, and the second reads 1KB at 0x60002800. Because of D-Cache, the second Prefetch operation has enough time to complete, so there is no need to insert an extra software delay to keep it from being interrupted by the next access request of the while(1) loop:

void test_cacheable_read(void)
{
    // System configuration omitted (I-Cache off, Prefetch on, D-Cache on)

    while (1)
    {
        memcpy((void *)0x20200001, (void *)0x60002401, 0x400);
    }
}

4.2.3 Cyclically read two different data blocks (located in two different 1KB spaces whose first addresses are 32-byte aligned)

In this case, even with D-Cache, the Prefetch operation during the first CS (triggered by memcpy((void *)0x20200000, (void *)0x60002400, 0x100);) is still interrupted by the Prefetch operation of the second CS (triggered by memcpy((void *)0x20200400, (void *)0x60002800, 0x100);). The Prefetch operation during the second CS, however, is not interrupted again, because the Flash data requested when while(1) loops back is already cached in D-Cache:

void test_cacheable_read(void)
{
    // System configuration omitted (I-Cache off, Prefetch on, D-Cache on)

    while (1)
    {
        memcpy((void *)0x20200000, (void *)0x60002400, 0x100);
        memcpy((void *)0x20200400, (void *)0x60002800, 0x100);
    }
}

4.3 How to see a periodic CS signal when D-Cache is enabled

After testing so many cases, is it still possible to see a periodic CS signal on the Flash side, that is, Flash being read continuously? Of course. We know the total size of D-Cache is 32KB, so as soon as we cyclically copy more than 32KB of data, D-Cache can no longer hold all of it. Sure enough, the following code brings back the long-lost periodic timing waveform (careful: Flash consumes more power when it keeps working, haha).

void test_cacheable_read(void)
{
    // System configuration omitted (I-Cache off, Prefetch on, D-Cache on)

    while (1)
    {
        memcpy((void *)0x20200000, (void *)0x60002400, 0x8000 + 1);
    }
}
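Why one extra byte is enough can be explored with a tag-only software model of a 32KB, 4-way D-Cache. This is a rough sketch under explicit assumptions: round-robin replacement within a set is chosen here purely for simplicity and is not claimed to be the Cortex-M7 policy, only the source reads of the memcpy are modeled, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u
#define NUM_SETS   256u
#define NUM_WAYS   4u

/* Hypothetical tag-only model of a 32KB, 4-way set-associative D-Cache. */
typedef struct {
    bool     valid[NUM_SETS][NUM_WAYS];
    uint32_t tag[NUM_SETS][NUM_WAYS];
    uint8_t  next_way[NUM_SETS];   /* ASSUMPTION: round-robin replacement */
} dcache_model_t;

void dcache_reset(dcache_model_t *m)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
}

/* Returns true when the line holding addr had to be fetched (a Miss). */
bool dcache_touch(dcache_model_t *m, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;
    for (uint32_t w = 0; w < NUM_WAYS; w++) {
        if (m->valid[set][w] && m->tag[set][w] == tag) {
            return false;                      /* Hit */
        }
    }
    uint32_t victim = m->next_way[set];        /* Miss: replace one way */
    m->valid[set][victim] = true;
    m->tag[set][victim]   = tag;
    m->next_way[set]      = (uint8_t)((victim + 1u) % NUM_WAYS);
    return true;
}

/* Count line fills for one pass of reading len bytes starting at start. */
uint32_t misses_in_pass(dcache_model_t *m, uint32_t start, uint32_t len)
{
    assert(len > 0u);
    uint32_t misses = 0;
    uint32_t first  = start / LINE_BYTES;
    uint32_t last   = (start + len - 1u) / LINE_BYTES;
    for (uint32_t line = first; line <= last; line++) {
        if (dcache_touch(m, line * LINE_BYTES)) {
            misses++;
        }
    }
    return misses;
}
```

In this model, a second pass over exactly 0x8000 bytes from 0x60002400 causes no line fills, while a second pass over 0x8000 + 1 bytes keeps missing in the one set that now has to hold five lines. That is consistent with CS activity returning on every loop iteration; the exact number of fills per loop on real hardware depends on the actual replacement policy.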

That concludes Pi Ziheng's introduction to capturing real Flash signal waveforms to examine AHB read access under the i.MXRT FlexSPI peripheral. Where is the applause~~~

Welcome to subscribe

This article is also published on my cnblogs (博客园) home page, CSDN home page, Zhihu home page, and WeChat official account.

Search "Piziheng Embedded" on WeChat to read new articles on your phone as soon as they are published.