Optimizing SD Card Writes

Data Logger Board with Rubber Ducky — The Data Logger Board does not actually use BRAIN as its central processing unit.

Everyone’s heard of black boxes: devices that record data so that, if something goes wrong, investigators can work out why. Did you know that we have a similar device too? Meet the Data Logger Board!

You may be wondering where the BRAIN is, and the answer is, we didn’t use one. Initial testing with SD libraries available on the Internet showed insufficient performance. Additionally, there were concerns about overflowing the MCP2515’s 3-level deep receive buffer while the processor is busy doing SD Card writes. So, we chose the dsPIC33FJ128MC802 chip, which has a higher clock speed, an integrated CAN module (with a 32-level deep buffer), and direct memory access to allow SPI transfers in the background. However, a future project would be to try the optimized code on a BRAIN and see whether it yields the necessary performance.

Of course, ours isn’t made to the same rigorous standards as aviation black boxes. Then again, we’re hoping we will never have to put its survivability to the test. Hopefully it will mainly be used for troubleshooting during the debugging phase.

One of the major functions of the board is to log all relevant data (currently, anything flowing on the CANbus network) to a suitable medium (an SD card). Since it needs to do so at a reasonable pace, we need an optimized library.

Simple math shows the worst case: for a 500kbit/s input (which is what our CANbus will run at) we need at least 62.5kByte/s of write speed. However, to output in hex ASCII, that must be doubled, since one byte takes two characters, which gives a minimum of 125kByte/s. One library available on the Internet achieves 117kByte/s (http://www.roland-riegel.de/sd-reader/benchmarks/, benchmarked on a 14MHz Arduino). That is already short of our target, and all of that work is done on the processor, which leaves little processor time for anything else. Another library, written for the multicore Propeller chip, achieves a whopping 1MByte/s, but programming the Propeller is a pain: it means learning the non-portable and relatively uncommon SPIN language. We also spent an entire day banging our heads against a wall trying to debug a CANbus library for the Propeller, which certainly left a sour taste. With those options gone, we chose to write our own library.
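To make the arithmetic above concrete, here is a minimal sketch in C of the bandwidth calculation and of the hex ASCII expansion that doubles it. The function names (raw_bytes_per_sec, hex_encode) are made up for illustration and are not taken from our firmware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per second needed to keep up with a bus running at bits_per_sec,
 * assuming the worst case where every bit on the bus must be stored. */
static uint32_t raw_bytes_per_sec(uint32_t bits_per_sec)
{
    return bits_per_sec / 8u;
}

/* Hex ASCII encoding: every input byte becomes two output characters,
 * which is why the required write speed doubles.
 * Returns the number of characters written, or 0 if out is too small. */
static size_t hex_encode(const uint8_t *in, size_t n, char *out, size_t out_cap)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t i;

    assert(in != NULL && out != NULL);
    if (out_cap < 2u * n) {
        return 0;
    }
    for (i = 0; i < n; i++) {
        out[2u * i]      = digits[in[i] >> 4];    /* high nibble */
        out[2u * i + 1u] = digits[in[i] & 0x0Fu]; /* low nibble  */
    }
    return 2u * n;
}
```

For a 500000 bit/s bus, raw_bytes_per_sec returns 62500, and doubling that for hex output gives the 125kByte/s minimum quoted above.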

And we have progress! After spending 4 hours implementing the SD Card block write function, a quick round of benchmarks showed that it achieves a raw write speed of ~600kByte/s! Nice!

Data Logger Setup — Device under Test. Many thanks to Ryan for the logic analyzer!

What did we do to get that kind of performance?

We used the SD Card’s WRITE_MULTIPLE_BLOCK (CMD25) command. This appears to give higher performance because the write delay after sending data is incurred only once per write operation, rather than once per block as with single block writes.
We pipelined SPI operations. The simple way to use SPI is to load a byte into the transmit buffer, wait for it to transmit, and then load the next byte. That is inefficient, since the processor does little else during the transmit. That time can instead be used to fetch the next byte from memory and load it into the transmit buffer, so that as soon as the current byte ends, the next one starts. We didn’t quite achieve back-to-back bytes, but we did get around 2/3 efficiency, where efficiency is defined as the time to transmit a byte divided by the time between the starts of consecutive bytes. This is a large improvement over what we had before, which was around 1/2 or 1/3.
We manually inlined SPI transmit operations. Instead of calling a SPI_Transfer function, which blocks until the transmission completes and returns the received byte, we inlined and expanded it into the SPI block write function. This avoids the overhead of a function call and allows the pipelining scheme described above. However, it comes at a cost to layering and readability.
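The pipelining idea can be sketched roughly as follows. This is not our actual firmware: the SPI peripheral is replaced by a small host-side simulation so the code runs on a PC, and spi_tx_buffer_full, spi_tx_load and spi_write_block_pipelined are hypothetical names. On the real chip the two helper functions would be register accesses, and the receive buffer would also have to be drained, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* --- Host-side stand-in for the SPI peripheral (hypothetical) --- */
static uint8_t  sim_wire[64];    /* bytes handed to the "shift register"   */
static size_t   sim_wire_len;    /* number of bytes sent so far            */
static unsigned sim_busy_polls;  /* polls left until the TX buffer is free */

/* Returns nonzero while the simulated transmit buffer is still occupied. */
static int spi_tx_buffer_full(void)
{
    if (sim_busy_polls > 0) {
        sim_busy_polls--;
        return 1;
    }
    return 0;
}

/* Loads one byte into the simulated transmit buffer. */
static void spi_tx_load(uint8_t b)
{
    assert(sim_wire_len < sizeof sim_wire);
    sim_wire[sim_wire_len++] = b;
    sim_busy_polls = 2;  /* pretend the buffer stays full for two polls */
}

/* Pipelined block write: the next byte is fetched from memory right after
 * the current one is loaded, so the fetch overlaps with the transmission
 * instead of happening after it. The load is written inline in the loop
 * rather than hidden behind a blocking transfer-and-return function. */
static void spi_write_block_pipelined(const uint8_t *data, size_t len)
{
    size_t i;
    uint8_t next;

    if (len == 0) {
        return;
    }
    next = data[0];
    for (i = 0; i < len; i++) {
        while (spi_tx_buffer_full()) {
            /* wait only until the buffer can accept another byte */
        }
        spi_tx_load(next);
        if (i + 1 < len) {
            next = data[i + 1];  /* prefetch while the byte shifts out */
        }
    }
}
```

Calling spi_write_block_pipelined with the eight bytes "DuckDuck" leaves those eight bytes, in order, in sim_wire. The simulated buffer only holds 64 bytes, so longer test inputs would need a larger array.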

Optimistically, we hope to push performance up to 1MByte/s and have most of the work done by the DMA module so the main processor is free to do whatever processing work is necessary. We may even be able to implement a CVR (Cockpit Voice Recorder – the other “black box” device)!

Then, possible future work includes:

Taking advantage of DMA – DMA (Direct Memory Access) allows memory-to-SPI transfers to happen in the background, and would likely bring SPI efficiency close to 1. However, limitations (at least on the current chip) include a maximum DMA RAM space of 2kBytes out of the 16kBytes of RAM available.
Possibly improving SPI pipeline efficiency by manually looking into assembly code. This is especially important, since Microchip charges $1000 to enable -O3 optimizations on their gcc-variant compiler. For a small segment of code, a human could probably do better anyways.
Increasing the processor frequency. Like overclocking your computer, but minus the smoke and flames. However, the chip maxes out at 80MHz input frequency, which translates to 40MIPS performance. SPI clock rate is further limited by the datasheet to 10Mbit/s.
Reliability. What use is the device if the code hits an infinite loop and dies?
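
As a rough sanity check on the 1MByte/s goal, the sketch below works out what the 10Mbit/s SPI limit allows. It uses the efficiency definition from earlier (byte transmit time divided by time between bytes) and assumes that 1MByte/s means 1,000,000 bytes per second and that card write delays and command overhead are ignored, so the numbers are upper bounds. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Peak SPI payload rate: one byte per 8 clock cycles, no gaps. */
static uint32_t spi_peak_bytes_per_sec(uint32_t spi_clock_hz)
{
    return spi_clock_hz / 8u;
}

/* Throughput at a given SPI efficiency, expressed as the fraction num/den. */
static uint32_t spi_effective_bytes_per_sec(uint32_t spi_clock_hz,
                                            uint32_t num, uint32_t den)
{
    assert(den != 0 && num <= den);
    return (uint32_t)(((uint64_t)spi_peak_bytes_per_sec(spi_clock_hz) * num) / den);
}

/* Minimum efficiency, in whole percent (rounded up), needed to reach
 * target_bytes_per_sec at the given SPI clock. */
static uint32_t efficiency_percent_needed(uint32_t target_bytes_per_sec,
                                          uint32_t spi_clock_hz)
{
    uint32_t peak = spi_peak_bytes_per_sec(spi_clock_hz);

    assert(peak != 0);
    return (uint32_t)(((uint64_t)target_bytes_per_sec * 100u + peak - 1u) / peak);
}
```

At a 10MHz SPI clock the peak is 1,250,000 bytes per second, 2/3 efficiency gives about 833,333 bytes per second, and reaching 1,000,000 bytes per second needs at least 80% efficiency. That is why DMA, with efficiency close to 1, looks like the most promising route.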

Here is the testing methodology and setup:

dsPIC33FJ128MC802 with a 40MHz oscillator input (20MHz crystal + 2x PLL), giving 20 MIPS processor speed
10MHz (maximum) SPI clock rate
Writes done using CMD25, WRITE_MULTIPLE_BLOCK
Each write operation is 8192 bytes long, consisting of the repeating string, “Duck”.
There are 4 write operations, starting at raw byte addresses 0, 8192, 16384, and 24576. While these may seem like large numbers, they are in actuality a minuscule fraction of the card’s total capacity. We hope that write performance would be constant regardless of address. There’s really no good reason for that not to be true, anyways.
Time for a total write is the time, as shown by the logic analyzer, that the CS line is low. This averages about 12ms.
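Write speed is the number of bytes written divided by the time to write. For 8192 bytes in about 12ms, that works out to somewhere in the 600 to 700kByte/s range (8192 / 0.012 ≈ 683kByte/s), in line with the ~600kByte/s figure quoted earlier.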
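
For anyone wanting to reproduce the numbers, here is a minimal sketch of the test pattern and the speed calculation. The names fill_pattern and write_speed_bytes_per_sec are hypothetical, and the 12ms value is the approximate average read off the logic analyzer, so the result should be treated as approximate too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WRITE_LEN 8192u  /* bytes per write operation in the benchmark */

/* Fills buf with a repeating pattern such as "Duck". */
static void fill_pattern(uint8_t *buf, size_t len,
                         const char *pattern, size_t pattern_len)
{
    size_t i;

    assert(buf != NULL && pattern != NULL && pattern_len > 0);
    for (i = 0; i < len; i++) {
        buf[i] = (uint8_t)pattern[i % pattern_len];
    }
}

/* Write speed: bytes written divided by the time the CS line was low. */
static double write_speed_bytes_per_sec(size_t bytes, double cs_low_seconds)
{
    assert(cs_low_seconds > 0.0);
    return (double)bytes / cs_low_seconds;
}
```

With exactly 8192 bytes and 0.012 seconds, write_speed_bytes_per_sec returns roughly 683,000 bytes per second.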