Pietrzak Roman
Kemu Studio - yosh.ke.mu
First version: 2015
Last changes: 2015.11.18
All rights reserved

Raspberry Pi: different methods of getting time

I was working on some pure-C user-space code for Raspberry Pi (project nodected) that depends strongly on timing, so it needs to measure time frequently.
So I asked myself some questions:
- how fast are the functions used for time measurement?
- what is the fastest way to measure time, and how much does the measurement call itself (for example a syscall going down to kernel level) disturb the timing?
- what time measurement accuracy can we expect?

Test conditions and platform

I ran all the code on a Raspberry Pi, Revision 000f (from /proc/cpuinfo).
It runs Raspbian (Linux raspberrypi 3.12.28+ #709).
The code itself is compiled without optimization. Running the tests a few times gives very similar results.

The gettime functions - default approach

Intro to API

There are two typical ways to measure time in C code on Linux:

#include <sys/time.h>
int gettimeofday(struct timeval *tv, struct timezone *tz);

and

#include <time.h>
int clock_gettime(clockid_t clk_id, struct timespec *tp);

clock_gettime() reads the clock selected by clk_id; the tests below use it with 4 different clocks.

gettimeofday() fills a structure that holds microseconds, while clock_gettime() fills a structure that holds nanoseconds.
Both structures also carry a separate number-of-seconds field, so both are quite "heavy" for just getting monotonic time.
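
If all you want is a single growing number, the two fields can be folded into one 64-bit value. Below is a minimal sketch of that idea; the helper names (timespec_to_ns, timeval_to_us, monotonic_now_ns) are made up for this example and are not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Fold a struct timespec (seconds + nanoseconds) into one nanosecond count. */
static uint64_t timespec_to_ns(const struct timespec *ts)
{
    return (uint64_t)ts->tv_sec * 1000000000ull + (uint64_t)ts->tv_nsec;
}

/* Fold a struct timeval (seconds + microseconds) into one microsecond count. */
static uint64_t timeval_to_us(const struct timeval *tv)
{
    return (uint64_t)tv->tv_sec * 1000000ull + (uint64_t)tv->tv_usec;
}

/* Hypothetical convenience wrapper: monotonic time as a single 64-bit value. */
static uint64_t monotonic_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return timespec_to_ns(&ts);
}
```

This only changes how the result is stored. The cost of the underlying call stays the same.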

Timing

Below are the results of a quick test of all these variants, calling each function 1M times:

Test of 1000000 calls to clock_gettime(CLOCK_REALTIME) done in 0.806555[s].
Test of 1000000 calls to clock_gettime(CLOCK_MONOTONIC) done in 0.793926[s].
Test of 1000000 calls to clock_gettime(CLOCK_PROCESS_CPUTIME_ID) done in 1.821785[s].
Test of 1000000 calls to clock_gettime(CLOCK_THREAD_CPUTIME_ID) done in 1.604903[s].
Test of 1000000 calls to gettimeofday() done in 0.782640[s].

Code was a simple loop:

unsigned int cnt = 1000000;
for (unsigned int i = 0; i < cnt; ++i)
{
    clock_gettime(clock_type, &time_test_clock_gettime);
}
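
The code around the loop is not shown above, so here is only a rough sketch of how such a measurement could be wrapped, not the exact code behind the numbers. The function name time_clock_gettime() is made up for this example, and using CLOCK_MONOTONIC as the stopwatch is my assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Returns the seconds spent on `calls` back-to-back clock_gettime() calls
   for the given clock, or a negative value if that clock is not available. */
static double time_clock_gettime(clockid_t clock_type, unsigned int calls)
{
    struct timespec start, stop, scratch;

    /* Check once that the clock under test exists on this system. */
    if (clock_gettime(clock_type, &scratch) != 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);   /* stopwatch start */
    for (unsigned int i = 0; i < calls; ++i)
        clock_gettime(clock_type, &scratch);  /* the call being measured */
    clock_gettime(CLOCK_MONOTONIC, &stop);    /* stopwatch stop */

    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_clock_gettime(CLOCK_MONOTONIC, 1000000) and printing the result with %f gives a line comparable to the ones above. With glibc versions older than 2.17, clock_gettime() lives in librt, so link with -lrt.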

Accuracy

I wanted to run the code in a simple loop (pseudo-code):

nextCallTime = getSomeTime() + x;
while (1)
{
    now = getSomeTime();
    if (now > nextCallTime)
    {
        doSomething();
        nextCallTime = now + x;
    }
}
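
For reference, the pseudo-code above could be turned into real C roughly like this. It is a sketch under my own assumptions: the names now_ns(), poll_loop(), count_call() and calls_made are hypothetical, the loop stops after a fixed number of calls instead of running forever, and it also records the worst observed lateness.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic time as a single nanosecond count. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Stand-in for doSomething(): only counts how often it was called. */
static unsigned int calls_made;
static void count_call(void) { ++calls_made; }

/* Busy-polls the clock and calls do_something() once per period_ns,
   `count` times in total. Returns the worst lateness seen, in ns. */
static uint64_t poll_loop(uint64_t period_ns, unsigned int count,
                          void (*do_something)(void))
{
    uint64_t worst_late = 0;
    uint64_t next_call_time = now_ns() + period_ns;

    while (count > 0) {
        uint64_t now = now_ns();
        if (now > next_call_time) {
            uint64_t late = now - next_call_time;
            if (late > worst_late)
                worst_late = late;
            do_something();
            next_call_time = now + period_ns; /* as in the pseudo-code */
            --count;
        }
    }
    return worst_late;
}
```

For example, poll_loop(1000, 100, count_call) asks for 100 calls spaced at least 1 us apart.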

My code was supposed to control an external GPIO pin, with a sampler on the other side of the GPIO running at 50 MHz.
Initially I wanted an accuracy of at least 1 microsecond, which is 50 cycles of that sampler.

After some testing, clock_gettime(CLOCK_MONOTONIC) turned out to be the best one. However, the timing was still fluctuating a lot:

PROBLEM A: The structure filled by the function call holds nanoseconds. However, during initial tests the "jump" between calls was always divisible by 1000. This suggested that the clock step under the hood is 1 microsecond (to be verified later!).
PROBLEM B: The response of the above loop on the GPIO had an accuracy of +-900 ns, mostly late by a random time (0 - 1000 ns). The sampler was reporting +-50 cycles of accuracy (at 50 MHz), too much error for my needs.
PROBLEM C: From time to time (every few microseconds) I could see longer delays (a lot more than 10 us).
Simply put, there was a high chance of my process losing the CPU - I think because both clock_gettime() and gettimeofday() are system calls.

A quick call to clock_getres() returns a resolution of 1 ns, but the effective resolution seen here is clearly not 1 ns.
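
One way to compare the reported resolution with what the clock actually does is to take the greatest common divisor of the differences between consecutive readings. This is only a sketch of that check, and the function names (reported_resolution_ns, observed_granularity_ns) are made up for this example. If the 1 us theory holds, I would expect the second function to return a multiple of 1000 on the Pi.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Resolution as reported by clock_getres(), in ns, or -1 on error. */
static long long reported_resolution_ns(clockid_t clk)
{
    struct timespec res;
    if (clock_getres(clk, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

static long long gcd_ll(long long a, long long b)
{
    while (b != 0) {
        long long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD of all non-zero differences between consecutive readings, in ns.
   Returns 0 if the clock never advanced during the sampling. */
static long long observed_granularity_ns(clockid_t clk, unsigned int samples)
{
    struct timespec prev, cur;
    long long g = 0;

    clock_gettime(clk, &prev);
    for (unsigned int i = 0; i < samples; ++i) {
        clock_gettime(clk, &cur);
        long long diff = (long long)(cur.tv_sec - prev.tv_sec) * 1000000000LL
                       + (cur.tv_nsec - prev.tv_nsec);
        if (diff > 0)
            g = gcd_ll(g, diff); /* gcd(0, d) == d, so the first step seeds g */
        prev = cur;
    }
    return g;
}
```

On a clock that really steps in nanoseconds the result will usually collapse to 1 after a few samples.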

Timers accuracy explained

The error of measurement itself (problem B)

Important: this chapter assumes that the clock step under the hood is 1 microsecond, even though the structure reports nanoseconds. This is verified in the next chapter.

The picture below shows the expected timing of a time measurement (almost ideal):

The clock_gettime() call takes some time to complete, but hopefully it returns with a small delay (red rectangle). We're a little bit late, but the accuracy is not that bad.

However, the clock_gettime() call is obviously made of a few steps inside: calling into the kernel, getting a "pure number" time from the hardware, doing the math around it to fill the structure, etc.
The 2nd picture shows the bad case:

The call itself takes roughly 700-800 ns, and somewhere in the middle it reads the real time value from the hardware, so the number returned by the gettime() function is ALREADY DELAYED.
Take a look at the 2nd getTime() call in the picture above: it samples the time before the first 1 us tick has happened, but it returns after that tick has already passed.
Keep in mind that the timer step is 1 us, so whatever the delay of the call itself is, the function still reports a time of 0 us.
We need one more call to get the correct time, which again takes 700-800 ns, so we may end up at 2 us.
The result: we missed 2 us!

This explains the existence of PROBLEM B.
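
The same reasoning can be played through with a toy model. The sketch below is not a measurement; every number fed into it is an assumption. It models a counter that only advances in whole ticks, a call that samples the counter some time after it starts and returns later, and a loop that keeps calling until the sampled value is greater than the deadline (the same ">" as in the pseudo-code). The function name simulate_lateness_ns() is made up for this example.

```c
/* Toy model of the polling loop on a coarse counter. All times are in ns.
   phase_ns:         true time at which the first call starts
   deadline_ns:      the nextCallTime the loop is waiting for
   call_cost_ns:     how long one time-reading call takes (must be > 0)
   sample_offset_ns: how long after its start the call samples the counter
   tick_ns:          the step of the hardware counter
   Returns how late (in true time) the loop notices the deadline. */
static long simulate_lateness_ns(long phase_ns, long deadline_ns,
                                 long call_cost_ns, long sample_offset_ns,
                                 long tick_ns)
{
    long t = phase_ns;
    for (;;) {
        /* The value the call returns: true time rounded down to a tick. */
        long sampled = ((t + sample_offset_ns) / tick_ns) * tick_ns;
        t += call_cost_ns;            /* true time when the call returns */
        if (sampled > deadline_ns)
            return t - deadline_ns;
    }
}
```

With an assumed 750 ns call that samples the counter 375 ns after it starts, a 1000 ns tick and a deadline of 10000 ns, this model reports a lateness between 1375 ns and 2124 ns, depending on the starting phase. The spread is about one call long, and the worst case is around the 2 us described above.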

The clock_gettime() implementation

Let's dig into some kernel code to find how it is done.
I'm not a kernel guy, so it took me a while to understand where to find all the bits... My understanding is (please correct me if I'm wrong):

- The Raspberry Pi kernel source tree can be cloned from this Git repo: https://github.com/raspberrypi/linux.git
- clock_gettime() is a syscall defined in the POSIX timers implementation (kernel/posix-timers.c) - just a few lines of code calling clock_get() from the k_clock struct
- ktime_get() is part of the kernel timekeeping code (kernel/time/timekeeping.c) - a simple while{} loop reading the sequence from the timekeeper struct, plus some math on sec/nsecs
- the timekeeper gets the time from the clocksource (kernel/time/clocksource.c)
- the bcm2708 driver (arch/arm/mach-bcm2708/bcm2708.c) registers as a clocksource through clocksource_register_hz(), which calls __clocksource_register_scale()
- the clocksource calls reclksrc_read() in the BCM driver to get the time value
- the value is copied from the (ST_BASE + 0x04) register, which is CLO (System Timer Counter Lower 32 bits in the System Timer Registers)
- the System Timer Counter is declared to run at 1 MHz

The interesting parts are:

- The clock/timer itself runs at 1 MHz, so the timing is based on 1 microsecond steps anyway. Accuracy better than 1 us is not possible with that timer.
- It may be beneficial to skip the system call, the math around it and all the layers in between, and just read the pure tick count.

Getting direct ticks

Now the idea is simple: let's read the time from the clock directly.
The simplest way to do that is to open /dev/mem and map the system timer registers manually:

#include <fcntl.h>    /* open, O_RDWR */
#include <stdio.h>    /* perror */
#include <stdlib.h>   /* exit */
#include <sys/mman.h> /* mmap */
#include <unistd.h>   /* close */

#define BCM2708_ST_BASE 0x20003000 /* BCM 2835 System Timer */

volatile unsigned *TIMER_registers;

/* Returns CLO (offset 0x04): the lower 32 bits of the System Timer Counter */
unsigned int TIMER_GetSysTick(void)
{
    return TIMER_registers[1];
}

void TIMER_Init(void)
{
    /* open /dev/mem */
    int TIMER_memFd;
    if ((TIMER_memFd = open("/dev/mem", O_RDWR/*|O_SYNC*/) ) < 0)
    {
        perror("can't open /dev/mem (need root?)");
        exit(EXIT_FAILURE);
    }

    /* mmap BCM System Timer */
    void *TIMER_map = mmap(
        NULL,
        4096, /* BLOCK_SIZE */
        PROT_READ /*|PROT_WRITE*/,
        MAP_SHARED,
        TIMER_memFd,
        BCM2708_ST_BASE
    );

    if (TIMER_map == MAP_FAILED)
    {
        perror("mmap error");
        exit(EXIT_FAILURE);
    }

    /* the mapping stays valid after the descriptor is closed */
    close(TIMER_memFd);

    TIMER_registers = (volatile unsigned *)TIMER_map;
}

Now call TIMER_Init() during the initialization of your code, and call TIMER_GetSysTick() to get the tick value.
The obvious catch is that we need root to access /dev/mem. In my case the process was running as root anyway (accessing GPIO and some other stuff).
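
One thing to keep in mind: the value is a 32-bit counter running at 1 MHz, so it wraps around after 2^32 us, which is roughly every 71.6 minutes. Comparing two readings with plain "<" or ">" breaks at the wrap. Below is a small sketch of wrap-safe helpers; the names ticks_elapsed() and tick_deadline_reached() are made up for this example.

```c
#include <stdint.h>

/* Ticks (microseconds) elapsed between two readings of the 32-bit counter.
   Unsigned subtraction is done modulo 2^32, so the result is still correct
   if the counter wrapped once between the two readings. */
static uint32_t ticks_elapsed(uint32_t earlier, uint32_t later)
{
    return later - earlier;
}

/* Returns 1 once `now` has reached `deadline`, 0 otherwise.
   Valid as long as the two values are less than 2^31 ticks apart. */
static int tick_deadline_reached(uint32_t now, uint32_t deadline)
{
    return (uint32_t)(now - deadline) < 0x80000000u;
}
```

In the polling loop this would look like: compute the deadline as TIMER_GetSysTick() + x, then poll until tick_deadline_reached(TIMER_GetSysTick(), deadline) returns 1.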

With that implementation I reran the timing test (the 1M calls test, see above):

Test of 1000000 calls to getSysTick() done in 0.109953[s].

The stability of my code, which depends on high-resolution time, is now really what I needed:

- a call to measure time now takes around 110 ns instead of 800 ns
- we no longer get the value through a system call, so there is less chance of the thread getting preempted
- the implementation is simple, but it needs root privileges

FAQ - stuff from your emails

Q: Does the "Getting direct ticks" approach take over the (Raspbian/Linux) system timer? In other words, does this approach change/alter/affect the running OS?
A: No, definitely not. It's just a pure "read" of the hardware 1 us clock/counter. You can find a more detailed explanation in the Broadcom documentation of the CPU.

Q: At company XXXX, we have coding problem YYYY with Raspberry Pi, can you help us?
A: Sure, why not. I'd be happy to provide my consultancy services or to implement something for you - contact me for more details.

Q: I'm not a company, I'm just working on personal project XXXX. I'm stuck with YYYY, can you help me?
A: If the question is interesting - send me detailed info. Maybe it's so interesting I'll try to work out the answer.
If it's not, you can always try the Pi forums. I have a very limited amount of time and a family to feed :) I simply can't work for free to cover your experiments.