The exact time for functions or CPU cycles for any function

Question 1

I'm trying to read input data from the parallel port of a PC and send it to another device. For this purpose, the input data is saved when it becomes available and is sent when a flag is set.

The whole procedure should take less than 10 ms, so I need to know the running time. How can I compute the number of CPU cycles taken by the ReadParallel and WriteParallel functions in my code (which use digitalRead, digitalWrite, bitRead and bitWrite), or by any other instructions?

Here is the code for the Arduino Uno:

```cpp
#define InterruptInputPin 2  // For checking if the input data is available
#define InterruptSensorPin 3 // For checking when to send data
#define IN0 A0
#define IN1 A1
#define IN2 A2
#define IN3 A3
#define OUT0 5
#define OUT1 6
#define OUT2 7
#define OUT3 8

volatile int ParallelInput; // The input data

//*************************************
void ReadParallel() // Reads input and saves it in ParallelInput.
{
    if (digitalRead(IN0) == HIGH)
        bitWrite(ParallelInput, 0, 1);
    else
        bitWrite(ParallelInput, 0, 0);
    if (digitalRead(IN1) == HIGH)
        bitWrite(ParallelInput, 1, 1);
    else
        bitWrite(ParallelInput, 1, 0);
    if (digitalRead(IN2) == HIGH)
        bitWrite(ParallelInput, 2, 1);
    else
        bitWrite(ParallelInput, 2, 0);
    if (digitalRead(IN3) == HIGH)
        bitWrite(ParallelInput, 3, 1);
    else
        bitWrite(ParallelInput, 3, 0);
}

//***************************************
void WriteParallel() // Writes ParallelInput to output.
{
    detachInterrupt(digitalPinToInterrupt(InterruptInputPin));
    digitalWrite(OUT0, bitRead(ParallelInput, 0));
    digitalWrite(OUT1, bitRead(ParallelInput, 1));
    digitalWrite(OUT2, bitRead(ParallelInput, 2));
    digitalWrite(OUT3, bitRead(ParallelInput, 3));
    attachInterrupt(digitalPinToInterrupt(InterruptInputPin), ReadParallel, RISING);
}

void setup() {
    pinMode(IN0, INPUT);
    pinMode(IN1, INPUT);
    pinMode(IN2, INPUT);
    pinMode(IN3, INPUT);
    pinMode(OUT0, OUTPUT);
    pinMode(OUT1, OUTPUT);
    pinMode(OUT2, OUTPUT);
    pinMode(OUT3, OUTPUT);
    attachInterrupt(digitalPinToInterrupt(InterruptInputPin), ReadParallel, RISING);
    attachInterrupt(digitalPinToInterrupt(InterruptSensorPin), WriteParallel, RISING);
}

void loop() {
}
```

Question 2

Note that, in this particular instance, you can get much faster I/O (something like 100 times faster) by using direct port access. For example, `void ReadParallel() { ParallelInput = PINC & 0x0f; }` should compile to not much more than 6 instructions, or even 4 if you declare `ParallelInput` as a `byte` rather than an `int`.
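The equivalence behind that one-liner can be checked on a PC. The sketch below is host-side C++, not Arduino code: a plain `uint8_t` argument stands in for the `PINC` register, it assumes the Uno maps A0 to A3 onto bits 0 to 3 of port C, and the function names are hypothetical.

```cpp
#include <cstdint>

// Stand-in for the question's ReadParallel(): test each of the four input
// bits one at a time and copy it into the result, as the four
// digitalRead()/bitWrite() pairs do. `pinc` plays the role of PINC.
uint8_t readBitByBit(uint8_t pinc) {
    uint8_t result = 0;
    for (int bit = 0; bit < 4; ++bit) {
        if (pinc & (1u << bit)) {
            result |= static_cast<uint8_t>(1u << bit);
        }
    }
    return result;
}

// Direct-port version: keep the low nibble (A0..A3 = PC0..PC3 on the Uno)
// with a single masking operation.
uint8_t readMasked(uint8_t pinc) {
    return static_cast<uint8_t>(pinc & 0x0f);
}
```

On the board, the pins have to be read through `PINC` (the input register); `PORTC` holds the output latch and pull-up settings, not the pin levels.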

Question 3

Method summary

(section added on 2018-01-28)

There are several methods available for timing code. I am adding this preliminary section to my answer in order to provide comparative data on several methods. The methods covered in the table below are those proposed in answers to this question and to a duplicate of it. As this question has been tagged "arduino-uno", all this data assumes an AVR-based board clocked at 16 MHz.

Method comparison:

| method | resolution | max. time | typ. overhead |
|----------------|---------------|-----------------|----------------|
| `millis()` | 1 – 2 ms | 49.7 days | 0.69 – 1.3 μs |
| `micros()` | 4 μs | 71.6 min | 2.8 – 2.9 μs |
| Timer 1 | 0.0625 μs | 4.096 ms | 0.25 – 0.5 μs |
| pin toggling | scope-limited | ∞ | 0.125 μs |
| cycle counting | 0.0625 μs | boredom-limited | 0 |
| looping | N.A. | N.A. | 0.25 μs |
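The "max. time" column and the Timer 1 resolution follow directly from the counter widths and the 16 MHz clock. A quick host-side C++ check of that arithmetic (the function names are hypothetical):

```cpp
// Clock frequency assumed throughout the table: 16 MHz (Arduino Uno).
constexpr double kCpuHz = 16e6;

// millis() returns a 32-bit count of milliseconds: it wraps after 2^32 ms.
constexpr double millisRolloverDays() {
    return 4294967296.0 / 1000.0 / 86400.0;
}

// micros() returns a 32-bit count of microseconds: it wraps after 2^32 us.
constexpr double microsRolloverMinutes() {
    return 4294967296.0 / 1e6 / 60.0;
}

// Timer 1 is a 16-bit counter; clocked at F_CPU it wraps after 2^16 cycles.
constexpr double timer1RolloverMs() {
    return 65536.0 / kCpuHz * 1e3;
}

// Duration of one CPU cycle (the Timer 1 resolution), in microseconds.
constexpr double cycleMicros() {
    return 1e6 / kCpuHz;
}
```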

The methods are characterized by the following criteria:

- Resolution is the granularity of the measurement; the smaller, the better.
- Maximum measurable time: any method that does timing arithmetic on the Arduino is prone to overflow when measuring intervals that are too long. Note that a timer rolling over to zero during the measurement is not a problem, as long as the period being measured is shorter than the rollover period.
- Typical measurement overhead: the code used to measure the timing itself takes a finite time to execute, so one ends up measuring the execution time of the "instrumented" code, which is slightly longer than that of the code one is trying to profile. This overhead should in principle be subtracted from the result, but it is often not known exactly, as it depends on how the compiler optimizes both the instrumented and the non-instrumented code.
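The rollover remark relies on unsigned modular arithmetic. A minimal host-side C++ sketch, assuming the readings come from a free-running 16-bit counter (like `TCNT1`) or a 32-bit one (like the values returned by `micros()`); the helper names are hypothetical:

```cpp
#include <cstdint>

// Elapsed ticks between two readings of a free-running 16-bit counter.
// The subtraction happens in int (integer promotion); the cast back to
// uint16_t reduces it modulo 2^16, so one rollover is handled for free.
uint16_t elapsed16(uint16_t start, uint16_t finish) {
    return static_cast<uint16_t>(finish - start);
}

// Same idea for 32-bit counters such as micros() and millis().
uint32_t elapsed32(uint32_t start, uint32_t finish) {
    return finish - start;  // unsigned arithmetic wraps modulo 2^32
}
```

If the counter rolls over more than once during the measurement, the result is silently wrong: that is the maximum-time limit listed in the table.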

The methods listed in the table are:

- `millis()`, which may be the most obvious choice, as it is so well known. Its low resolution, however, makes it ill-suited for timing code execution. Note that the `millis()` counter is incremented every 1024 μs. Most of the time it is incremented by 1 but, roughly every 43 ms, it is incremented by 2 in order to avoid creeping drift. This is why its resolution is stated as "1 – 2 ms" in the table.
- `micros()`, as proposed in ratchet freak's answer, is usually a good choice, the main caveat being its 4 μs resolution, when one could naively expect 1 μs. It also has a significant overhead.
- Timer 1, which is discussed in the second part of this answer, is my favorite: it has single-cycle resolution and low overhead. However, it is incompatible with other uses of the timer (PWM, Servo library...). It is also limited to measuring short delays.
- Pin toggling, as proposed by 4ilo and by myself in the third part of this answer, is ideal if you have an oscilloscope handy. Any half-decent scope should provide single-cycle resolution. It is also minimally invasive on the code being measured and has minimal overhead.
- Cycle counting, as proposed in Majenko's answer, is arguably the "perfect" method: it is cycle-accurate, does not modify the code and has zero overhead. However, for anything beyond a handful of instructions, it quickly becomes tedious, and it requires some understanding of AVR assembly.
- Looping, as proposed in Michel Keijzers' answer, is not a measurement technique per se. It is meant to be used in conjunction with another technique in order to improve the resolution and dilute the overhead. However, looping carries its own overhead, which is typically 4 CPU cycles per iteration, assuming a 16-bit loop counter.
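The "incremented by 2" behaviour of `millis()` can be reproduced on a PC. The sketch below is a simplified model of the bookkeeping done by the Arduino AVR core (`wiring.c`) at 16 MHz; the struct and member names are hypothetical, so treat it as an illustration rather than the core's actual code.

```cpp
#include <cstdint>

// Model of the millis() bookkeeping at 16 MHz. Timer 0 overflows every
// 1024 us: millis gets +1 ms, and the leftover 24 us are accumulated in
// units of 8 us (24 / 8 = 3). When 125 units (1000 us) have piled up,
// millis gets an extra +1, hence the occasional step of 2.
struct MillisModel {
    uint32_t millis = 0;  // value millis() would return
    uint8_t fract = 0;    // accumulated remainder, in units of 8 us

    // Simulate one Timer 0 overflow; returns the increment applied (1 or 2).
    int onOverflow() {
        const uint8_t kFractInc = 3;    // 24 us / 8
        const uint8_t kFractMax = 125;  // 1000 us / 8
        int inc = 1;
        fract += kFractInc;
        if (fract >= kFractMax) {
            fract -= kFractMax;
            inc = 2;
        }
        millis += inc;
        return inc;
    }
};
```

In this model, after every overflow the counter equals the true elapsed time (overflows × 1.024 ms) rounded down to the millisecond, which is why it never drifts.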

Using Timer 1

(original answer of 2017-10-23)

One technique I often use is to make Timer 1 count at the full CPU speed and use it to time the code I want to profile. For example:

```cpp
volatile uint8_t pin = 2;
volatile uint8_t value = HIGH;

void setup()
{
    Serial.begin(9600);

    // Set Timer 1 to normal mode at F_CPU.
    TCCR1A = 0;
    TCCR1B = 1;  // CS10 only: clock source = F_CPU, no prescaling

    // Time digitalWrite(), with interrupts blocked so they cannot
    // inflate the count.
    cli();
    uint16_t start = TCNT1;
    digitalWrite(pin, value);
    uint16_t finish = TCNT1;
    sei();
    uint16_t overhead = 8;
    uint16_t cycles = finish - start - overhead;
    Serial.print("digitalWrite() took ");
    Serial.print(cycles);
    Serial.println(" CPU cycles.");
}

void loop(){}
```

Note the `volatile` variables, used to prevent the compiler from optimizing them away as constants.

Note also that when you profile some code you are inevitably slowing it down, because of the time taken by the profiling operations themselves. This is what the `overhead` variable above accounts for. In order to know the exact overhead, I start with a guess, compile and disassemble, and then count the number of clock cycles spent in profiling that will be counted by the timer. Then I adjust the overhead value, compile and disassemble again, and make sure the overhead has not changed. This is also when you have to ask yourself what exactly you want to count. Here I am counting the time needed to execute the call to `digitalWrite()`, but not the time needed to get the arguments into the proper registers, as I am artificially slowing this down by making them volatile.
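The arithmetic at the end of that sketch can be tried on a PC. This hypothetical host-side helper subtracts two `TCNT1` readings with 16-bit wrap-around, removes an overhead figure found by disassembly (8 cycles in the example above, but it depends on your code and compiler), and converts the result to microseconds assuming a 16 MHz clock:

```cpp
#include <cstdint>

// Net cycles spent in the profiled code, from two TCNT1 readings taken
// with Timer 1 running at F_CPU (prescaler 1). `overhead` is the number of
// cycles of the instrumentation itself that the timer also counted.
// The result is only meaningful if finish - start >= overhead.
uint16_t netCycles(uint16_t start, uint16_t finish, uint16_t overhead) {
    return static_cast<uint16_t>(finish - start - overhead);
}

// Convert a cycle count to microseconds at the given clock frequency.
double cyclesToMicros(uint32_t cycles, double cpuHz = 16e6) {
    return cycles * 1e6 / cpuHz;
}
```

The second test reading below shows that a timer rollover between `start` and `finish` does not disturb the result.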

This method is good for anything that takes more than roughly a dozen and less than 65,536 clock cycles. For shorter code, clock counting would be simpler, since you have to clock-count the overhead anyway. For longer code, the count would overflow, and you could instead just use `micros()` and live with its inherent inaccuracy.

Toggling a pin

(section added on 2017-10-23)

If you have a scope, there is another method that is minimally invasive on your code: toggle a pin just before and just after the thing you want to time. E.g., assuming you have previously called `pinMode(13, OUTPUT)`:

```cpp
// Set pin 13 HIGH.
PORTB |= _BV(PB5);
// The thing we want to time.
...
// Set pin 13 LOW.
PORTB &= ~_BV(PB5);
```

This will create a pulse that you can measure on the scope. Note that with direct port access, as used here, the overhead is only two CPU cycles, or 125 ns. Also, direct port access does not use any CPU register, so chances are the compiler will not generate less efficient code than it would without the timing instructions.
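Reading the scope then boils down to converting the pulse width to cycles and removing the toggling overhead. A host-side sketch, assuming a 16 MHz clock and the two-cycle overhead quoted above; the function name is hypothetical:

```cpp
#include <cmath>

// Convert a pulse width measured on the scope (in microseconds) to the
// number of CPU cycles spent in the timed code: pulse width times clock
// frequency, rounded to a whole cycle, minus the toggling overhead.
long pulseToCycles(double pulseMicros, double cpuHz = 16e6, long overhead = 2) {
    long total = std::lround(pulseMicros * cpuHz / 1e6);
    return total - overhead;
}
```

For example, a 4.5 μs pulse corresponds to 72 cycles, of which 70 belong to the code being timed.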

Question 4

And without a scope, I/O pin timing can be done with some DMMs. If the meter can measure pulse frequency and duty cycle, you can calculate the interval (pulse width). The Tekpower TP4000ZC is one such DMM, and an inexpensive one at that.
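The calculation is period = 1 / frequency, multiplied by the duty cycle. This only works if the code being timed runs repeatedly (e.g. in `loop()`), so that the pin produces a periodic signal the meter can measure. A hypothetical host-side helper:

```cpp
// Pulse width (high time) in microseconds from a meter's readings:
// the period is 1 / frequency, and the high time is the duty-cycle
// fraction of that period.
double pulseWidthMicros(double frequencyHz, double dutyPercent) {
    return (dutyPercent / 100.0) / frequencyHz * 1e6;
}
```

For instance, a reading of 1 kHz at 25 % duty cycle gives a 250 μs pulse.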

Question 5

Thanks for the answer. I haven't worked with registers and direct port access; can you send me a document or link with more details?

Question 6

@Mehran: The link I already sent you about direct port manipulation is a good introduction. You may then want to take a look at the description of the avr-libc macros commonly used for that task (mostly `_BV()`, which is avr-libc's equivalent of Arduino's `bit()`). Then, the ultimate reference is the microcontroller's datasheet.
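To illustrate what those macros do, here is a host-side C++ sketch. It does not use avr-libc: `bv()` is a hypothetical stand-in for `_BV()` (which avr-libc defines as `(1 << (bit))`), and a plain variable plays the role of `PORTB`:

```cpp
#include <cstdint>

// Host-side stand-in for avr-libc's _BV(bit): a byte with only `bit` set.
constexpr uint8_t bv(int bit) { return static_cast<uint8_t>(1u << bit); }

// Bit 5 of port B drives pin 13 on the Uno (PB5 in avr-libc's headers).
constexpr int kPB5 = 5;

// Set or clear one bit of an 8-bit "register", leaving the others intact,
// as PORTB |= _BV(PB5) and PORTB &= ~_BV(PB5) do on the real hardware.
uint8_t setBit(uint8_t reg, int bit) {
    return static_cast<uint8_t>(reg | bv(bit));
}

uint8_t clearBit(uint8_t reg, int bit) {
    return static_cast<uint8_t>(reg & ~bv(bit));
}
```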

Question 7

There are two methods:

1. Profiling.
2. Clock counting.

The first method involves recording timestamps at different points in your program and calculating the time difference between them. That's the simplest, but not always the most accurate.

The second method is far harder. It involves disassembling the program after you have compiled it (or obtaining the assembly language from partway through the compilation sequence), examining all the assembly instructions, looking them up in the instruction list, and totalling how many clock cycles each instruction uses. That gets very complex, especially when you have loops. However, it will tell you precisely how many clock cycles, and thus how long, a block of code will take to execute (not including interrupts, which always throw a big spanner in the works).
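The bookkeeping part of clock counting is just a sum over a lookup table. A host-side sketch, assuming a short hypothetical instruction listing (not real compiler output) and cycle counts taken from the AVR instruction set manual for a classic device like the ATmega328P (e.g. `in` and `andi`: 1 cycle, `sts`: 2 cycles, `ret`: 4 cycles). Branches and loops, whose cost depends on the path taken, are what make real listings harder.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Cycle counts for a few AVR instructions (classic core, 16-bit PC).
const std::map<std::string, int> kCycles = {
    {"in", 1},  {"out", 1}, {"andi", 1}, {"ldi", 1},
    {"lds", 2}, {"sts", 2}, {"sbi", 2},  {"cbi", 2},
    {"rcall", 3}, {"call", 4}, {"ret", 4},
};

// Total cycles for a straight-line sequence of instructions (no branches).
int countCycles(const std::vector<std::string>& listing) {
    int total = 0;
    for (const auto& mnemonic : listing) {
        total += kCycles.at(mnemonic);  // throws std::out_of_range if unknown
    }
    return total;
}
```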

Question 8

To efficiently measure the time a block of code takes, run it, say, 1,000 or 1,000,000 times and divide the total time by the number of iterations.

In some cases, initialization or variables may get cached, but in principle the times are quite accurate. You can easily check this by running the test, e.g., 1,000 and 2,000 times and verifying that the measured times also differ by a factor of 2.

For the time counter, use an `unsigned long` to accommodate values above 65,535 (in ms or µs, whichever unit is used).
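Both the averaging and the 1,000-vs-2,000 check are simple arithmetic on the measured totals. A host-side sketch, assuming the totals were obtained with `micros()` and stored as 32-bit unsigned values; the helper names and the 2 % tolerance are hypothetical choices:

```cpp
#include <cstdint>

// Average time per iteration from a single batch: total / iterations.
double perIteration(uint32_t totalMicros, uint32_t iterations) {
    return static_cast<double>(totalMicros) / iterations;
}

// From two batches of N and 2N iterations: the difference of the totals is
// the time of N extra iterations, so any fixed cost (such as the calls to
// micros() themselves) cancels out. Assumes total2N >= totalN.
double perIterationDifferential(uint32_t totalN, uint32_t total2N, uint32_t n) {
    return static_cast<double>(total2N - totalN) / n;
}

// Linearity check: the 2N batch should take about twice as long.
bool scalesLinearly(uint32_t totalN, uint32_t total2N, double tolerance = 0.02) {
    double ratio = static_cast<double>(total2N) / totalN;
    return ratio > 2.0 - tolerance && ratio < 2.0 + tolerance;
}
```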

Question 9

Keep in mind that managing the loop carries some overhead, typically about 4 CPU cycles per iteration for a 16-bit loop counter. You will have to subtract this from your result if you want sub-µs accuracy.
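Subtracting it is straightforward once the total is expressed in cycles. A host-side sketch, assuming a 16 MHz clock and the 4-cycle figure quoted above (the real figure depends on the loop counter type and on the compiler); the function name is hypothetical:

```cpp
#include <cstdint>

// Cycles spent in the loop body alone, per iteration: convert the total
// time to cycles, average over the iterations, then remove the
// per-iteration loop bookkeeping.
double bodyCyclesPerIteration(double totalMicros, uint32_t iterations,
                              double cpuHz = 16e6,
                              double loopOverheadCycles = 4.0) {
    double totalCycles = totalMicros * cpuHz / 1e6;
    return totalCycles / iterations - loopOverheadCycles;
}
```

For example, 1,000 iterations taking 6,000 μs amount to 96 cycles each, 92 of which are spent in the body.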

Edgar Bonet · Accepted Answer · 2017-10-23 17:53:12Z