This article is part of the series
"Cloud-Based Embedded Testing: A Case Study."
Cloud testing of embedded software is an essential part of the development cycle, but it comes with its own set of financial challenges. To achieve effective cost optimization in cloud environments, it’s crucial to adopt strategies that balance performance needs with financial efficiency.
This article explores practical approaches to cost management in cloud testing, with a focus on optimizing resources in Amazon EC2 environments, minimizing overhead times, and determining the ideal number of instances for parallel testing. From leveraging less powerful instances to balancing the use of Spot and On-Demand instances, we delve into how these tactics can help maintain high testing standards without breaking the budget. This is the first in a four-part series dedicated to maximizing efficiency and minimizing expenses in cloud-based testing for embedded software.
Overhead times play a crucial role in cloud testing and should be kept as short as possible to save costs. Redundant tasks and calculations, as well as communication bottlenecks, should therefore be avoided.
Examples
Larger data sets can be copied to cloud storage in advance to avoid long transfer paths between the local computer and the instance.
Architectural diagrams (component and flow diagrams) provide a good overview: show clearly what happens at each step, then take time measurements for each process step. This reveals where bottlenecks occur and what could be optimized.
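To make such measurements concrete, here is a minimal Python sketch of how per-step timings could be collected; the step names and the pass-bodies are hypothetical placeholders for your own provisioning, upload, execution, and download logic.

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(step):
    # Record the wall-clock duration of one pipeline step.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start

# Hypothetical pipeline steps; replace the pass statements with real logic.
with timed("provision_instance"):
    pass  # e.g. boot the EC2 instance
with timed("upload_test_data"):
    pass  # e.g. fetch from cloud storage instead of the local workstation
with timed("run_tests"):
    pass  # the only genuinely productive step
with timed("download_results"):
    pass

overhead = sum(t for step, t in timings.items() if step != "run_tests")
print(f"overhead: {overhead:.2f}s, test run: {timings['run_tests']:.2f}s")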
Just like with physical machines, one might be tempted to adopt the mindset of "having more is better than needing it" when configuring cloud servers. However, in cloud testing this can quickly and unnecessarily escalate costs, because more power doesn't always equal faster performance.
What ultimately matters is obtaining results as quickly as possible, and this is where parallelization comes into play: if two less powerful instances deliver results just as fast as one powerful instance at a lower overall cost, the more economical option is the right choice.
Experiment with the available instance types to determine the necessary resources for test runs. With the data collected, it’s easy to calculate the optimum balance between performance and costs. One likely surprise will be how few resources are actually needed.
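As an illustration, a small sketch of such a comparison in Python; the instance types, hourly prices, and runtimes below are placeholder values, not current AWS prices, so measure and look up your own.

# Hypothetical measurements: runtime of the same test suite per instance type.
candidates = {
    # type:       ($/hour, measured runtime in hours)
    "c5.large":   (0.085, 2.0),
    "c5.xlarge":  (0.170, 1.1),
    "c5.2xlarge": (0.340, 0.9),
}

for itype, (price, runtime) in candidates.items():
    print(f"{itype:11s} runtime {runtime:.1f} h, cost ${price * runtime:.3f}")

# Doubling the resources rarely halves the runtime: here the 2xlarge costs
# about 60 percent more per run than the xlarge for a marginal speed-up.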
Theoretically, it’s always possible to start up more instances and thus reduce the duration of test execution. However, costs continue to rise, and at an accelerating rate, the closer the two times (overhead and test run) get to each other.
Additional instances bring no further acceleration once one of the following limits is reached (a small decision helper is sketched after the two limits):
Limit 1 – "Reduction of Run Time": Once the runtime per instance is on the order of the overhead (t_testrun_per_instance ≈ 10 × t_overhead), refrain from improving the runtime further with additional instances; from this point on, costs increase very quickly without a corresponding time gain. Limit 1: t_testrun_per_instance <= 10 * t_overhead
Limit 2 – "Practicality": Are the test results really needed faster than within one hour? We recommend stopping the instantiation of additional instances once t_testrun_per_instance ≈ 1 h is reached.
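A minimal helper that encodes both limits could look like this; the threshold values come straight from the two limits above:

def should_add_instances(t_testrun_per_instance_h, t_overhead_h):
    # Limit 1: stop once the per-instance runtime approaches the overhead.
    if t_testrun_per_instance_h <= 10 * t_overhead_h:
        return False  # costs rise steeply from here without a time gain
    # Limit 2: stop once results arrive within roughly one hour.
    if t_testrun_per_instance_h <= 1.0:
        return False  # faster than 1 h is rarely worth the extra cost
    return True

# Example: 3 h runtime, 5 min overhead -> parallelizing further still pays off.
print(should_add_instances(3.0, 5 / 60))  # True
print(should_add_instances(0.7, 5 / 60))  # False (both limits reached)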
Consider the following scenario: a change was made to a unit and should be verified by a unit test, followed by a procedural integration test. Could the integration test not simply run in parallel with the unit test to save time? This entails a cost risk: if the unit test fails, the integration test becomes obsolete, and costs have been incurred for unusable results.
But there is the option to reuse the instance of the unit test and thus save the overhead, i.e., the unproductive time of an instance: before the end of the unit-test run, the necessary data (software, test data, etc.) are already loaded onto the instance and the execution of the integration test is initiated, while the results of the unit test are downloaded at the same time.
Of course, the question arises: if the results are already being downloaded, how do we know whether the test run was successful? The solution is to continuously monitor the running instance, since most results are known before the end of the run. The risk that an integration test starts after the preceding unit test has failed is then low.
In the end, the additional cost risk is close to zero.
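A sketch of this reuse pattern in Python, with all tooling functions stubbed out as hypothetical placeholders:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholders for your own tooling.
def run_unit_tests(inst):        print(f"{inst}: running unit tests")
def download_results(inst, run): print(f"{inst}: downloading {run} results")
def upload_test_data(inst, run): print(f"{inst}: uploading {run} data")
def unit_results_ok(inst):       return True  # verdict known before the end
def run_integration_tests(inst): print(f"{inst}: running integration tests")
def shutdown(inst):              print(f"{inst}: shutting down")

def reuse_instance(inst):
    # Overlap the tail of the unit-test run with preparation of the
    # integration test, so the paid instance never sits idle.
    run_unit_tests(inst)
    with ThreadPoolExecutor() as pool:
        # Download unit results and upload integration data in parallel.
        d = pool.submit(download_results, inst, "unit")
        u = pool.submit(upload_test_data, inst, "integration")
        d.result()
        u.result()
    if unit_results_ok(inst):
        run_integration_tests(inst)  # reuse: no new boot overhead
    else:
        shutdown(inst)  # don't pay for unusable integration results

reuse_instance("i-0123456789")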
By carefully considering when the test results are truly needed, timeframes can be set.
Why use many instances to get test results within an hour when they are not needed that soon? For example: a test started on Friday evening at 6 PM whose results are only needed on Monday.
In CI environments, test runs can be organized automatically. With the right strategy, planning, and a little patience, costs can be massively reduced, as seen in the use of long-term rentals.
At many cloud providers, in addition to On-Demand, it’s possible to rent computing resources on a long-term basis. There are several payment models depending on the desired period of commitment.
Hybrid models are also possible: for example, by covering the base load with long-term rentals and using On-Demand for additional instances as needed. The cost savings compared to Spot Instances are of course much lower, but the setup is less complex.
AWS offers Spot Instances as a way to monetize unused computing capacity: data centers are rarely fully utilized, and the spare capacity is offered at a discount of 80 percent or more.
However, there is a risk that these instances may be terminated in favor of paying customers. Depending on the EC2 type, the probability of this happening is around 20 percent.
For a better assessment, let’s compare the costs of Spot Instances with On-Demand Instances using the following assumptions:
A shutdown risk of approximately 20 percent increases the number of necessary instances for a test run.
80 percent discount on Spot Instances compared to equivalent On-Demand Instances.
The total costs for using Spot Instances, TC_Spot, can be calculated with the following formula, where PAI is the percentage of additional instance demand explained below: TC_Spot = PAI × (1 − Discount) × TC_OnDemand
If instances are subject to automatic shutdown, you won’t have test results and you’ll need to restart those instances.
With a shutdown risk of Risk_Shutdown > 0%, when testing with Spot Instances, you always have an additional demand for instances (PAI) compared to testing with On-Demand Instances.
This percentage of additional demand for instances (PAI) describes the ratio of the number of Spot Instances to the number of On-Demand Instances for a complete test run and can be sufficiently approximated with the following formula:
Since the number of repetitions n for failed Spot Instances depends on the number of instances and the shutdown risk, we approximate the PAI with a mathematical trick: we calculate with an infinite number of repetitions, which gives the theoretical maximum PAI.
PAI = (20%)^0 + (20%)^1 + (20%)^2 + ... + (20%)^n → 1 / (1 − 20%) as n → ∞
PAI = 100% + 20% + 4% + 0.8% + ... = 125%
Let's make this more concrete with specific numbers.
With 100 instances, you have to expect 20 instance shutdowns. These 20 instances are restarted. Of the 20 instances, 20 percent are shut down again, so 4 instances. These are restarted again. And so on.
In total, you need 125 instances.
The number of necessary instances for a test run will increase. However, the cost of a Spot Instance is lower than that of an On-Demand Instance.
The percentage savings are calculated as follows: with an assumed discount of 80 percent, operating a Spot Instance costs only 20 percent, i.e., one-fifth, of the operating cost of an On-Demand Instance. Multiplied by the additional instance demand, the total comes to PAI × (1 − Discount) = 125% × 20% = 25% of the On-Demand costs. In our example calculation, you thus only have 1/4 of the costs.
In other words: you save 75 percent of the costs. Spot Instances are worth it!
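The whole calculation fits in a few lines of Python, assuming the 20 percent shutdown risk and 80 percent discount from above:

def spot_cost_ratio(risk=0.20, discount=0.80):
    # Geometric series: PAI = 1 + r + r^2 + ... = 1 / (1 - r)
    pai = 1 / (1 - risk)
    return pai * (1 - discount)

ratio = spot_cost_ratio()
print(f"PAI: {1 / (1 - 0.20):.0%}")  # 125%: 125 Spot per 100 On-Demand instances
print(f"Spot costs {ratio:.0%} of On-Demand, savings: {1 - ratio:.0%}")  # 25%, 75%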
The cost savings when testing with Spot Instances come at a price: interruptions and restarts increase the duration of a test run. How much longer it takes depends mainly on the initial number of instances and can again be calculated with percentages and the geometric series above.
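A rough sketch of this effect, under the worst-case assumption that terminated instances are restarted only after the current wave has finished; the number of extra restart waves then grows only logarithmically with the initial instance count:

import math

def restart_waves(n_instances, risk=0.20):
    # Waves of terminated instances shrink geometrically (n*r, n*r^2, ...)
    # and die out once a wave drops below one instance.
    return math.ceil(math.log(n_instances) / math.log(1 / risk))

for n in (10, 100, 1000):
    waves = restart_waves(n)
    print(f"{n:5d} instances -> ~{waves} extra waves, "
          f"duration up to ~{1 + waves}x a single run")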
For simplified calculation, we have created a Spot Instances Calculator.
Please note that using Spot Instances comes with specific risks and benefits. It’s important to consider the individual requirements and resources of the application before implementing the strategy.
The duration of using Spot Instances depends on the number of initial instances started. Additional iterations can arise from repeated restarts. The exact timing of instance terminations is uncertain, whether it happens right at the beginning or only towards the end. Continuous monitoring and automated restarting are very beneficial in this scenario.
This model can be cost-effective even if the exact duration is uncertain. If you don’t need the test results immediately, a recommended strategy is to combine Spot and On-Demand instances.
To do this, you need to know when valid results must be available, for example Monday at 8:00 AM. From this deadline you can calculate backwards to determine the latest point at which multiple parallel On-Demand instances must be started to obtain all test results. The time from test start until this latest possible deadline can be used for execution with Spot Instances; tests that have already completed in Spot mode no longer need On-Demand instances. The setup for planning and monitoring is somewhat more complex, but this hybrid model gives you both the necessary certainty and cost efficiency for your test execution.
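A minimal sketch of this backward calculation; the deadline, test duration, and overhead values are hypothetical:

from datetime import datetime, timedelta

deadline      = datetime(2024, 1, 8, 8, 0)  # results needed Monday 08:00
test_duration = timedelta(hours=3)          # one full run on On-Demand instances
overhead      = timedelta(minutes=30)       # boot, upload, download

# Latest moment at which the guaranteed On-Demand run must start:
on_demand_start = deadline - test_duration - overhead
print(f"Run on Spot until {on_demand_start:%A %H:%M}, "
      f"then start On-Demand instances for whatever is still missing.")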
It is advisable to keep track of changes in the payment terms.
The rate of instance shutdowns can vary depending on the country and time.
The likelihood of termination increases with the duration of the instance. It is advisable to adjust the cost-benefit optimum accordingly.
Implementing automation requires a deep understanding to avoid costly mistakes.
Test runs may contain misconfigurations, so it's important to monitor instances. If execution takes unusually long, it should be stopped manually or, in CI applications, automatically. Then secure the results, shut down the instances, and inform the operations and service team, for example via an automated abort email.
Monitor the test execution from the beginning and verify the result after each test case. If many tests fail, the execution should be aborted. The results should be saved, and all instances should be shut down.
Introduce a sensible threshold at which test runs are aborted. If a large number of tests fail, there is in most cases something wrong with the test execution, or there is a major bug in the code; then, not all test results are needed for the analysis anyway.
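A sketch of such a watchdog in Python; the threshold values are illustrative and should be tuned per project:

import random

def monitor_run(verdicts, abort_ratio=0.5, min_samples=20):
    # Evaluate verdicts as they stream in; abort once the failure
    # ratio crosses the threshold.
    passed = failed = 0
    for ok in verdicts:
        passed, failed = passed + ok, failed + (not ok)
        total = passed + failed
        if total >= min_samples and failed / total > abort_ratio:
            return False  # save results, shut instances down, notify the team
    return True

# Example: a stream where roughly 70 percent of tests fail -> abort early.
random.seed(1)
stream = (random.random() > 0.7 for _ in range(1000))
print("run completed" if monitor_run(stream) else "run aborted early")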
A Simulink model is to be tested, first in MiL (Model-in-the-Loop) and then in SiL (Software-in-the-Loop)? At first glance, this sounds like a good idea. But why test the model once and then the generated code again?
Our recommendation: skip the MiL run and test only the code.
This saves the execution of the model and the costs for licenses, while still producing meaningful results.
There is no such thing as a single, optimal test strategy. It is too strongly linked to the product, the requirements and the goals. We would be happy to help you develop a suitable strategy in an individual, free strategy discussion.