Container or VM resource settings can dramatically improve or destroy the efficiency (cost/performance) of cloud apps. Likewise, application settings such as garbage collection and buffer sizes can boost or reduce efficiency. Get them wrong and the combination of resources and application parameters can cripple your app and empty your wallet. Get them right and you can cut costs 30-50% and/or improve performance by over 100%. All of this is possible without changing a single line of code or negatively affecting reliability. Why, then, do so few optimization efforts succeed?
10 – It’s a project
I wrote earlier about why projects fail where process succeeds. I won’t rehash all that here, but suffice it to say projects aren’t reliable. Projects compete for human resources, so they seldom start on schedule and even less frequently finish on schedule. Plus, in the era of fully automated delivery pipelines, any optimization project will take longer than the typical release cycle. Hence, the results will be out of date before the project is even completed. Even if they’re implemented, the results will likely be far from optimal.
9 – It’s not on the roadmap
Having worked in product management for a few decades now, I can tell you that anything not on the roadmap seldom gets done. Performance tuning seldom makes the roadmap unless performance is truly broken and can’t be “fixed” by throwing more money at resources. Even when optimization makes it onto the roadmap, it isn’t given high enough priority to actually get into a sprint. The CFO may be screaming about costs, but sales will attach a potential customer name to each new feature, and as long as sales are growing that will push features to the top of the list.
8 – Complexity
Modern cloud native microservice architectures have hundreds or even thousands of elements to manage. Each has its own parameters and resource assignments. This makes tuning exceedingly difficult. The old grid search breaks down after a couple dozen permutations, and modern architectures frequently have trillions. Moreover, service elements may have interdependencies with other service components, which introduces added variability in the results. As I’ve noted in previous posts, performance tuning is no longer a job for humans.
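To get a feel for how quickly the search space outgrows grid search, here is a back-of-the-envelope sketch. The service count, parameters per service, values per parameter and minutes per test are illustrative assumptions, not measurements from any real system.

```python
# Back-of-the-envelope sizing of an exhaustive grid search.
# All numbers below are illustrative assumptions.

def grid_size(services: int, params_per_service: int, values_per_param: int) -> int:
    """Number of distinct configurations an exhaustive grid search must try."""
    total_params = services * params_per_service
    return values_per_param ** total_params


def years_to_test(configs: int, minutes_per_test: float) -> float:
    """Wall-clock years needed to test every configuration one after another."""
    return configs * minutes_per_test / (60 * 24 * 365)


# One service with two parameters, five candidate values each: 25 configurations.
small = grid_size(services=1, params_per_service=2, values_per_param=5)

# Five services with four parameters each: 5**20, roughly 95 trillion configurations.
large = grid_size(services=5, params_per_service=4, values_per_param=5)

print(small, large, years_to_test(large, minutes_per_test=5))
```

Even this modest setup, far smaller than a real microservice estate, lands in the tens of trillions.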
7 – Inadequate traffic generation
There are two ways to do performance tuning and most companies we engage with aren’t properly prepared for either. The first and most common is the use of traffic generators. However, for proper tuning you need to be able to load a system until it breaks. More importantly the traffic must mimic real world conditions including a mix of traffic types and errors.
Without a reliable mechanism for traffic generation and a way to account for variations in users and usage, you can only test whether the application logic and architecture are capable of adequate performance when not stressed. If you hear the term headroom used, it’s most often because this is the type of testing that was done, and the team doesn’t know what the failure modes look like or where they occur when the system is saturated.
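Loading a system until it breaks can be sketched as stepping the request rate up until a latency or error limit is violated. This is a minimal sketch, not a load-testing tool: `find_saturation`, `Sample` and `toy_system` are hypothetical names, and `toy_system` is a made-up model standing in for a real traffic generator driving a real service. It also covers only the ramp, not the realistic mix of traffic types and errors described above.

```python
from typing import Callable, NamedTuple, Optional


class Sample(NamedTuple):
    rps: int            # offered load, requests per second
    p99_ms: float       # observed 99th percentile latency
    error_rate: float   # fraction of failed requests


def find_saturation(measure: Callable[[int], Sample], start_rps: int, step_rps: int,
                    max_rps: int, p99_limit_ms: float,
                    error_limit: float) -> Optional[Sample]:
    """Step the load up; return the first sample that violates a limit, else None."""
    rps = start_rps
    while rps <= max_rps:
        sample = measure(rps)
        if sample.p99_ms > p99_limit_ms or sample.error_rate > error_limit:
            return sample
        rps += step_rps
    return None


def toy_system(rps: int) -> Sample:
    """Made-up model: latency climbs steeply near an assumed capacity of 1000 rps."""
    capacity = 1000
    utilization = min(rps / capacity, 0.999)
    p99_ms = 20.0 / (1.0 - utilization)
    error_rate = 0.0 if rps <= capacity else (rps - capacity) / rps
    return Sample(rps, p99_ms, error_rate)


breaking_point = find_saturation(toy_system, start_rps=100, step_rps=100,
                                 max_rps=2000, p99_limit_ms=250.0, error_limit=0.01)
print(breaking_point)
```

If the function returns `None`, the test never reached saturation, which is exactly the situation where all you can report is "headroom".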
6 – Lack of canary infrastructure
The second, and better, method for testing is to use canaries. Siphoning a small amount of production traffic to a unit under test ensures a real world distribution of traffic types. However, using canaries introduces new complexity: upstream noise and errors. How to deal with those is beyond the scope of this post, but I’ll try to write up something soon. To the point of this post, though, only a few of the companies we engage with have canary architectures in place.
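One common way to siphon a small, stable share of production traffic is to hash a request or user identifier into buckets and send the lowest buckets to the canary. The sketch below assumes that approach. The function name, the `req-N` identifiers and the 2% share are hypothetical, and a real setup would normally do this in the load balancer or service mesh rather than in application code.

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: float) -> bool:
    """Deterministically send roughly canary_percent of requests to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first four bytes of the hash onto buckets 0..9999 (0.01% each).
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < canary_percent * 100


# Simulated request IDs; in production these would come from real traffic.
sample_ids = [f"req-{i}" for i in range(100_000)]
share = sum(routes_to_canary(r, canary_percent=2.0) for r in sample_ids) / len(sample_ids)
print(f"canary share: {share:.2%}")
```

Because the decision depends only on the identifier, the same request or user always lands on the same side, which keeps comparisons between the canary and the baseline cleaner.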
5 – It’s time consuming
Engineers don’t want to do performance tuning. It’s not glamorous. To be a hero these days means moving a graph, and the output of performance tuning efforts seldom does that. In part that’s because the big savings often come from trying combinations way outside the expected solution set, which, as noted above, humans simply don’t have time to test.
4 – Lack of expertise
Performance tuning itself is complex but not complicated. Develop settings to test, implement, measure, repeat. However, there are many arcane settings available and some are quite old. My favorite, which I’ve written about before, is noatime. I’ve spoken with folks at some of the most advanced companies in tech and they not only know about noatime, they have standardized on a setting to leverage it for savings. If you’ve never heard of it, you’re not alone. Noatime stops the file system from updating a file’s access timestamp every time the file is read. In the era of cloud native apps and containers that no one ever logs into, what’s the point of updating access timestamps? Hence, save time by not doing it. Noatime, though, wasn’t created for cloud native apps. It goes back to the 70s, when writing to disk was so slow it could bring apps to a crawl, so we (yes, I am that old) turned off timestamps before doing a lot of disk access and then turned them back on. Noatime is just one of hundreds of settings your team has probably never heard about.
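If you want to see whether noatime is in use on a Linux host, the mount options are listed in /proc/mounts. The sketch below parses text in that layout and reports mount points that do not have the option set. The function name and the sample rows are made up for illustration.

```python
from typing import List


def mounts_without_noatime(mounts_text: str) -> List[str]:
    """Return mount points whose options do not include noatime.

    Expects text in the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, options = fields[1], fields[3].split(",")
        if "noatime" not in options:
            missing.append(mountpoint)
    return missing


# Made-up sample in /proc/mounts format.
sample_mounts = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data ext4 rw,noatime 0 0
"""
print(mounts_without_noatime(sample_mounts))
```

On a real Linux machine you could pass in the contents of /proc/mounts instead of the sample text, for example `mounts_without_noatime(open("/proc/mounts").read())`.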
3 – The best results often defy human logic
Recent advances in AI have produced new systems capable of teaching themselves chess or Go to master level in just days with no preprogramming of strategy. These systems quickly become so good they surpass not only humans but also all previous AI gaming systems even though they don’t look as many moves ahead and often don’t have the same processing power. The reason? They play in completely new ways, making moves that seem like losing strategies. In our work to implement AI for continuous optimization with clients one of the things that struck me from the start was how often the optimal solution made no sense in human terms. Only an AI will find these opportunities for improvement.
2 – Rapid code releases
When SaaS first emerged, one of the strategic advantages that got little attention initially was that it removed almost everyone between the engineer writing code and the customer using that code. Features could get into production in weeks or months instead of annual and bi-annual releases. SaaS has taken over the software world and deploying features quickly has become a religion. Agile, CI/CD and fully automated deployment followed. We routinely work with companies deploying code multiple times a day. And those that aren’t, want to. Performance tuning projects, though, take days or even weeks; by the time they’re done, the results are no longer optimal.
1 – They never get started
The biggest reason performance optimization fails is that very, very few companies actually try. This is the optimization paradox. Despite the opportunity, most teams simply select a large instance and leave the default for most settings. No one bothers to dig into that config file and try to understand the choices they have made. As long as the system meets spec, most engineers don’t try to optimize beyond tweaking the initial settings. Aside from a few very large companies like Google, Microsoft, Facebook and Netflix, no one seems to be committing resources to take advantage of this opportunity.
Silver Lining for Cloud App Optimization
A solution to the failures above does exist and the efficiency improvements available are substantial. Rather than implementing optimization as a project, simply add a Continuous Optimization PROCESS to your CI/CD toolchain. Doing so can boost performance while reducing costs. Because it is part of the toolchain, you’ll be assured every release is optimized as part of delivery. Plus, the rewards are so great most users achieve a positive ROI in less than a quarter.
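As a rough illustration of what a per-release optimization step could look like, here is a minimal sketch of the "develop settings to test, implement, measure, repeat" loop. It uses plain random search purely as the simplest stand-in; it is not the AI-driven approach described above. The names `optimize_release` and `toy_efficiency`, the two settings and the scoring model are all hypothetical. In practice the measurement would come from load tests or canaries.

```python
import random
from typing import Callable, Dict, List, Tuple

Settings = Dict[str, int]


def optimize_release(search_space: Dict[str, List[int]],
                     measure: Callable[[Settings], float],
                     baseline: Settings, trials: int,
                     seed: int = 0) -> Tuple[Settings, float]:
    """Try `trials` random settings; return the best found, never worse than baseline."""
    rng = random.Random(seed)
    best, best_score = baseline, measure(baseline)
    for _ in range(trials):
        candidate = {name: rng.choice(values) for name, values in search_space.items()}
        score = measure(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_efficiency(s: Settings) -> float:
    """Made-up model of performance per unit cost, standing in for a real measurement."""
    throughput = min(s["cpu"] * 100, s["heap_mb"] * 0.5)
    cost = s["cpu"] * 10 + s["heap_mb"] * 0.01
    return throughput / cost


space = {"cpu": [1, 2, 4, 8], "heap_mb": [256, 512, 1024, 2048, 4096]}
baseline = {"cpu": 8, "heap_mb": 4096}   # "select a large instance and leave the defaults"
best, score = optimize_release(space, toy_efficiency, baseline, trials=50)
print(best, round(score, 2), round(toy_efficiency(baseline), 2))
```

Run as a pipeline step, a loop like this starts from whatever is currently deployed and only ever proposes settings that measured better, so each release begins from the previous release's best known configuration.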