The Cloud App Optimization Paradox

Ask anyone how fast they want their app to run and they’ll say as quickly as possible. Ask how much they want to spend per transaction and they’ll say as little as possible. Ask how often they optimize resource allocation and parameter settings and they’ll say they test performance against the spec when they release a new service. Ask why, if cost and performance are so important, they only test once, and you’ll likely hear silence.

Optimization can improve efficiency (cost/performance) dramatically. Selecting the correct values for the OS, the middleware layer, the autoscaler, and the cloud or container platform can cut costs 30-50% and/or improve performance by over 100% without changing any code or negatively affecting the running service.

Optimization Isn’t Being Done

Despite the opportunity, most teams simply select a large instance and leave the defaults for most OS and middleware settings. As long as the system meets spec, most engineers don’t try to optimize beyond tweaking the initial settings. Aside from a few outliers like Google, Facebook, and Netflix, hardly anyone commits resources to take advantage of this opportunity. Why, given the possible returns, do so few companies try to optimize?

It’s Complicated

Before cloud computing, servers were expensive, were rationed, required approvals, took weeks to provision, and were configured and managed by operations teams. Most development engineers never touched them. Now servers are VM instances or containers and any engineer with privileges can instantly add or change them. More importantly, the development engineer is now frequently responsible for the configuration of the server instance running her code. But, are engineers trained for this new task?

Many engineers who’ve graduated since the introduction of cloud computing have likely never configured a server. They’re not aware of most settings and lack experience adjusting them. VM and container instances expose hundreds of settings that affect performance and cost, and changing settings for one part of the app affects the performance of the whole system. Even a relatively simple app with five microservices can present trillions of setting permutations. Although new tools have cropped up over the past decade to implement settings, none help engineers determine the right settings.
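A back-of-the-envelope count shows how quickly the search space explodes. The numbers below are illustrative assumptions, not measurements from any particular stack:

```python
# Illustrative count of tuning permutations for a small microservice app.
# All counts below are assumptions chosen for the example.

settings_per_service = 5   # e.g. CPU, memory, heap size, GC policy, thread pool
values_per_setting = 4     # a few coarse candidate values per setting
services = 5               # microservices in the app

per_service = values_per_setting ** settings_per_service  # 4^5 = 1,024
total = per_service ** services                           # 1,024^5 = 2^50

print(f"{per_service:,} combinations per service")
print(f"{total:.2e} combinations for the whole app")  # ~1.13e15: over a quadrillion
```

Even with only five settings per service and four values each, the combined space already exceeds a trillion combinations; realistic instances expose far more knobs than that.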

It’s Frustrating

When performance issues are serious enough to embark on an effort to optimize, “best practice” is to create a project and bring in a performance team to diagnose and address active pain points. These projects are long and expensive and the results can be ephemeral because optimization needs to be restarted every time new code is released. As the CTO of one SaaS decacorn put it, “We tried an optimization project on our customer portal and the results were impressive.  After nine weeks of iterative refinement we had cut response time in half and increased the number of concurrent users by 42%. Unfortunately, as soon as we updated that app we were back to where we started.”

It’s Someone Else’s Problem

Cost and performance are two fundamental, deeply related KPIs presenting real pain to enterprises large and small. Improvements in latency directly impact the customer experience. Cost reductions directly impact the bottom line. Often, though, ownership isn’t clear. Who owns these metrics in your weekly stand-up?

No reasonable CTO is going to micromanage their teams’ decisions about which instance to choose or how to configure it. No annual review is going to ding someone for over-provisioning a microservice or choosing the wrong settings for the load balancer – but these decisions directly affect performance.

At the same time, few if any developers, members of the DevOps team, or even their managers know how much is spent on compute or how much is spent by the code they deploy. Many devs will choose the largest available VM “just in case”, never even thinking about right-sizing. It only costs a couple of pennies more per instance — pennies that pile up into millions of dollars.
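How pennies become millions is simple arithmetic. The fleet size and per-hour waste below are hypothetical, chosen only to show the scale:

```python
# Hypothetical numbers: how a few cents of over-provisioning per
# instance-hour compounds across a large always-on fleet.

extra_cost_per_hour = 0.03   # assumed $0.03/hr of wasted headroom per instance
instances = 5_000            # assumed fleet size
hours_per_year = 24 * 365    # 8,760 hours

annual_waste = extra_cost_per_hour * instances * hours_per_year
print(f"${annual_waste:,.0f} per year")  # $1,314,000 per year
```

Three cents an hour across five thousand instances is over $1.3 million a year, without anyone ever seeing a line item for it.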

Here is the Optimization Paradox. Getting resource settings and parameters right can double the performance of your application while saving millions. These gains are available without compromising features, they do not disrupt the product roadmap, and they do not require code changes. Optimization is the lowest-hanging fruit for doubling app efficiency, and yet it is rarely done because most organizations don’t appreciate how much they can gain.

Even when the gains are in sight and optimization has a strong evangelist, teams discover that the effort doesn’t fit nicely into a weekly scrum format. Agile methodology is poorly suited to a problem that requires continuous effort.

So how do you solve a problem that doesn’t match your team’s skill set or your agile workflow, and address trillions of potential combinations that need to be reassessed every time new code is released, without hijacking the DevOps team or locking down your code? We figured this is a perfect use case for Machine Learning.

Process, Not Projects

Optune uses Artificial Intelligence to overcome the complexity of performance tuning. Leveraging your existing CI/CD automated deployment pipeline, the system tests setting combinations and quickly learns how to improve the efficiency of your application. Changes such as new code releases, new instance types from your cloud provider, or new traffic patterns cause Optune to adjust settings. Because it is integrated with your CI/CD system, new changes take effect automatically. With each iteration, the system’s predictions home in on the optimal solution, and as improvements are found they can be automatically promoted.
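The shape of such a tune-measure-promote loop can be sketched in a few lines. This is a generic illustration, not Optune’s actual algorithm; the `measure()` function stands in for a load test run through a CI/CD pipeline and is replaced here by a toy cost model so the example is runnable:

```python
# Generic sketch of an iterative tuning loop: propose a candidate
# configuration, measure it, and promote it only if it improves on the
# current best. NOT Optune's actual algorithm -- purely illustrative.

import random

random.seed(0)  # for reproducibility of this sketch

def measure(settings):
    # Toy objective standing in for a real load test: compute cost rises
    # with allocated CPU, latency penalty rises when CPU is too low.
    # Lower score is better; the minimum of x + 4/x is at x = 2.
    cpu = settings["cpu"]
    return cpu + 4.0 / cpu

best = {"cpu": 8.0}          # start from a deliberately over-provisioned config
best_score = measure(best)

for _ in range(200):
    # Propose a nearby candidate and measure it.
    candidate = {"cpu": max(0.5, best["cpu"] + random.uniform(-1, 1))}
    score = measure(candidate)
    if score < best_score:   # promote only configurations that improve the KPI
        best, best_score = candidate, score

print(best, round(best_score, 3))  # converges near cpu = 2.0, score ~4.0
```

A production system would replace the random perturbation with a learned model and the toy objective with real business metrics, but the promote-only-improvements loop is the same.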

Focus On Business Metrics

To provide the maximum impact for your business, Optune works at the service scope rather than focusing on individual container or virtual-machine metrics. Optimizing performance across an entire service yields superior results over systems focused on single-container or virtual-machine utilization. Optune can balance the performance and resource assignments of multiple components to create customer-visible benefits, such as reduced latency, greater transactions per second, more simultaneous users, or lower cost. Optune enables greater responsiveness to business needs like customer satisfaction and budget constraints.
