This is a part of our Serverless DevOps series exploring the future role of operations when supporting a serverless infrastructure. Read the entire series to learn more about how the role of operations changes and the future work involved.
Lambda compute is highly elastic, but your wallet is not.
While many engineers may not be that interested in cloud costs, the organizations that pay our salaries are. Cost is ever-present in their minds. The consumption-based — as opposed to provisioned capacity-based — cost of serverless creates new and unique challenges for controlling cloud spend. As your application's usage grows, your costs grow directly with it. But, if you can track costs and revenue down to the system or function level (a practice called FinDev), you can help your organization save money, and even grow.
Additionally, as we progress up the application stack, and therefore up the value chain, those of us in product or SaaS companies will look at revenue numbers for our team’s services. We typically treat cost alone as an important metric, when really it’s the relationship between cost and revenue that’s more important. An expensive system cost can be offset or made irrelevant by revenue generation. As members of product pods tasked with solving business problems, we’ll be responsible for whether or not what we’ve delivered has performed.
Cost Isn’t the Only Consideration
To start, focusing on cost alone isn’t helpful. It leads to poor decision-making because it’s not the entirety of your financial picture. Costs exist within the context of a budget and revenue.
If your cost optimization results in only fractions of a percent in savings on your budget, you have to ask if the work was worthwhile. Saving $100 a month matters when your budget is in the hundreds or low thousands per month. But it doesn’t really matter when your budget is in the hundreds of thousands per month.
You have to spend money to make money, as well. Your revenue generation from a product or feature could also make your cost optimization work largely irrelevant. If your product is sold on a razor-thin margin, then cost efficiency is probably going to count. But if it’s a high-margin product, then you’re afforded a degree of latitude in cost inefficiency. That means you’re going to have to understand to some degree how your organization works and how it generates revenue.
Stop looking at your work as just a cost; see it as work that generates part of your organization's revenue, too! After all, how long does it take to start realizing cost savings once you factor in developer productivity?
Serverless Financial Work
Serverless uses a consumption model for billing, where you only pay for what you use, and a change in cost month over month may or may not matter. A bill going from $4 per month to $6 per month doesn’t really matter. A bill going from $40k per month to $60k per month will probably matter. You can begin to see the added billing complexity that serverless introduces. Cost should become a first-class system metric. We should be tracking system cost fluctuations over time. Most likely we’ll be tracking system costs correlated with deploys so when costs do jump we’ll have the context to understand where we need to be looking.
Let’s start with the immediate financial challenges of serverless. Its consumption-based pricing, and the cost variability that comes with it, creates some new and interesting problems.
To start, it can be difficult to determine what a workload will cost to run on serverless, as opposed to in a container or on an EC2 instance. There’s a lot of conflicting cost information. You’ll find everything from serverless being significantly less expensive to serverless being significantly more expensive. The fact is, there’s no simple answer. But that means there’s useful work for you to do.
Determining the cost benefit of moving a few cron jobs, slack bots, and low-use automation services is easy. But once you try to figure out the cost of a highly used complex system, the task becomes harder. If you're attempting to do a cost analysis, pay attention to these three things:
- Compute scale
- Operational inefficiency
- Ancillary services
When it comes to compute, start by ensuring you're comparing apples to apples as well as you can. That means first calculating what the cost of a serverless service would be today as well as, say, a year from now based on growth estimates. Likewise, make sure you're using EC2 instance sizes you would actually use in production, as well as the number of instances required.
Next, account for operational inefficiency. An EC2-based service may be oversized for a variety of reasons. You may need only one instance for your service, but you probably have more for redundancy. You may have more than you need because of traffic bursts, or because no one has scaled the service back down after a traffic peak that has since passed.
Finally, think about ancillary services on each host. How much do your logging, metrics, and security SaaS providers cost per host per month? Accounting for all of these gives you a more realistic cost picture.
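The three considerations above can be sketched as a back-of-envelope calculator. Everything here is an illustrative assumption — the prices are the classic Lambda rates and a sample EC2 hourly rate, and the workload numbers are made up; substitute your region's current pricing and your real traffic before drawing conclusions.

```python
# Rough monthly cost comparison: Lambda vs. an EC2 fleet.
# All prices and workload figures are illustrative assumptions.

def lambda_monthly_cost(invocations, avg_duration_ms, memory_mb,
                        price_per_gb_s=0.0000166667, price_per_request=0.0000002):
    """Estimate monthly Lambda cost, rounding duration up to 100ms blocks."""
    billed_ms = -(-avg_duration_ms // 100) * 100            # round up to 100ms
    gb_seconds = invocations * (billed_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_s + invocations * price_per_request

def ec2_monthly_cost(instance_count, hourly_rate, ancillary_per_host=0.0):
    """Estimate monthly EC2 cost, including per-host SaaS agents (logging,
    metrics, security) rolled in as a flat monthly figure per host."""
    return instance_count * (hourly_rate * 730 + ancillary_per_host)

# Example: 10M invocations/month at 120ms on 512MB, versus three instances
# (redundancy included) at $0.096/hr plus $30/host of ancillary SaaS.
serverless = lambda_monthly_cost(10_000_000, 120, 512)
servers = ec2_monthly_cost(3, 0.096, ancillary_per_host=30.0)
```

Run the same functions with next year's projected invocation counts to get the growth-adjusted comparison the paragraph above calls for.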
"This will cost you more than $5 per month on EC2 because you cannot run this on a single t2.nano with no metrics, monitoring, or logging in production."
The major cloud providers release case studies touting the cost savings of serverless, and I've had my own discussions with organizations. I've seen everything from "serverless will save you 80 percent" to "serverless costs 2X as much". Both could be true statements, but the devil is in the details. Does the organization that saved so much money have a similar architecture to yours? Is the 2X cost a paper calculation or one supported by accounting?
The organization that gave me the 2X calculation followed up by noting that operational inefficiency and ancillary service costs ate up most of their potential savings, to the point that they considered serverless and EC2 roughly even. For that reason, they require a cost analysis and a convincing argument before deciding not to go serverless.
Next, let's talk about how to keep from wasting money. Because serverless is pay-per-use, reliability issues have a direct impact on cost. Bugs cost money. Let's look at how, with a few examples.
Recursive functions, where a function invokes another instance of itself at the end, can be a valid design choice for certain problems. For example, attempting to parse a large file may require invoking another instance to continue parsing data from where the previous invocation left off. But anytime you see one, you should ensure that the loop will exit, or you may end up with some surprises in your bill. (That has happened to me. Ouch.)
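A cheap safeguard is a hard depth counter carried in the event payload. This is a minimal sketch, not a definitive pattern: the `MAX_DEPTH` value, the `offset`/`depth` field names, and the stand-in `process_chunk` are all illustrative, and the re-invoke call is injected as a parameter so that in production you could plug in a real boto3 Lambda client invoke.

```python
# Sketch: a self-invoking Lambda handler with a hard recursion ceiling, so a
# bug in the "am I done?" check can't loop (and bill) forever.

MAX_DEPTH = 50  # generous ceiling for legitimate continuation (assumed value)

def handler(event, context, invoke_next=None):
    depth = event.get("depth", 0)
    if depth >= MAX_DEPTH:
        # Fail loudly instead of recursing into a surprise bill.
        raise RuntimeError(f"recursion limit {MAX_DEPTH} hit; aborting")

    offset = process_chunk(event.get("offset", 0))   # parse the next chunk
    if offset is None:
        return {"status": "done"}                    # normal exit: work finished

    # Continue from where we left off, carrying the depth counter forward.
    # In production, invoke_next would wrap lambda_client.invoke(...).
    if invoke_next:
        invoke_next({"offset": offset, "depth": depth + 1})
    return {"status": "continuing", "offset": offset}

def process_chunk(offset, total=5):
    """Stand-in for real file parsing: returns the next offset, or None when done."""
    return offset + 1 if offset < total else None
```

The key property: even if `process_chunk` never returns `None` due to a bug, the chain dies at `MAX_DEPTH` invocations instead of running until someone notices the bill.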
Lambda has built-in retry behavior, as well. Retries are important not just for building a successful serverless application, but for a successful cloud application in general. But each retry costs you money.
You might look at your metrics and see a function has a regular rate of failed invocations. You know from other system metrics, however, that the function eventually processes events successfully, and the system as a whole is just fine. While the system works fine, those errors are an unnecessary cost. Do you adjust your retry rate or logic to save money? Before you start refactoring, take some time to calculate potential savings from a refactor over the cost in time and effort.
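That savings-versus-effort calculation is easy to make concrete. The sketch below uses assumed numbers throughout — the failure counts, engineer rate, and refactor estimate are all placeholders for your own metrics:

```python
# Rough payback calculation for "should we refactor to eliminate these retries?"
# Every figure here is an assumption to replace with your own data.

def wasted_retry_cost_per_month(failed_invocations, avg_duration_ms, memory_mb,
                                price_per_gb_s=0.0000166667,
                                price_per_request=0.0000002):
    """Monthly cost of invocations that fail and get retried."""
    billed_s = (-(-avg_duration_ms // 100) * 100) / 1000   # round up to 100ms
    gb_s = failed_invocations * billed_s * (memory_mb / 1024)
    return gb_s * price_per_gb_s + failed_invocations * price_per_request

def payback_months(monthly_waste, refactor_hours, hourly_rate=100.0):
    """Months of savings needed to recoup the engineering time."""
    return (refactor_hours * hourly_rate) / monthly_waste

# Example: 2M failed invocations/month at 300ms on 256MB, one week of work.
waste = wasted_retry_cost_per_month(2_000_000, 300, 256)   # a few dollars/month
months = payback_months(waste, refactor_hours=40)          # decades to pay back
```

With these example numbers the retries waste only a few dollars a month, and a week of engineering time would take decades to recoup — which is exactly the kind of answer that should stop a refactor before it starts.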
There’s also potential architectural cost waste that can occur. If you’re familiar with building microservices and standardizing HTTP as a communication transport, then your first inclination may be to replicate that using API Gateway and Lambda. But API Gateway can become expensive. Does it make more sense to switch to SNS or a Lambda fan-out pattern (where one Lambda directly invokes another Lambda function) for inter-service communication? There’s no easy answer to that question, but someone will have to answer it as your team designs services.
Application Cost Monitoring
We should be monitoring system cost throughout the month. Is the system costing you the expected amount to run? If not, why? Is it because of inefficiency or is it a reflection of growth?
The ability to measure cost down to the function level — and potentially the feature level — is something I like to call application cost monitoring. To start, enable cost allocation tags in your AWS account. Then you can easily track cost-per-function invocation and overall system cost over time. Overlay feature deploy events with that data and you can understand system cost changes at a much finer level.
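Once cost allocation tags are active, AWS Cost Explorer can group spend by tag. Here's a small sketch of turning that grouped data into per-service totals. The tag name `service` is an assumption, and the canned `sample` dict merely mimics the shape of a `get_cost_and_usage` response; in production the response would come from `boto3.client("ce").get_cost_and_usage(..., GroupBy=[{"Type": "TAG", "Key": "service"}])`.

```python
# Sketch: summing Cost Explorer results grouped by a cost allocation tag
# (an assumed tag named "service") into cost-per-service totals.

from collections import defaultdict

def cost_by_tag(response):
    """Sum UnblendedCost per tag value across all time periods."""
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            # Cost Explorer encodes grouped tag keys as "tagkey$tagvalue".
            tag_value = group["Keys"][0].split("$", 1)[1] or "(untagged)"
            totals[tag_value] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)

# Canned response in Cost Explorer's shape, for illustration only.
sample = {"ResultsByTime": [
    {"Groups": [
        {"Keys": ["service$checkout"],
         "Metrics": {"UnblendedCost": {"Amount": "12.50", "Unit": "USD"}}},
        {"Keys": ["service$"],
         "Metrics": {"UnblendedCost": {"Amount": "3.10", "Unit": "USD"}}},
    ]},
]}
```

Emit these totals daily into your metrics system, tagged with deploy events, and the "costs jumped — which deploy?" question becomes a dashboard lookup.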
Picture how you scale a standard microservice running on a host or hosts. Your options are to scale instances either horizontally or vertically. When scaling horizontally you’re adjusting the number of instances that can service requests. With vertical scaling you’re adjusting the size of your instances, typically in relation to CPU and/or memory, so that an instance has enough resources to service a determined rate of requests. When the system falls out of spec in terms of performance, you right-size it by scaling in the appropriate direction.
Each feature or change to a microservice’s codebase usually has only a minimal effect on cost. (I say usually because some people like to ship giant PRs and watch their services be ground into dust under real load, requiring frantic scaling.) A new feature does not have a direct effect on cost unless it forces you to scale the service vertically or horizontally. It’s not individual changes, but the aggregate of them over time, that moves cost.
But with serverless systems it’s different. System changes are tightly coupled with cost. Make an AWS Lambda function run slightly slower and you could be looking at a sizable jump in cost. Because AWS bills Lambda duration rounded up to the nearest 100ms, a function that goes from averaging 195ms per invocation to 205ms is billed at 300ms instead of 200ms — roughly a 50 percent jump in duration charges. Additionally, increasing a function’s RAM raises its price per 100ms… but that might shorten invocation durations enough that you end up saving money. And these calculations don’t even account for situations where a system is reconfigured and new AWS resources such as SNS topics, SQS queues, and Kinesis streams are added or removed.
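The rounding effect is worth working through once with real numbers. This sketch uses the classic per-GB-second Lambda rate and shows duration charges only, ignoring the small per-request fee; the invocation count and memory size are illustrative.

```python
# The 100ms rounding effect made concrete (duration charges only).

PRICE_PER_GB_S = 0.0000166667  # classic Lambda rate; check current pricing

def duration_cost(invocations, avg_duration_ms, memory_mb):
    """Monthly duration charge with duration rounded up to 100ms blocks."""
    billed_s = (-(-avg_duration_ms // 100) * 100) / 1000
    return invocations * billed_s * (memory_mb / 1024) * PRICE_PER_GB_S

fast = duration_cost(1_000_000, 195, 1024)   # billed at 200ms per invocation
slow = duration_cost(1_000_000, 205, 1024)   # billed at 300ms per invocation
increase = (slow - fast) / fast              # a 50% jump from a 10ms slowdown
```

A 10ms regression in average duration produced a 50 percent increase in the duration bill — which is why deploy-correlated cost tracking matters.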
As you can see, cost needs to become a first-class system metric with serverless. We also need tools to help us model cost changes to our serverless systems. If cost monitoring and projection tell us about money already spent or about to be spent, the next topic brings together money spent, money allocated, and money generated.
The Future: FinDev
Application cost monitoring helps enable a new practice popularized by Simon Wardley called FinDev. We’re going to be including more than just cost and budgets in our engineering decisions. If we can track cost down to the system or even function level, can we take this a step further and track revenue generation down to that level? If we can, then we can include revenue, existing or projected, alongside cost and budgets to form a fuller financial picture of engineering effort and productivity.
What Is FinDev?
FinDev, at a minimum, requires bridging finance, PM (either product or project management), and engineering, both at the practitioner and leadership levels. We want to track cash flow through the organization so engineering efforts can be directed toward making decisions with the greatest business impact. This efficiency has the potential to become a competitive advantage over those who are unable to prioritize their engineering time toward providing the most value.
It starts with tracking revenue into an organization and mapping it back to engineering work and running production systems. If there’s a change in revenue month over month then why? Has revenue picked up because a new feature or service has gone to production? Has it decreased because a service is failing to perform adequately? With this relationship established, we can assign monetary values to systems. And now we can look at our technical systems with a much fuller financial picture surrounding them.
Now we can also establish a feedback loop in our organization. Revenue data and system value should be made available to engineering leadership and PMs in order to prioritize work properly. If you have two serverless systems exhibiting issues, which do you prioritize? If there’s a significant difference in value between systems, prioritize the more valuable one. If you’re evaluating enhancements, the question becomes even more interesting. Does your team prioritize generating more money out of an existing service, or does it prioritize enhancing an underperforming service?
Last but not least, engineering and PMs should close the loop by measuring success. Has engineering returned a system to its previous financially performant state? Has work decreased or increased revenue? Has a change increased revenue while its cost eroded profit margin? These are all interesting questions to spur the next cycle of engineering work.
Keep in mind, for many of the preceding questions there is no right answer. The data doesn’t make decisions for you. It helps you to make more informed decisions.
Applying FinDev Ideas Today
It should be noted that assigning value to systems and prioritizing work is not an entirely new concept. Many organizations already assign dollar values to systems and prioritize work based on that value. But the more granular we can get with assigning value, the more granular we can get within an organization’s hierarchy for prioritizing work.
Already, good product companies attempt to measure the success of what they deliver through metrics like user adoption and retention. Their next step is to understand revenue generation down to the technical level.
But even in non-product companies there will be room to apply these principles. For example, IT delivers a service that reduces friction in the sales process. Does that service lead to increased revenue for your organization? Imagine being able to answer that question through analytics before you even understand the reason why. Then picture how absurd it would be to deprecate a revenue-generating service using cost alone as a justification.
More To Define
There’s still a lot of room to figure out what the new processes and their implementation around FinDev will be. Those processes will be highly dependent on your organization and business. In addition, how to tie revenue generated down to the system or function level is still a largely unanswered question. There’s nothing out there in the market attempting to do that.
The processes and practices for combining finance with engineering are still a growing and evolving area. Keep in mind, this is all closely aligned with why we adopted DevOps: to become more efficient and provide more value to our organizations. FinDev is just an extension of that, and a new space that serverless opens up.
We're Not There Just Yet
Admittedly, right now we’re talking about small dollar amounts, and the cost savings over EC2 may already be enough that you don’t care at the feature level. In the future, however, as we become accustomed to the low costs of serverless, we will care more. Similarly, many organizations still invest heavily in undifferentiated engineering. But as companies that focus more heavily on their core business logic through serverless begin to excel, we’ll see more organizations become interested in achieving that level of efficiency.
There's still more in our Serverless DevOps series. Check it out!
Read The Serverless DevOps Book!
But wait, there's more! We've also released the Serverless DevOps series as a free downloadable book. This comprehensive 80-page book describes the future of operations as more organizations go serverless.
Whether you're an individual operations engineer or managing an operations team, this book is meant for you. Get a copy, no form required.