This is a part of our Serverless DevOps series exploring the future role of operations when supporting a serverless infrastructure. Read the entire series to learn more about how the role of operations changes and the future work involved.
Ops never goes away...
Now let’s talk about what operations people bring to a team. These skills add depth to a product pod’s overall skill set. But there’s also a skill, coding, that we’re going to need if we haven’t yet developed it. From there, let’s talk about some of the areas of operations for serverless applications and how the work evolves when you go serverless.
Essential Ops Skill Set
In order to be effective in their new role, what skills will be required of the operations engineer? There are a variety of skills that operations people have acquired over time that will serve the team well. But there are also skills many of us are weak in.
Let’s break this down into two sets:
- Skills currently possessed by the typical operations engineer today.
- Skills many operations people will need to level up or possibly begin acquiring.
The Skills Operations Engineers Bring
Let's start by running through the skills operations engineers already bring. Historically these have been the primary skill set of the role.
As operations engineers, we develop expertise in platforms and tooling. In an environment running on AWS, we develop expertise in automation tools, such as CloudFormation and Terraform. These are tools with a steep initial learning curve.
We also have expertise in aspects of AWS; for example understanding the idiosyncrasies of different services. Lambda, for instance, has a phenomenon called "cold starts" and its CPU scaling is tied to function memory allocation.
Systems engineering skills come into play when designing, building, operating, scaling, and debugging systems. As operations engineers, we're required to develop an understanding of the overall system we're responsible for and, in turn, spot weaknesses and areas for improvement, and understand how changes may impact a system's overall reliability and performance.
Finally, there are people skills. While computers are fairly deterministic in their behavior, people are not. That makes people hard and we need to stop calling the ability to work with them a “soft skill.”
Much of the work of operations engineers is figuring out the actual problem a person is trying to solve based on the requests people make to them. A good operations team is a good service team. They understand the needs of those they serve. People skills and the ability to get to the root of a user's problem will be complementary to the product manager's role on the team.
The Skills We Need
Here are the skills we’ll need to level up on or acquire wholesale. Warning: because the operations role has evolved haphazardly in different directions under competing definitions of DevOps, these skills are unevenly distributed.
Keep in mind the ops person serves as a utility player on the team. They’re not there to be just another developer. Being proficient in a language means being able to read it, code light features, fix trivial bugs, and do some degree of code review. If you’re quizzing operations job candidates, or even existing operations employees, on B-tree implementations or FizzBuzz solutions, you’re probably doing it wrong. If you’re asking them to code a basic function, or to read a code block, explain it, and point out potential system failure points, however, then congratulations! You’re doing it right.
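As a hypothetical example of the level being described here, consider a small utility function like the one below. An ops person should be able to read it, explain what it does, and note the failure points (what happens with a huge input list in memory, why the 25-item limit exists). The function name and the DynamoDB framing are illustrative, not from any particular codebase.

```python
def batch_records(records, batch_size=25):
    """Split a list of records into batches, e.g. for DynamoDB's
    batch_write_item API, which accepts at most 25 items per call."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Slice the list into consecutive chunks of batch_size items;
    # the final chunk may be smaller.
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
```

Explaining that the last batch may be short, and asking what happens if `records` doesn't fit in memory, is exactly the kind of systems-minded code reading the role calls for.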
For those who are code-phobic, don’t fret. You won’t have to snake your way through the large code base of a monolith or microservice. By isolating work into nanoservices, the code should be easier to comprehend.
Areas of Responsibility
Once an operations engineer joins a product pod, what do they do day in and day out? What will their job description be? What tasks will they assume, create, and prioritize? We need to be able to articulate a set of responsibilities and explain the role. If you can’t explain your job, then someone else will do it for you. And that could mean your job is explained away with “NoOps” as its replacement. Let’s lay out some areas of responsibility operations people should be involved in when it comes to supporting and managing serverless systems.
Serverless systems are complex systems and in order to remain manageable they need to be well architected. The ability to quickly deliver a solution also means the ability to quickly deliver a solution that looks like it's composed of duct tape, gum, and popsicle sticks. Or, it looks like something out of a Rube Goldberg machine.
Just take a simple task like passing an event between the components of a serverless system. Most people will do what they know, and stick to what they've previously done and are comfortable with. But even such a simple task can be solved in a variety of ways that may not occur to an engineer. When do you use SQS over SNS, or even both in conjunction? Maybe a Lambda function fanout? Perhaps event data should be written to persistent storage first, such as S3 or DynamoDB, with the events from those writes used to trigger the next action?
"Generally we should use SQS or Kinesis instead of SNS when writing to DynamoDB in this sort of application. They allow us to control the rate of writes. With SNS we're more susceptible to failures due to DynamoDB autoscaling latency."
What’s the best choice? The problem you’re trying to solve, your requirements, and your constraints will dictate the best choice. (As an aside, I'm very opinionated about Lambda function fanouts and will ask why you feel the need to handle function retries yourself. There are valid use cases but IMHO, few.) Serverless systems introduce significant amounts of complexity; that can't be denied. Couple that with the multitude of ways that even a simple task can be solved and you have a recipe for unmanageable systems.
The operations person should be able to explain what architecture patterns are most appropriate to solve the given problem. Start by ruling out patterns and then explaining the remaining options and their pros and cons. And, explain your reasoning to other engineers so they can make the best choice on their own the next time.
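To make the "write to persistent storage first" pattern concrete, here is a minimal sketch of a Lambda handler triggered by S3 `ObjectCreated` events rather than invoked directly. The function and event shapes follow S3's documented notification structure; the processing itself is a stand-in (a real handler would fetch each object with boto3 and act on it).

```python
def handle_s3_event(event):
    """Hypothetical Lambda handler for the persist-first pattern: an upstream
    component writes event data to S3, and this function is triggered by the
    resulting s3:ObjectCreated notification."""
    tasks = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # A real function would fetch the object with boto3 and process it;
        # here we just collect the (bucket, key) pointers to show the event shape.
        tasks.append((bucket, key))
    return tasks
```

The appeal of this pattern is that the event data is durable before any processing starts: if the function fails, the source of truth is still sitting in S3 and the work can be retried.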
Metrics, monitoring, observability, and logging: as operations engineers we'll continue to own primary responsibility for these. What will change is the issues we'll be looking for. Host CPU and memory utilization won't be your concerns. (Beyond, that is, the cost associated with higher-memory invocations and the impact on your bill.) Now you'll be looking for long function execution times and out-of-memory exceptions.
The operations person will be responsible for selecting and implementing the proper tooling for solving these issues when they surface. If a function has started to throw out-of-memory exceptions, the ops person should own the resolution of the issue. The solution may be as simple as increasing function memory, or it may mean working with a developer to refactor the problematic function.
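One place this shows up is in the `REPORT` line Lambda writes to CloudWatch Logs for every invocation, which includes both the configured memory size and the maximum memory actually used. A small sketch like the following (the field names match Lambda's log format; the threshold is an arbitrary assumption) can flag functions creeping toward their limit before they start failing:

```python
import re

# Matches the memory fields of a Lambda REPORT log line, e.g.
# "... Memory Size: 128 MB  Max Memory Used: 125 MB"
REPORT_RE = re.compile(
    r"Memory Size: (?P<size>\d+) MB\s+Max Memory Used: (?P<used>\d+) MB"
)

def near_memory_limit(report_line, threshold=0.9):
    """Return True if a REPORT line shows memory usage at or above `threshold`
    of the allocated memory -- an early warning of out-of-memory errors."""
    m = REPORT_RE.search(report_line)
    if not m:
        raise ValueError("not a REPORT line with memory fields")
    return int(m.group("used")) / int(m.group("size")) >= threshold
```

Feeding your log stream through a check like this turns "the function OOMed in production" into "the function is at 92% of its allocation, let's bump it now."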
There's also a phenomenon known as "cold starts" on Lambda, which occurs when a function's underlying container (there aren't just servers but also containers in serverless) is first deployed behind the scenes. You'll need to observe their impact on your system as well as determine whether changes need to be made to alleviate them.
"While this service sees a certain percentage of cold starts, its data isn't consumed by the user in real time. Cold starts have no actual user impact. Our engineering time is better spent worrying about the smaller number of cold starts in this other system, which is user interactive."
Your operational expertise and knowledge of the system as a whole will help inform you of whether these need to be addressed since cold starts aren't necessarily problematic for all systems.
These are just some of the things that can go wrong with a serverless system; the list is by no means exhaustive. And often, something we as ops people are tasked with responding to is something we didn't even initially think of. So instead of continuing to enumerate all the possible ways that systems can fail, let's spend more time on how to catch and respond to failure.
Keep in mind, even the simplest serverless application is a distributed system with multiple services that are required to work with one another. Your systems engineering knowledge is going to come in very handy.
Performance and Maintaining SLOs
As your application incurs greater usage, can you still maintain your established service level objectives (SLOs)? And when you're tasked with performance tuning a running system to meet a new SLO target, what do you do?
There will be work rearchitecting systems, replacing individual services with different ones that better match new needs. Or maybe the system as a whole and its event pathways need rethinking? Different AWS services have different behaviors and characteristics, and the right initial choice may no longer be the best one.
"We're going to need ordered messages and guaranteed processing here to accommodate this requirement. That means we'll need to swap SNS for Kinesis here."
Evaluating code changes will also be a part of achieving new scale and performance goals. You'll start by using the growing set of performance and reliability tools in the space to find and refactor inefficient functions.
You may even find yourself rewriting a single function in a different language. Why? Because past a certain Lambda memory allocation an additional CPU core becomes available, and your current language may not be able to utilize that core. The operations person will be responsible for knowing facts like this. (The ability to rewrite a single nanoservice function, compared to rewriting an entire microservice, is pretty cool!)
We're used to scaling services by adding more capacity in some way. Either we scale a service vertically (giving it more CPU, RAM, etc.) or we scale a service horizontally (adding more instances of the service.) When it comes to scaling work we're used to either migrating a service to a new larger host or adding additional hosts.
But serverless has scaling, typically horizontal scaling, built in. That's one of its key characteristics; it's part of what makes something serverless! So what's left to scale?
The ability to rapidly scale creates its own class of issues, such as thundering herd problems. How will a change in scale or performance affect downstream services? Will they be able to cope? A lot of what's been discussed in the previous sections with regard to performance and reliability is actually a byproduct of serverless systems' ability to scale so easily.
You're no longer responsible for scaling individual services; you're responsible for scaling entire systems. Someone will need a holistic view of, and responsibility for, the system as a whole so they can understand how upstream changes will affect downstream services.
Finally, let’s touch on security. There’s a regular debate among engineers about whether giving up control over security responsibilities makes you more or less secure. I take the position that offloading those responsibilities to a reputable cloud provider such as AWS makes you more secure. Why? Because they can focus on issues lower in the application stack, which frees you to focus your attention elsewhere. What does that mean in practice?
To begin, you’re no longer responsible for host-level security. In fact, let’s take that statement a step further. You’re not responsible for virtualization hypervisor security, host-level security, or container runtime security. That is the responsibility of AWS, and AWS has proven to be effective at it. So now that you’re not responsible for this, what are your responsibilities?
To start, there are the cloud infrastructure controls. This mostly boils down to access controls. Does everything that should have access to an AWS resource have access to it, and is access limited to only those that need it? That starts with ensuring S3 buckets have proper access controls and you’re not publicly leaking sensitive data. But it also means ensuring a Lambda function that writes to a DynamoDB table only has permission to write to that table; not to read from it or write to other tables in your environment.
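A least-privilege policy for that Lambda-writes-to-one-table case might look like the sketch below (expressed as a Python dict you'd attach via IAM tooling; the table name, region, and account ID are all hypothetical). The points to notice: the action list contains only `dynamodb:PutItem`, no read actions, and the resource ARN names one table with no wildcards.

```python
# Hypothetical least-privilege policy: this function may write to the
# "orders" table and do nothing else. No Get/Query/Scan actions, and no
# wildcard in the resource ARN.
ORDERS_WRITER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:PutItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        }
    ],
}
```

Reviewing function policies for exactly these two properties, minimal actions and non-wildcard resources, is a concrete, recurring piece of serverless ops work.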
If AWS is handling the bottom of your technical stack where should you be focusing? Up towards the top of course! That means greater emphasis on application security. Don’t have to monitor and patch host vulnerabilities anymore? Good, go patch application vulnerabilities such as those in your application’s dependencies.
Finally, you’ve heard about SQL injection, where SQL queries are inserted into requests to web applications, resulting in the query’s execution. Now think about event injection, where untrusted data is passed in Lambda trigger events. How does your application behave when it receives this data? There’s a wealth of attack techniques to rethink and re-examine for how they apply to serverless applications.
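The defense is the same as it always was: validate untrusted input at the boundary before it touches anything else. Here is a minimal sketch of that idea in a Lambda context; the event fields, limits, and handler name are all hypothetical, but the principle is to treat event data with the same suspicion as query parameters in a web request.

```python
def process_order_event(event):
    """Hypothetical handler showing defensive validation of an untrusted
    trigger event before acting on it."""
    order_id = event.get("order_id")
    # Reject anything that isn't a plain alphanumeric string, closing off
    # injection payloads before they reach queries or downstream events.
    if not isinstance(order_id, str) or not order_id.isalnum():
        raise ValueError("invalid order_id")
    quantity = event.get("quantity")
    if not isinstance(quantity, int) or not 0 < quantity <= 1000:
        raise ValueError("invalid quantity")
    # Only validated values reach the rest of the system.
    return {"order_id": order_id, "quantity": quantity}
```

An event source you don't control (an S3 object key, an SNS message from another team) deserves exactly this kind of gatekeeping.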
Operations Doesn't End Here
This isn't the entirety of all the operations work that will be necessary to build and run serverless systems but it's a start. In coming chapters we’ll expand on some of the skills mentioned as well as some of the areas of responsibility.
Read The Serverless DevOps Book!
But wait, there's more! We've also released the Serverless DevOps series as a free downloadable book, too. This comprehensive 80-page book describes the future of operations as more organizations go serverless.
Whether you're an individual operations engineer or managing an operations team, this book is meant for you. Get a copy, no form required.