Serverless DevOps: The Work Of Operating Serverless Systems

This is a part of our Serverless DevOps series exploring the future role of operations when supporting a serverless infrastructure. Read the entire series to learn more about how the role of operations changes and the future work involved.

Serverless doesn't make operations go away; operations just changes.

We've covered some of the areas where operations knowledge is needed with serverless systems, including reliability, performance and architecture review. But what is the actual operations work involved? Let’s walk through some practical areas where an operations person should be involved.

What Will Ops Do?

The work described in this section is well suited for an operations engineer. It isn’t everything, of course, but it should give you an idea of an average work day. Use this as a base, and then expand the scope of responsibilities as you identify other work to be done. A lot of what you read will resemble work you’re already doing. This is why we sometimes call serverless operations “DifferentOps”.

Systems Review

Let's start with system reviews. Throughout the service delivery process, the operations engineer should be involved to ensure the most appropriate infrastructure decisions are made. If you've started your DevOps transformation, then you should be doing that already. This is what I mean by tearing down the silos and refraining from throwing engineering projects over the wall.

Take time to review your team's systems at every stage of the delivery lifecycle and look for potential issues. Use your experience operating systems, and your experience watching things go wrong, to find potential problems and suggest solutions. Educate the rest of your team so they can make better decisions in the future, or at least know the questions to ask.

System Monitoring

Let's start with the basics of reliability: monitoring. As you look at the architecture of a serverless system, you should be looking for how it might fail. Or, if this is after an incident, you should be looking for the reasons behind a service failure. If there are states the system should not be in, then you want to know about them.

You should be doing your best to ensure that when a system reaches an undesired state, an alarm, such as an AWS CloudWatch alarm, fires. You should be doing that already, but what's unique to serverless?

AWS CloudWatch alarms are just AWS resources, and that means you can manage them with CloudFormation. Don't divorce your serverless code and system configuration from your monitoring alerts. Don't have one in source control and the other manually configured in the AWS console. If you're using CloudFormation, AWS SAM, or Serverless Framework, then add CloudFormation resources for CloudWatch Alarms.

You can't be expected to review every single new service and put your stamp on it to ensure proper monitoring. In fact, that's probably not going to fly. Making yourself a gatekeeper to production removes one of the key values of serverless: speed of service delivery. Instead, enable the rest of your team to do that work on their own. As mentioned later in Build, Testing, Deploy, and Management Tooling, the rest of your team may not be as well versed in your tooling, particularly CloudFormation syntax.

One reason I like Serverless Framework is the ability to extend it through plugins. Take a look at serverless-sqs-alarms-plugin, for example. It simplifies the configuration of multiple SQS queue message size alarms. If you want to enable your teammates but they struggle with CloudFormation, write more plugins to abstract away the complicated parts and make configuration simpler for them.
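If you want a quick way to keep tabs on alarm coverage without making yourself a gatekeeper, a small audit script can flag gaps for the team. Below is a minimal sketch using boto3 that checks whether each Lambda function has an alarm on its Errors metric; the choice of metric is an assumption you'd adapt to your own monitoring standards.

```python
import boto3

# Minimal sketch: flag Lambda functions with no CloudWatch alarm on their
# Errors metric. The metric choice is an assumption -- adapt it to whatever
# your team's monitoring standard actually requires.
lambda_client = boto3.client("lambda")
cloudwatch = boto3.client("cloudwatch")

paginator = lambda_client.get_paginator("list_functions")
for page in paginator.paginate():
    for function in page["Functions"]:
        name = function["FunctionName"]
        alarms = cloudwatch.describe_alarms_for_metric(
            MetricName="Errors",
            Namespace="AWS/Lambda",
            Dimensions=[{"Name": "FunctionName", "Value": name}],
        )["MetricAlarms"]
        if not alarms:
            print(f"No error alarm found for function: {name}")
```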

Function Instrumentation and Observability

A close cousin of monitoring is observability. Observability is the ability to ask questions and get answers about your system and its state. While monitoring is raising the alarm on known bad states, observability is providing enough detail and context about the state of the system to answer the unknown.

A key facet of observability is instrumentation. Instrumentation is the collection of data about your system that lets you observe it and ask questions. Ideally, the function's author should be instrumenting a function. They're in the best position to do this work because they should have the best understanding of what the function does.

But observability is all about dealing with unknowns, and that means the proper instrumentation to answer your question may not be in place. If that's the case, you as the operations person should be ready and able to instrument a function with additional code. If you need to add additional measurements, trace points, labels, etc., to solve a problem, then you should be in a position to work with your observability platform's SDK to do so.
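As a rough illustration, here's what that might look like in Python, assuming a hypothetical Lambda handler and plain structured log lines rather than any particular vendor's SDK:

```python
import functools
import json
import time

def instrumented(handler):
    """Emit a structured log line with timing and context per invocation."""
    @functools.wraps(handler)
    def wrapper(event, context):
        start = time.perf_counter()
        error = None
        try:
            return handler(event, context)
        except Exception as exc:  # recorded for the log line, then re-raised
            error = type(exc).__name__
            raise
        finally:
            print(json.dumps({
                "function": getattr(context, "function_name", "unknown"),
                "aws_request_id": getattr(context, "aws_request_id", None),
                "duration_ms": round((time.perf_counter() - start) * 1000, 2),
                "error": error,
                # Add trace points, labels, user or queue IDs, etc. as needed.
            }))
    return wrapper

@instrumented
def handler(event, context):
    # ... the function's existing business logic ...
    return {"statusCode": 200}
```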

I bring this up because you should be expected to do this work as part of your operations reliability responsibilities, but also because, for those coming from a background with less coding experience, it's a good way to get familiar with working in a code base. It forces you to read code to find the relevant parts you need to observe, add code of your own, and go through the proper process for deploying your change. It might seem small to those in operations who are used to working with code, but it will probably be a leap for many ops people.

Service Templates

A complaint I've heard regularly about serverless is the amount of configuration overhead incurred. Each new project requires AWS CloudFormation or Serverless Framework configuration.

I have a standardized project layout I use to make finding what I'm looking for quicker and easier. Services require test coverage and a CI/CD pipeline. The more best practices and standards I established and wanted to adhere to, the more work it took to create a new service. Soon, the amount of overhead in creating a new project became frustrating.

My solution was to investigate Serverless Framework templates, which are used at project creation time. I realized I could create my own templates to incorporate the standards I expected my services to meet. I now have a template for AWS Lambda with the Python 3.6 runtime.

This template saves me an incredible amount of time. New services are initialized with a standard project skeleton. A basic Serverless Framework configuration is created, along with a project testing layout and configuration and a TravisCI configuration. This might seem simple, but it's a way for you to ensure that people create new services in a standardized way.

Eventually, I want to take this a step further and create a library of referenceable architecture patterns. That library would demonstrate different infrastructure patterns, explain their pros and cons, and, finally, explain when each pattern should be employed. By creating this library I'm empowering the rest of the team to make better decisions on their own. The better we can empower developers to make good decisions up front, the less time we spend in meetings correcting errors.

It would also show the configuration necessary to deploy the given infrastructure pattern. Much of my time is spent in the AWS CloudFormation docs trying to remember how to do something I've already done before. Imagine the amount of time I spend in them figuring out how to deploy infrastructure I haven't deployed before. Then imagine how frustrating it must be for a developer who doesn't have the same level of experience.

Alleviating this frustration and helping developers deliver more easily is part of our job, and it's an area where we can be of great help to a team.

Managing SLOs

The operations person will be responsible for ensuring that service-level objectives (SLOs) are defined and that they're being met. What metrics does a service need to meet in order to be considered functioning normally? When they're out of bounds, what does the team do?

The first step to managing SLOs is creating them. Your SLOs aren't arbitrary values. They represent the expectations and needs of your organization and its users. Establishing them means understanding those needs and translating them into engineering terms. You should be working with the team's product manager to understand what the business is trying to achieve. You should also be working with a UX engineer, if you have one, to understand the needs of your users. Remember these important words.

"9s don't matter if users are unhappy."

Keep in mind that SLO targets may change over a system's lifetime. For example, your organization may determine a new SLO target for a service. Why? Maybe your organization has determined that decreasing page load time increases user retention. To achieve that, the variety of services involved in rendering that page, including services owned by your team, need to meet a new target. People are going to have to navigate the affected systems and look for areas of optimization to meet the new targets.

Also keep in mind that as usage scales, you may begin to expose limitations in your systems. A system that serves 10,000 users may not be able to serve 100,000 users and still meet its targets. We'll come back to this, though.

When you have SLO targets defined, you must actively track them. Don't wait for services to cross your thresholds in production before you react. Be proactive. In production, keep tabs on system performance metrics compared to your targets. If a system is getting close to its threshold, budget some time to look for areas of improvement. In your CI pipeline, before you even reach production, look at writing and including performance tests as well.
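As one sketch of what keeping tabs might look like, the snippet below uses boto3 to pull a Lambda function's p99 duration for the past day and warns when it approaches a target. The function name and the 800 ms target are hypothetical; swap in whatever metric backs your actual SLO.

```python
import datetime
import boto3

SLO_P99_MS = 800                           # hypothetical SLO target
FUNCTION_NAME = "my-service-prod-handler"  # hypothetical function name

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
    StartTime=now - datetime.timedelta(days=1),
    EndTime=now,
    Period=3600,  # hourly datapoints
    ExtendedStatistics=["p99"],
)

worst_p99 = max(
    (dp["ExtendedStatistics"]["p99"] for dp in resp["Datapoints"]),
    default=0.0,
)

# Warn once we've consumed 80% of the headroom, not after we've blown past it.
if worst_p99 > SLO_P99_MS * 0.8:
    print(f"p99 duration {worst_p99:.0f} ms is nearing the {SLO_P99_MS} ms target")
```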

Usage Scale Point Management

I mentioned earlier that a system that serves 10,000 users may not be able to serve 100,000 users and still meet its SLO targets. You might be tempted to move the goalposts so your system can support 100,000 users right from the start. Premature optimization, however, isn't the answer. Know both your expected usage targets and the scale points at which your system can no longer meet your SLOs.

This starts in the planning phase. The team's PM should have realistic target usage metrics. It's tempting to believe your new application or feature will be a runaway success that everybody will want to use, but that's rarely the case.

Target usage projections aren't exactly a science, but you have to start somewhere. If you only expect to have 2,000 users by the end of the year, don't architect a system to support 100,000 users from day one. By the end of the year you might find that you failed to achieve the expected growth and what you've built is going to be scrapped. Any extra engineering to support 100,000 users wouldn't have been worth it.

With target usage established, stress your system as it's being developed and after it's in production. Ensure you've left runway for growth, too. As you close in on your break points, budget time to refactor them.

How do you find those points? Try a tool such as Artillery. Run controlled tests to find the usage scale at which your application breaks, and how it breaks.
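If you just want to see the shape of such a test, here's a crude Python stand-in (not Artillery itself) that ramps concurrency against a hypothetical test endpoint and reports errors and p95 latency at each step:

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://test.example.com/api/widgets"  # hypothetical test endpoint

def one_request(_):
    """Issue a single request and return (succeeded, duration_seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    return ok, time.perf_counter() - start

# Ramp concurrency and watch where errors appear or latency climbs.
for concurrency in (10, 50, 100, 200):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(concurrency * 10)))
    errors = sum(1 for ok, _ in results if not ok)
    p95 = sorted(t for _, t in results)[int(len(results) * 0.95)]
    print(f"concurrency={concurrency} errors={errors} p95={p95 * 1000:.0f} ms")
```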

The New Work of Ops

The work described in this section isn't exactly new. We've read blog posts and seen conference talks about it. We may even know people who do it regularly. But the organizations doing this work regularly are typically highly mature and advanced. These are things we should all be doing, but many of us aren't yet.

This work isn't serverless-specific or a direct result of serverless technology itself. I mention these ideas because of the organizational changes serverless creates and the time it frees up. Serverless isn't merely a technical change; it has ripple effects throughout your organization. Use your new time wisely. When in doubt about what to do, look for areas of work you're not doing currently but feel you should be.

Incident Commander Management

Let's start by stating what this is and what this isn't. First, as the operations person, you're not the first responder to every alarm. That's just a return to the "shipped to prod, now it's an ops problem" mentality. Second, you're not the incident commander (the person directing the response) during every incident. That breeds a culture where people learn only enough about the systems they support until you take over.

Ideally, the operations person should become the individual who oversees the entire incident command process. Your role as the incident commander manager isn't serverless-specific, but a function of adopting the product pod model. As the operations person, you're probably the most experienced and qualified in this area.

Incident command doesn't start at the beginning of an incident and end at resolution or, better, a post-incident (post-mortem) review. New people need to be leveled up on how to be a proper first responder and incident commander. They should be trained on what to do before they're woken up at 3 a.m. As the incident commander manager, be the person in charge of ensuring the rest of the team can find what they need to solve an issue. Additionally, the process itself should always be undergoing refinement.

Chaos Engineering

The new discipline of chaos engineering deserves a special mention. The cloud can be chaotic, and with serverless you're placing more of your infrastructure into the hands of your cloud provider. Being prepared for failure, particularly failure you can't control, will be increasingly important.

To start, what is chaos engineering? Chaos engineering is the discipline of experimenting on systems to understand weaknesses, and build confidence that a system can withstand the variety of conditions it will experience. It's an empirical approach to understanding our systems and their behavior, and improving them.

If you're no longer responsible for operating the message queue service in your environment, for example, then you have more time to understand what happens when it fails and to improve your system's ability to handle that failure. Use your newfound time to plan events such as game days, where you test hypotheses and observe not just individual reactions but team reactions as well.
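One simple way to run such an experiment against a function you own is failure injection. The sketch below is a hypothetical Python decorator driven by made-up CHAOS_* environment variables; flipping them during a game day adds latency or raises errors so you can watch how retries, alarms, and the team respond.

```python
import os
import random
import time

def chaos(handler):
    """Inject latency or errors into a Lambda handler, controlled by env vars."""
    def wrapper(event, context):
        delay_ms = int(os.environ.get("CHAOS_DELAY_MS", "0"))
        error_rate = float(os.environ.get("CHAOS_ERROR_RATE", "0"))
        if delay_ms:
            time.sleep(delay_ms / 1000.0)
        if random.random() < error_rate:
            raise RuntimeError("chaos: injected failure")
        return handler(event, context)
    return wrapper

@chaos
def handler(event, context):
    # ... normal processing, e.g. handling a queue message ...
    return {"statusCode": 200}
```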

Chaos engineering takes time. It takes time to formulate hypotheses. It takes time to compose and run experiments. It takes time to understand the results. And it takes time to implement fixes.

But you should have that time now. Let concepts like chaos engineering and game days go from things you only hear others talk about to things you actually implement in your own environment.

There's Still More

This, of course, isn't the entirety of the work required to operate serverless systems successfully, but it is a start. In fact, we cover security, the build, test, and deploy lifecycle, and financial management separately.

 

There's still more in our Serverless DevOps series! Read the next piece, Build, Testing, Deploy, and Management Tooling.

Read The Serverless DevOps Book!

But wait, there's more! We've also released the Serverless DevOps series as a free downloadable book. This comprehensive 80-page book describes the future of operations as more organizations go serverless.

Whether you're an individual operations engineer or managing an operations team, this book is meant for you. Get a copy, no form required.

[Free, Ungated Book] Serverless DevOps: What do we do when the server goes away?
