Serverless DevOps: What do we do when the server goes away?



We've expanded on the topic of Serverless Ops and what the role means for DevOps and Operations engineers. This is a part of our ongoing goal to prepare operations engineers for the coming world of serverless.

When I built my first serverless application using AWS Lambda I immediately became excited. The experience had me spending more time building my application and less time focusing on the infrastructure that was required to run it. I didn't have to think about how to get the service up and running, or even ask permission for the necessary resources. The result was an application that was running and doing what I needed quicker than I had ever experienced before. But, that excitement also led me to ask what my role would be in the future. If there were no servers to manage, what would I do? Would I be able to explain my job? Would I be able to explain to my current employer or a prospective one the value I provide? This was why I started ServerlessOps. I had this overriding question.

What do we do when the server goes away?

Starting This Conversation

Now is the right time for us to begin discussing what operations will be in a serverless world. What happens if we don't? It will be defined for us. At one end of the spectrum there are people proposing NoOps where all the operational responsibilities are transferred to software engineers. These views betray a fundamental misunderstanding of what operations is and its importance. Fortunately larger voices are already out there countering this attitude.

At the other end, there are people who believe operations teams will always be necessary and the status quo will remain unchanged. This view simply ignores the change that has been occurring over the past several years. I have seen, and personally been affected by, a shift in the operational needs of an organization due to shifts in technology. I have sat in a meeting with engineering leadership and been told that my skills would soon no longer be necessary, and that I would no longer be qualified to be on the engineering team. If DevOps and public cloud adoption haven't affected your job yet, it's only a matter of time for most of us. Adopting a "they'll always need me as I am today" attitude leaves you unprepared for change.

Somewhere in a range between these views an actual answer exists. Production operations, through its growth in complexity, is expanding and changing shape. As traditional problems we experience and deal with today become abstracted away by serverless, we'll see engineering teams and organizations change. This will be particularly acute in SaaS product companies. However, many of today's problems will still exist: system architecture design, deployment, security, observability, and more.

The serverless community largely recognizes the value of operations. Operations is a vitally important piece of successfully going serverless.

Operations never goes away; it simply evolves in practice and meaning. Operations engineers and their expertise will possess tremendous value. But, as a community we will be required to define a new role for ourselves.

With these thoughts in mind, what does this series have to offer?

  • Exploration of operational concerns and responsibilities when much of the stack has been abstracted away.
  • A proposed description of the role of operations when serverless.


This series is a start at defining what I see as the role for operations in a serverless environment. However, I don't believe it's the only way to define the role. I think of operations in the context of SaaS startup companies. It has been a while since I worked on traditional internal IT projects or thought of engineering without a product-growth-oriented mindset. My problems and experiences aren't necessarily your problems and experiences. This is the start of a conversation. I think it's good when people disagree with the ideas, observations, and conclusions I've reached. (And they will.) Disagreement helps people find the right answers for different people and organizations.

Here's what I see as the future of serverless operations.

What is “Operations”?

We need to start with a common definition of operations. People's understanding of operations is often different but in its own way correct. The choice of definition is just a signal of a person's experiences, needs, and priorities; and ultimately their point of view. But the end result of these divergent definitions is people talk past one another.

At the highest level, operations is the practice of keeping the tech that runs your business going. But below that level, “operations” can be used in a variety of ways. Because these meanings were tightly coupled for the longest time, people tend to conflate them.

What is operations? It is:

  • A team
  • A role
  • A responsibility
  • A set of tasks


Responsibility for operations was traditionally held by the operations engineer on the operations team who had operational responsibility and performed operations related tasks. The introduction of DevOps over the past several years has significantly changed this. The rigid structure of silos that once kept these meanings tightly coupled was torn down. And with that the meanings started to break apart.

Soon developers started to assume operational responsibilities resulting in them being unable to toss code over a wall to an operations team. We see this change in operational responsibility when someone says “developers get alerts first.”

On the other end of the definition spectrum, some organizations grouped engineers with operational and software development skills together to create cross-functional teams. These organizations do not have an ops team, and they don’t plan on having one.

In the middle, the role has varied heavily. In some organizations, the role has changed little beyond the addition of automation skills (e.g. Puppet, Ansible, and Chef) to the role. Other teams have seen automation as a means to an end for augmenting their operations responsibilities. And in some cases operations engineers trend much closer towards a developer; a developer of configuration management and other tooling.

So what does serverless hold for these definitions of operations?

What will operations be with serverless?

The future I see of operations for serverless infrastructure is two-fold:

  1. The operations team will go away.
  2. Primary responsibility for operations is reassumed by the operations role, taking tasks back from developers.

Serverless operations is not NoOps. It is also not an anti-DevOps return to silos. If the traditional ops team is dissolved, these engineers will require new homes. Their homes will be individual development teams, and these teams will be product pods or feature teams. This will lead to a greater rise in the formation of fully cross-functional teams who can handle both development and operations. It's a goal many of us see as the best expression of DevOps but have also failed to fully achieve.

Product Pods: Cross Function in Action

Product pods and feature teams are not new and many organizations have already implemented such a team structure. They are multidisciplinary teams that exist to solve a particular problem or problem set. They are able to thrive through team members of different perspectives, experiences, and skill sets collaborating with one another.

What does a product pod look like? Where I've seen them in use they have typically resembled the following:

  • Product manager
  • Engineering lead
  • Frontend developer(s)
  • Backend developer(s)
  • Product designer and/or UX researcher (often the same person)


The product manager (PM) is the team member responsible for owning and representing the needs of the business. The PM’s job is to turn the needs of the business into clear objectives and goals while also leading the team effort in coming up with ideas to achieve success. The product designer or UX researcher works with the PM to gather user data and turn ideas into designs and prototypes. The engineering lead is responsible for leading the engineering effort by estimating the technical work involved and guiding frontend and backend engineers appropriately.

What you end up with is a small team solving a business problem using their differing skills. The team is made stronger by their cross-functional skill set, which leads to the delivery of better systems.

The Ops Role in a Product Pod

What will the role of the operations engineer as a product pod member be? Their high-level responsibility will be the health of the team's services and systems. That doesn’t mean they’ll be the first one paged every time. It means that they will be the domain expert in those areas.

Software developers remain focused on the individual components while the operations engineer focuses on the system as a whole. They’ll take a holistic approach to ensuring that the entire system is running reliably and functioning correctly. In turn, by spending less time on operations, developers spend more time on feature development.

The operations engineer also serves as a team utility player. While their primary role is ensuring the reliability of the team’s services, they will be good enough to offload, augment, or fill-in other roles when needed.

The Ops Skillset

In order to be effective in their new role, what skills will be required of the operations engineer?

Let’s break this down into two sets:

  1. Skills currently possessed by the typical operations engineer today.
  2. Skills many operations people will need to level up or possibly begin acquiring.

The Skills Ops Bring

Let's start by running through the skills operations brings with their expertise. Historically these have been the primary skill set of operations engineers.

Platform/tooling understanding

Every engineer develops expertise in the platforms and tooling they regularly use. Frontend engineers develop expertise in JavaScript and its frameworks. (Or, they invent their own...) Backend engineers develop expertise in building APIs and communicating between distributed backend systems. As operations engineers, we develop expertise in platforms and tooling. In an environment running on AWS, we develop expertise in automation tools such as CloudFormation and Terraform. These are tools with a steep initial learning curve. We also have expertise in the idiosyncrasies of different AWS services. For example, Lambda has a phenomenon called "cold starts" and its CPU scaling is tied to function memory allocation.

Systems engineering

The systems engineering skills are present when designing, building, operating, scaling, and debugging systems. As operations engineers we're required to develop an understanding of the overall system we're responsible for and in turn spot weaknesses, areas of improvement, and understand how changes may impact a system's overall reliability and performance.

People skills

Lastly, there are people skills. While computers are fairly deterministic in their behavior, people are not. That makes people hard and we need to stop calling them soft skills. Much of the work of operations engineers is figuring out the actual problem a person is trying to solve based on the requests people make to them. A good operations team is a good service team. They understand the needs of those they serve. These people skills, and particularly the ability to get to the root of a user's problem, will be complementary to the product manager's role on the team.

The Skills We Need

Here are the skills we’ll need to level up on or even still acquire. The operations role has evolved haphazardly in different directions, driven by different views of what DevOps is, and this has resulted in an uneven distribution of these skills.


Coding

There's no avoiding the need to learn to code as an operations engineer. And not just that, they’re going to need to learn the chosen language(s) of their team. Whether it’s Python (yea!), Go (looks interesting), JavaScript (well, okay), Java (ummm...), or Erlang (what, really???), the operations person will need proficiency in that language.

Keep in mind the ops person serves as a utility player on the team. They’re not on the team to be just another developer. Being proficient in a language equates to being able to read it, code light features, fix trivial bugs, and do some degree of code review. If you’re quizzing them on B-trees or asking them about fizz-buzz solutions, you’re probably doing it wrong. However, if you’re asking them to code a basic function, or read a code block, explain it, and show potential system failure points, then congratulations! You’re doing it right.

On the positive side for those who are code-phobic, the code should be easier to understand. Isolating work into nanoservices keeps each piece of code small enough to comprehend on its own. This is in contrast to snaking your way through the large code base of a monolith or microservice.

The Ops Role Responsibilities

Once an operations engineer joins a product pod, what do they do day in and day out? What will be their job description? What tasks will they assume, create, and prioritize?

We need to be able to articulate a set of responsibilities, explain the role, and articulate a job description. Management and coworkers not being able to answer, "What does that person do here?" leads to job instability. Plus, if you can’t explain your job, then someone else will for you… And that possibly means your job being explained away with “NoOps” as its replacement.

The ops role will reclaim responsibilities developers started to own as DevOps began to grow. Remember, this isn't a return to silos. They will not be the sole owner of these areas. The ops person will be the primary owner and tasked with enabling their fellow teammates to meet these challenges where appropriate.

Standards and Best Practices

Operations engineers will take lead responsibility for building reliable and performant serverless systems by establishing system standards and best practices. We should be working to define the correct architecture patterns to use in the variety of use cases and situations encountered. To ensure these standards and best practices are adhered to, we should be implementing them through infrastructure as code.

Just take a simple task like passing an event between the components of a serverless system. Most people will do what they know and stick to what they've previously done and are comfortable with. But even such a simple task can be solved in a variety of ways that may not occur to an engineer. When do you use SQS over SNS, or even both in conjunction? Maybe a Lambda function fanout? Perhaps event data should be written to persistent storage first, such as S3 or DynamoDB, with events from those actions used to trigger the next action?

"Generally we should use SQS or Kinesis instead of SNS when writing to DynamoDB in this sort of application. They allow us to control the rate of writes. With SNS we're more susceptible to failures due to DynamoDB autoscaling latency."

What’s the best choice? The problem you’re trying to solve, its requirements, and the constraints on you will dictate the best choice. (As an aside, I'm very opinionated about Lambda function fanouts and will ask why you feel the need to handle function retries yourself. There are valid use cases but IMHO, few.)

Build, Deploy, and Management Tooling

What is one of the major obstacles we all face as engineers? Understanding our tools. The tools of our trade come with significant complexity. But, as we grow to master them these tools become significantly more powerful.

Tooling like CloudFormation, and AWS SAM and Serverless Framework (both built on top of CloudFormation), can be complex. I find many developers are not fans of configuration and domain specific languages. Your dev team members, if left to their own devices, WILL hardcode assumptions and thwart a lot of the problem solving CFN has built in, through sheer frustration with the tools used to manage their systems. CloudFormation is actually quite flexible if you're familiar with it and understand how to use it. One of Serverless Framework's improvements over CloudFormation is its plugin capabilities. Writing plugins will replace writing Puppet facts and Chef `knife` plugins.

"I could create a Serverless Framework plugin that will add a unit testing hook and run tests automatically before deployment. That could help streamline your development flow a little."

There is going to be testing and deployment work just as there is today. And if you look at even the current state of your testing and deployment, you may already see the work that will be needed tomorrow. Start with your basic unit testing frameworks for Python, Node, or Go. But wait, there’s more! What about integration testing? With so many small independent pieces there will need to be a greater emphasis on ensuring changes don't break systems. What about load testing? What does it mean for your data layer if your compute layer can rapidly scale? Will a rapid rise in Lambda invocations result in throttled DynamoDB operations? And how will you handle those throttled operations? Can your SQS consumers keep up with the queue? The ops person should take ownership of implementing tools like Artillery to find scaling issues which they then take the lead on solving.

Once tests have passed, it’s time to deploy new code. Someone will need to design and build the process and systems to ensure graceful rollout. How will you do blue/green deployments if you choose that approach? What about implementing feature flags?
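A rough back-of-the-envelope check can flag some of these mismatches before a load test does. The sketch below is illustrative only: the function names, the one-write-per-invocation default, and the idea of a flat "writes per second" budget are simplifying assumptions, not a model of how DynamoDB or SQS actually meter capacity.

```python
def write_headroom(invocations_per_sec: float,
                   provisioned_writes_per_sec: float,
                   writes_per_invocation: float = 1.0) -> float:
    """Return spare write capacity per second.

    A negative result means the compute layer can generate more writes
    than the data layer is provisioned for, so throttling is likely.
    """
    demand = invocations_per_sec * writes_per_invocation
    return provisioned_writes_per_sec - demand


def seconds_to_drain(queue_depth: int,
                     arrival_per_sec: float,
                     consume_per_sec: float) -> float:
    """Estimate how long consumers need to empty a queue.

    Returns float('inf') when consumers cannot keep up with arrivals.
    """
    net_rate = consume_per_sec - arrival_per_sec
    if net_rate <= 0:
        return float("inf")
    return queue_depth / net_rate
```

Numbers like these do not replace a load test with a tool such as Artillery, but they help decide which scenarios are worth testing first.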


Monitoring and Observability

Metrics, monitoring, observability, and logging: we'll continue to own primary responsibility for these. What will change is the issues we'll be looking for in these areas. Host CPU and memory utilization won't be your concerns. (That is, beyond the cost associated with higher memory invocations and the impact it has on your bill.) Now you'll be looking for function execution duration and out of memory exceptions.

It will be the responsibility of the operations person to select and implement the proper tooling (e.g. IOpipe, Thundra, Dashbird, or Epsagon) for solving these issues when they surface. If a function has started to throw out of memory exceptions, the ops person should be responsible for owning the resolution of the issue. The solution may be as simple as increasing function memory or working with a developer to refactor the problematic function.

There's also a phenomenon known as "cold starts" on Lambda that occurs when a function's underlying container (there aren't just servers but also containers in serverless) is first deployed behind the scenes. You'll need to observe their impact on your system as well as determine whether changes need to be made to alleviate them.
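Even without a dedicated tool, a first pass at spotting memory pressure can come from the REPORT line Lambda writes to its logs at the end of each invocation. The following is a minimal sketch: the regular expression reflects the REPORT format as I understand it, and the 90% threshold and function name are assumptions to adjust against your own logs.

```python
import re

# Matches the Duration, Memory Size, and Max Memory Used fields of a
# Lambda REPORT log line. Field layout is an assumption; verify it
# against your own CloudWatch Logs output.
REPORT_PATTERN = re.compile(
    r"Duration:\s+(?P<duration_ms>[\d.]+)\s+ms.*?"
    r"Memory Size:\s+(?P<memory_mb>\d+)\s+MB.*?"
    r"Max Memory Used:\s+(?P<used_mb>\d+)\s+MB"
)

NEAR_LIMIT_RATIO = 0.9  # hypothetical alerting threshold


def parse_report(line: str):
    """Extract duration and memory figures, or None if not a REPORT line."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    memory_mb = int(match.group("memory_mb"))
    used_mb = int(match.group("used_mb"))
    return {
        "duration_ms": float(match.group("duration_ms")),
        "memory_mb": memory_mb,
        "used_mb": used_mb,
        # True when the function is close to its configured memory limit.
        "near_limit": used_mb / memory_mb >= NEAR_LIMIT_RATIO,
    }
```

A function that is regularly flagged as near its limit is a candidate for either a memory increase or a refactor, which is the conversation described above.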

"While this service sees a certain percentage of cold starts, the data is not required and consumed by the user in realtime. Cold starts have no actual user impact. Our engineering time is better spent worrying about the smaller amount of cold starts in this other system that is user interactive."

Your operational expertise and knowledge of the system as a whole will help inform you of whether these need to be addressed since cold starts aren't necessarily problematic for all systems. 
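That judgment call can be made a little more concrete. The sketch below assumes that cold invocations can be recognized by an "Init Duration" field in the REPORT log line, and it encodes the reasoning above as a deliberately simple rule; the 5% threshold and both function names are hypothetical.

```python
def cold_start_rate(log_lines) -> float:
    """Fraction of REPORT lines that include an init phase (a cold start)."""
    reports = [line for line in log_lines if line.startswith("REPORT")]
    if not reports:
        return 0.0
    cold = sum(1 for line in reports if "Init Duration:" in line)
    return cold / len(reports)


def needs_attention(rate: float, user_facing: bool, max_rate: float = 0.05) -> bool:
    """Cold starts only matter here when a user is waiting on the response."""
    return user_facing and rate > max_rate
```

With this rule, a batch pipeline with a 25% cold start rate is left alone while an interactive API at the same rate gets flagged for a closer look.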

Performance Tuning and Scaling

Because they're both a part of modifying an existing running system, I've grouped scaling and performance tuning together. Effectively they're both about how to meet newly established system metric objectives. There is a newly established baseline for performance or scale, a target goal has been set, and the system requires modification to meet that goal. Trying to reduce response times to a certain target threshold or scaling a system to service a higher order of magnitude events are in my mind the same class of work.

Since the individual components of a serverless system scale automatically, the amount of scaling work narrows. Scaling of your storage, database, messaging, and queueing layers is handled for you. So, what work is left then?

Lambda functions have adjustable memory, which in turn adjusts the amount of CPU capacity allocated to the function. Adjusting the amount of memory is as simple as a few clicks in the AWS console, or better, a new stack deployment via CloudFormation or Serverless Framework. There's no creating new compute instances and performing a rolling migration.

Too slow? New system requirements? There will be work at rearchitecting systems by replacing individual services with different ones that better match new needs. Or maybe the system as a whole and its event pathways need rethinking?

"We're going to need ordered messages and guaranteed processing here to accommodate this requirement. That means we'll need to swap SNS for Kinesis here."

Evaluating code changes will also be a part of achieving new scale and performance goals. You'll start by using the growing performance and reliability tools in the space to find and refactor inefficient functions.

You may even find yourself looking at rewriting a single individual function in a different language. Why? Because at a certain point in scaling Lambda memory, an additional CPU core becomes available and your current chosen language may not be able to utilize that core. The operations person will be responsible for knowing facts like this. (The ability to rewrite a single nanoservice function compared to rewriting an entire microservice is pretty cool!!?!?)

Rapid scaling ability has its own class of issues, such as thundering herd problems. How will a change in scale or performance affect downstream services? Will they be able to cope? Someone will need a holistic view of, and responsibility for, the system as a whole so they can understand how upstream changes will affect downstream services.


Security

Serverless, when using a public cloud provider, means not spending time on securing the host your code runs on. If you're familiar with the AWS shared responsibility model for security, then you're familiar with ceding responsibility, and control, over different layers of security to your public cloud provider. As an example, physical security and hypervisor patching are the responsibility of the public cloud provider.

This is a good thing! AWS has a large dedicated security team and even in some of the most sophisticated of companies, there is usually more work than a security team can handle. (This assumes you even have a security team, which many organizations simply roll into the operations position already.) The shared responsibility model lets you spend less time on these areas and more time focusing on other security layers. The less you have to worry about, the more you can focus on what's left.

With serverless, the cloud provider assumes even more responsibility. Gone is the time spent worrying about host OS patching. Take the recent Meltdown and Spectre attacks. AWS Lambda required no customer intervention. AWS assumed responsibility for testing and rolling out patches. Compare this with stories about the time spent in some organizations tracking patch announcements, testing, rolling out patches (and rolling back in some cases), and the overall overhead incurred as a result of the disclosure. A month after disclosure, just one third of companies had patched only 25% of their hosts. Moving more work to the large dedicated security teams who support the public cloud providers is going to enhance the security posture of most organizations.

So what is the responsibility of an operations person with regard to the security of serverless systems? To start, any infrastructure change may result in system reliability issues and the same goes for security updates. AWS makes regular changes to their infrastructure; some announced but most not. How do you handle those changes today? You have alerts for system instability and performance degradation. When AWS announces security updates to the EC2 hypervisor layer you watch dashboards more closely during the process. If a system is that critical to your organization, then you should be investigating a multi-region architecture and the ability to fail over or redirect more traffic to a different region. None of this changes with serverless but it's worth reiterating that if you're using a public cloud provider then the impact of security changes and updates you don't control is already something you deal with and handle.

"Team, someone has disclosed a vulnerability with a website and logo. Keep a closer eye on alerts and the reliability of your services. AWS has released a bulletin that they will be rolling out patches to their infrastructure."

As for securing serverless infrastructure and systems, let's discuss a few areas. This isn't an exhaustive list of what to secure but it provides a good overview of where to start.

Start with the most basic of low hanging fruit. Are there S3 buckets which should not be open? In fact, if you're going serverless you'll find yourself hosting frontend applications and static assets in S3, which means determining what buckets should be and what buckets should not be open.

Similar to S3 bucket permissions, spend more time auditing IAM roles and policies for least privilege access. It can be tempting for a developer to write IAM roles with widely permissive access. But if a Lambda function only needs to read from DynamoDB then ensure it doesn't have permission to write to it. Ensure the function can only read from the one table it's intended to read from too! This may sound obvious but the current state of cloud security and the mishaps that occur make this all bear repeating.

Understand that not everyone has the same level of knowledge or sophistication when it comes to this topic. A developer may not know the difference between DynamoDB GetItem and BatchGetItem operations, but they know they can write dynamodb:* and be unblocked. The developer may not know how to get the DynamoDB table name from their CloudFormation stack, but they know they can use a wildcard and be unblocked. The operations member of the team should be finding these issues, correcting them and educating their team on best practices.

"I see you have a system that writes to a DynamoDB table. I went over the IAM role for it and it's too wide. Do you have time for me to show you how I fixed it and what you can do in the future?"

Also, ensure that event and action auditing services, for example AWS CloudTrail, are set up. This will give you visibility into what's occurring in the environment. Repeated AccessDenied failures? You want to know about that. Activity in an unexpected AWS region? You want to know about that.
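Finding these wildcards does not have to be a manual review. As a minimal sketch, the function below walks an IAM policy document (the standard JSON structure with Statement, Effect, Action, and Resource keys) and lists every wildcard in an Allow statement. The two example policies, the account ID, and the table name are made up for illustration, and a flagged wildcard is a prompt for review rather than proof of a problem, since some resource wildcards are legitimate.

```python
def find_wildcards(policy: dict) -> list:
    """Return (statement index, key, value) for each wildcard in Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = statement.get(key, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                if "*" in value:
                    findings.append((index, key, value))
    return findings


# Hypothetical policy a blocked developer might write.
BROAD_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "dynamodb:*", "Resource": "*"}],
}

# Hypothetical least privilege version: one read action on one table.
NARROW_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
    }],
}
```

Run against the broad policy, this reports both the action and the resource wildcard; the narrow policy comes back clean. A check like this fits naturally into a CI step so the conversation happens before deployment.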
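Both of those signals can be pulled out of CloudTrail records with very little code. The sketch below assumes each record is a dict with the awsRegion, errorCode, and userIdentity.arn fields found in CloudTrail event records; the function name, the threshold of three denials, and the choice to match only the "AccessDenied" error code are assumptions for illustration.

```python
from collections import Counter


def review_events(records, expected_regions, denied_threshold=3):
    """Summarize unexpected regions and identities with repeated access denials."""
    denied = Counter()
    unexpected = set()
    for record in records:
        region = record.get("awsRegion")
        if region and region not in expected_regions:
            unexpected.add(region)
        if record.get("errorCode") == "AccessDenied":
            arn = record.get("userIdentity", {}).get("arn", "unknown")
            denied[arn] += 1
    repeated = {arn: count for arn, count in denied.items() if count >= denied_threshold}
    return {"unexpected_regions": sorted(unexpected), "repeated_denials": repeated}
```

In practice the records would come from CloudTrail log files or an event stream, and the result would feed an alert rather than be read by hand.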

"I don't know how but someone is mining bitcoin in our account over in ap-southeast-2..."

Services and tools like Threat Stack (disclosure: I am a former employee), CloudSploit, and Cloud Custodian will help detect cloud infrastructure security issues and alert you to the need for a response.

All this so far and we still haven't touched on:

  • Dependency management
  • Application Security
  • Secrets Management
  • And automating security testing via CI/CD.

If you'd like to know more about serverless security then listen to our podcast appearance with Signal Sciences on DevOps and Security. Also, Protego and PureSec, both serverless-focused security product companies, have good additional information to help you get started and establish best practices.

Cost Management

The consumption-based cost of serverless, as opposed to cost based on provisioned capacity, creates new and unique challenges. But it also creates new opportunities for factoring cost into decision making. Operations people should keep an eye out for what should and should not be serverless from a cost perspective. For what is serverless, they should be looking for wasted money (something they often already do) and, perhaps in conjunction with the PM, helping to prioritize "worth-based development", which we'll discuss more later.

To start, there can be a degree of difficulty in determining what is suitable from a cost perspective to run serverless as opposed to in a container or on an EC2 instance. Determining the cost benefit of moving a few cron jobs, Slack bots, and low use automation services is easy. But once you try to figure out the cost of a highly used complex system, the task becomes harder. If you're attempting to do a cost analysis, pay attention to these three things:

  • Compute scale
  • Operational inefficiency
  • Ancillary services


Start by ensuring you're comparing apples to apples as well as you can in terms of compute. That means first calculating what the cost of a serverless service would be today as well as, say, a year from now based on growth estimates. Likewise, make sure you're using EC2 instance sizes you would actually use in production as well as the number of instances required. Next, account for operational inefficiency. A service may be oversized for a few reasons. You may only need one instance for your service but probably have more for redundancy. You may have more than you need because of traffic bursts. You may have more than you need because someone has not scaled down the service from a previous necessary high. Lastly, think about ancillary services on each host. How much do your logging, metrics, and security SaaS providers cost per month? All these will give you a more realistic picture of cost.
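The three factors above can be turned into a small comparison. This is a sketch under stated assumptions: Lambda is modeled as a per-request charge plus a charge per GB-second of execution, EC2 as a flat monthly price per host, and every price is passed in as a parameter because real prices vary by region and change over time. The function and parameter names are hypothetical.

```python
import math


def lambda_monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int,
                        price_per_gb_second: float,
                        price_per_million_requests: float) -> float:
    """Compute-scale cost: execution time weighted by memory, plus request charges."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * price_per_million_requests
    return gb_seconds * price_per_gb_second + request_cost


def ec2_monthly_cost(instances_needed: int, redundancy_factor: float,
                     instance_monthly_price: float,
                     ancillary_per_host: float) -> float:
    """Host cost including operational inefficiency and per-host ancillary services.

    redundancy_factor captures extra hosts kept for failover or bursts;
    ancillary_per_host covers logging, metrics, and security agents.
    """
    hosts = math.ceil(instances_needed * redundancy_factor)
    return hosts * (instance_monthly_price + ancillary_per_host)
```

Running both functions for today's traffic and again for a growth estimate a year out gives the apples-to-apples pair of numbers the analysis calls for.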

"This will cost you more than $5 per month on EC2 because you cannot run this on a single t2.nano with no metrics, monitoring, or logging in production."

The major cloud providers release case studies touting the cost savings of serverless and I've had my own discussions. I've seen everything from "serverless will save you 80%" to "serverless costs 2X as much". Both are true statements. You can't cargo cult another organization's accounting. Does the organization that saved so much money have a similar architecture to yours? Is the 2X cost a paper calculation or one supported by accounting? The organization which gave me the 2X calculation followed up with operational inefficiency and ancillary service costs eating up most of their savings, to the point they considered serverless and EC2 even.

Next, let's talk about how to keep from wasting money. Being pay per use, reliability issues have a direct impact on cost. Bugs cost money. Let's look at a few examples of how.

Recursive functions, where a function invokes another instance of itself at the end, are a thing and a valid design choice for certain problems. For example, attempting to parse a large file may require invoking another instance to continue parsing data from where the previous invocation left off. But anytime you see one, you should ensure that the loop will exit or you may end up with some surprises in your bill. (Ouch! This has happened to me...)

Lambda has built in retry behavior. Retries are important for not just building a successful serverless application, but a successful cloud application in general. But each retry costs you money. You might look at your CloudWatch metrics and see a function has a regular rate of failed invocations but you know from other system metrics that the function eventually successfully processes events and the system as a whole is just fine. (As an example, you may have an API endpoint that you periodically trigger rate limits on and with no way to handle that exception your function errors out.) While the system works fine, those errors are an unnecessary cost.
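One inexpensive safeguard is to carry a depth counter in the event and refuse to re-invoke past a ceiling. The sketch below is not tied to any AWS SDK: invoke_next is a hypothetical stand-in for whatever performs the asynchronous self-invocation, and the depth limit of 50 is an arbitrary assumption.

```python
MAX_DEPTH = 50  # hypothetical ceiling on chained invocations


def process_chunk(event: dict, total_size: int, chunk_size: int,
                  invoke_next, max_depth: int = MAX_DEPTH) -> int:
    """Process one chunk of a file and hand the remainder to the next invocation.

    Returns the offset reached. Raises instead of re-invoking when the chain
    is deeper than expected, so a bug cannot loop (and bill) forever.
    """
    offset = event.get("offset", 0)
    depth = event.get("depth", 0)
    if depth >= max_depth:
        raise RuntimeError(f"recursion depth {depth} reached; refusing to re-invoke")

    next_offset = min(offset + chunk_size, total_size)
    # ... parse the bytes in [offset, next_offset) here ...

    if next_offset < total_size:
        # Pass the position and an incremented depth to the next instance.
        invoke_next({"offset": next_offset, "depth": depth + 1})
    return next_offset
```

The guard also covers mistakes such as a chunk size of zero, where the offset never advances and the exit condition alone would never be reached.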
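For the rate limit example, handling the exception inside the function is often cheaper than letting the whole invocation fail and retry. Here is a minimal sketch of exponential backoff with jitter. RateLimited is a hypothetical exception standing in for whatever your API client raises, and the delay values are placeholders; keep in mind that time spent sleeping is still billed, so the delays should stay short relative to the function timeout.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by a client when the upstream API throttles us."""


def call_with_backoff(call, max_attempts: int = 5, base_delay: float = 0.2,
                      max_delay: float = 5.0, sleep=time.sleep):
    """Call `call()` and retry on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the platform's retry behavior take over
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # random wait up to the current ceiling
```

The sleep parameter exists so the helper can be tested without actually waiting.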

How do you handle spotting these issues as an operations person? First, apply the general responsibilities for reliability. Know and understand the behavior of your systems and alert when behavior is out of bounds. Just through the course of monitoring for failed invocations you're also monitoring for potential wasted compute.

More interestingly, monitor for system cost throughout the month. Is the system costing you the expected amount to run? If not, why? Is it because of inefficiency or is it a reflection of growth? If you plan on tracking system cost over time, enable cost allocation tags in your AWS account. And if you do this, there's room for new potential with serverless. Let's briefly discuss application cost monitoring and FinDev.

The ability to measure down to the function, and potentially feature, level is something I like to call “application cost monitoring”. If you track cost per function invocation and overall system cost over time, and then overlay feature deploy events, you have the ability to understand system cost changes at a much finer level. Just picture cost as another system metric.
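
As a rough sketch of what tracking cost per invocation could look like: Lambda bills per request plus per GB-second of duration, so an estimate can be derived from the duration and memory size of each invocation. The two rates below are illustrative placeholders, not current prices, and the record fields (`function`, `duration_ms`, `memory_mb`) are assumptions about how you collect invocation data.

```python
from collections import defaultdict

# Illustrative placeholder rates only; look up current pricing for your region.
PRICE_PER_GB_SECOND = 0.00002
PRICE_PER_REQUEST = 0.0000002


def invocation_cost(duration_ms, memory_mb,
                    gb_second_price=PRICE_PER_GB_SECOND,
                    request_price=PRICE_PER_REQUEST):
    """Estimate the cost of one invocation from billed duration and memory."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * gb_second_price + request_price


def cost_by_function(invocations):
    """Sum the estimated cost per function name over a list of records."""
    totals = defaultdict(float)
    for inv in invocations:
        totals[inv["function"]] += invocation_cost(
            inv["duration_ms"], inv["memory_mb"]
        )
    return dict(totals)
```

Emitting these totals alongside your other metrics is what lets you line up a cost change with a particular deploy.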

Application cost monitoring in turn helps yield a new practice popularized by Simon Wardley called FinDev. Down the road, an organization may even be able to tie features to revenue and be able to make smarter decisions about feature value and feature work. The ability to more efficiently perform worth-based engineering gives an organization a competitive edge over those less able to validate the worth of the work they're doing. Admittedly, right now you’re talking about small dollar amounts, and the cost saving over EC2 may already be enough to not care at the feature level. However, in the future as we become accustomed to the low costs of serverless, we will begin to care more.

Coding and Code Review

Lastly, operations will take a greater role in coding and the code review process. The exact level of participation will differ by team and organization, of course: needs differ, and so do the coding abilities of operations people. If you're in an organization that doesn't expect or highly value coding among operations engineers, there will inevitably be a more limited coding scope than in an organization that does expect and value the skill. However, that will only last so long, as operations engineers will (and should) be expected to level up to increase their effectiveness in this area.

At a high level, what responsibilities will be expected of an operations engineer around code?

First, the operations engineer will need to be proficient in the languages in use by the team. As a result, they will find themselves capable of fixing light bugs. While carrying out their reliability responsibilities and investigating errors, an operations engineer shouldn't just stop at determining probable cause and filing a ticket. They should always seek to go further into the code involved. That means isolating potentially problematic code, at least to the function instance, and fixing minor issues.

"This code expects an int, but the data actually contains a string representation of an int. Let me handle this."

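A light fix of that kind might look like the sketch below. The `as_int` helper is hypothetical, written here only to illustrate the int-versus-string case from the quote.

```python
def as_int(value):
    """Accept an int or a string representation of an int; reject anything else."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; treat it as bad input here.
        raise TypeError("expected an int or numeric string, got bool")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        return int(value.strip())  # raises ValueError on non-numeric text
    raise TypeError(f"expected an int or numeric string, got {type(value).__name__}")
```
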
Finally, if they can't fix the bug, they should write a thorough bug report. Guide the developer who will fix the issue as close to the problem source as your investigation got you. Here, communication skills will be of the highest importance.

Ideally, we should also eventually level up to be able to code simpler tasks and features. The only way to stay sharp in a skill, and for operations engineers to level up in this area, is to do some of the work. Work that could be given to a more junior engineer could instead be assigned to the operations engineer. In this capacity, the operations engineer provides extra development capacity. As the coding ability of the operations engineer improves, they can be called on more to augment the team's output when deadlines are tight or the workload has become too large.

"I can create a REST API with an endpoint that ingests the data from this service's webhook, enriches the data, and passes it onto this other system." (Inspiration from Alice Goldfuss.)

Finally, there’s code review. We should be bringing our unique knowledge and perspective of how the platforms we use work, and also think about the code as a part of the greater system. I was personally fortunate enough a few years ago to work with a very good operations-minded developer. They taught me how to write more reliable code by teaching me during review the ways in which cloud systems fail. It's not that I didn't already know these things; I just made assumptions that operations would succeed and often didn't account for failure. Review led to the team building more reliable services.

“This function should have retries because we may fail at the end here. However, you can’t safely retry the function due to this earlier spot. We’ll end up with duplicate DynamoDB records.”
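
To make that review comment concrete, here is a minimal sketch of one way to make the write safe to retry: key each record on an event identifier and only create it if it does not already exist. `FakeTable` is an in-memory stand-in written for this example, and the `event_id` and `payload` fields are assumptions. In DynamoDB itself the same effect would typically come from a conditional write.

```python
class FakeTable:
    """In-memory stand-in for a table, for illustration only."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, key, item):
        """Store the item only if the key is new; report whether it was stored."""
        if key in self.items:
            return False
        self.items[key] = item
        return True


def handle(event, table):
    """Process an event so that a retry of the same event writes nothing new."""
    created = table.put_if_absent(event["event_id"], {"payload": event["payload"]})
    return "created" if created else "duplicate"
```

With the write guarded this way, a retry triggered by a failure later in the function repeats the work but does not leave a second record behind.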

One problematic area, which I expect will generate much friction, is developers incorrectly evaluating the coding skills of an operations engineer. Operations engineers should not be judged on writing the fastest, most efficient, or pedantically and subjectively "best" code; they are there to help the team produce the most reliable code. Because an operations engineer is only a part-time software developer, treat them as you would a junior developer. Ensure they're writing clear, manageable, and reliable code that solves a defined problem. Beyond that, the skills required are those of a more senior software developer. Give work requiring a senior developer to a senior developer. Otherwise, the operations engineer will be set up to fail, and the team as a whole will fail to achieve what other more cohesive teams can.


There’s a lot in this post that’s assumed, unexplained, and undefended. Over the coming months, we’ll be expanding on the ideas and themes here. This piece has hopefully generated questions from people that will require further answering. Finally, what the ideas in this piece look like in practice, and how to get there, require practical explanations.

Lastly, how prepared do you feel you or your organization are for this shift? At ServerlessOps we specialize in DevOps transformation and AWS cloud transition leveraging serverless technology. Talk to us about our services to assist your organization in moving forward and preparing for this shift in cloud technology.

A thank you goes out to several people who contributed feedback while I was writing this. They were Brian Hatfield, Travis Campbell and others in #atxdevops on HangOps, Ryan Scott Brown, Ben Bridts, Rob Park, and Gwen Betts.
