My Ideal Web Service Setup

NOTE: this is still a work in progress, but I wanted to make it available ASAP. There isn't any order to these topics, the list isn't complete, and some are missing details. I will continue working on this and remove this warning at some point.

This post isn't about which server or UI framework to use. It's about the other things that contribute to a product's quality, ease of operation, and developer experience. It applies to both new and existing services. After a few decades working on web services with a broad spectrum of organizations and technologies, this is ideally how I would set up a web service.

There are two things I'd like you to take away from this:

  1. Developer experience should be a top priority. DevEx can make or break projects. Your developers are your first customers. Delight them.

  2. Use tools to monitor and operate your service. You should be aware of issues long before your customers report them.

Since there are many ways to do these things, I'm not prescribing a detailed approach. So it's a little hand-wavey, but it hits the important points.

Infrastructure as Code

Clicking around cloud consoles is good for exploring and trying things out, but it has no place in your service. Start from the very beginning with good IAC. Terraform is cool, but I'm a sucker for defining infrastructure in a high-level language like TypeScript. So if you're on AWS, use CDK. If you're on Azure or GCP, use Pulumi.

Put everything you can in IAC. This includes IAM permissions, buckets, message queues, clusters, lambdas, metrics, alarms, dashboards - everything. If you can do it in IAC, don't do it by hand. Set it up so you pass in the environment (dev/staging/prod) as an argument and it takes care of the rest, like npm run deploy --env staging (see the sketch after the tips below).

Tips:

  • Include the environment name in the resource name. This may seem redundant, but when all you're looking at is a resource ARN, it's good to know whether you're dealing with prod.

  • Tag everything you can with the environment name. This will make it easier to look things up later and to break down cost information.

  • CDK and Pulumi have flags that allow you to prevent a resource from being actually deleted when it's removed from IAC. Turn this on for anything involving customer data in prod. This prevents a potential IAC bug from, say, deleting your production database.

  • Your CI/CD should be set up in a way that infra changes are automatically applied. Don't do this manually, especially in production.
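Here's a minimal CDK sketch of that setup, assuming AWS and TypeScript; the stack, bucket, and tag names are illustrative:

  // deploy with: cdk deploy -c env=staging
  import { App, RemovalPolicy, Stack, Tags } from 'aws-cdk-lib';
  import { Bucket } from 'aws-cdk-lib/aws-s3';

  const app = new App();
  const env = app.node.tryGetContext('env') ?? 'dev';

  class ServiceStack extends Stack {
    constructor(scope: App, id: string) {
      super(scope, id);
      new Bucket(this, 'UploadsBucket', {
        bucketName: `my-service-uploads-${env}`,   // environment name in the resource name
        // protect customer data in prod from accidental deletion via an IAC bug
        removalPolicy: env === 'prod' ? RemovalPolicy.RETAIN : RemovalPolicy.DESTROY,
      });
    }
  }

  const stack = new ServiceStack(app, `my-service-${env}`);
  Tags.of(stack).add('environment', env);          // tag everything with the environment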

Accounts

Get your cloud account structure right from the beginning. It's really hard to rearrange things later. Consider separate accounts per environment. Use federated logins.

Structured Logging

When you emit a log message, you have the opportunity to include other metadata beyond that message. That's because cloud providers support "structured logging", where instead of just a message string, you pass in a JSON object.

Say you are emitting this log message: User 123 had 3 failed password attempts. Locking out. Later, you want to find how often this occurred, get a list of all user IDs that had this problem, or pull all the logs for user 123. Doing any of those things will involve painful parsing of the message with brittle regexes that will break if you ever change that log message.

Instead, imagine you emitted: { userID: 123, event: "LockOut", attempts: 3, message: "User 123 had 3 failed password attempts" }. Same message, but the metadata is now included as standalone fields. The cloud provider will ingest these, and then you can query by those fields in the log explorer without any parsing.

Plus, you can use the power of log-based metrics. That's a feature where a metric is emitted whenever a log message matches some filter. Then you are free to chart those metrics and alarm on them. This allows you to add metrics to your application with zero service code changes.
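For example, with CDK you might define a log-based metric that counts LockOut events. This is a rough sketch; the namespace, names, and threshold are illustrative:

  import { Duration, Stack } from 'aws-cdk-lib';
  import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
  import * as logs from 'aws-cdk-lib/aws-logs';

  export function addLockOutMetric(stack: Stack, serviceLogGroup: logs.ILogGroup) {
    // emit a metric whenever a structured log entry has event == "LockOut"
    const lockOuts = new logs.MetricFilter(stack, 'LockOutFilter', {
      logGroup: serviceLogGroup,
      metricNamespace: 'MyService',
      metricName: 'LockOuts',
      filterPattern: logs.FilterPattern.stringValue('$.event', '=', 'LockOut'),
    });

    // chart or alarm on the resulting metric with zero service code changes
    new cloudwatch.Alarm(stack, 'LockOutAlarm', {
      metric: lockOuts.metric({ period: Duration.minutes(5), statistic: 'sum' }),
      threshold: 10,
      evaluationPeriods: 1,
    });
  }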

There are structured logging libraries for every language and framework, and they're pretty straightforward. Also, set up some middleware to populate properties, like userID and route, on every message for the request.
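Here's a minimal sketch of that middleware using pino and Express; the route and field names are illustrative:

  import express from 'express';
  import pino from 'pino';

  const logger = pino();
  const app = express();

  // middleware: attach a per-request child logger with common fields
  app.use((req: any, res, next) => {
    req.log = logger.child({ route: req.path, userID: req.user?.id });
    next();
  });

  app.post('/login', (req: any, res) => {
    // userID, route, event, and attempts all become queryable fields
    req.log.info({ event: 'LockOut', attempts: 3 }, 'User had 3 failed password attempts. Locking out.');
    res.sendStatus(423);
  });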

Serverless

To me, a "serverless" product means that it:

  1. Is fully managed. No hardware, no machine instances

  2. Automatically scales up and down. And it should scale to 0

  3. Charges based on use

I've done everything from hosting on a machine in my apartment, to a dedicated server in a data center, to a VM in a data center, to cloud instances, to Kubernetes. Serverless is so much easier and usually cheaper. If you are not using a serverless solution, you'd better have a very good reason.

Databases

Code-Based Migrations

Every time you make a change to the database, like adding a column, it should be done in code. People have been doing this for a long time and there is a lot written about it, so I'm not going to go too deep. But you'd be surprised how many people don't do this very simple thing. Set it up, use it from the start, and you'll avoid a lot of pain.

One thing to avoid: adding a required (non-nullable) column in a single migration. As discussed in the deployments section below, break that kind of change into multiple smaller steps.
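A minimal sketch of what a migration file might look like with Knex; the table and column names are illustrative:

  import type { Knex } from 'knex';

  export async function up(knex: Knex): Promise<void> {
    await knex.schema.alterTable('users', (table) => {
      // nullable first; see the deployments section for making it required safely
      table.timestamp('locked_out_at').nullable();
    });
  }

  export async function down(knex: Knex): Promise<void> {
    await knex.schema.alterTable('users', (table) => {
      table.dropColumn('locked_out_at');
    });
  }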

Code-based Seeding

Migrations should also include data, such as lookup values (Postgres supports enums, by the way), that you want replicated between environments.

For testing environments like dev, you want to insert a bunch of test data so the database can become immediately useful.
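A sketch of a seed file in the same spirit (Knex again, with illustrative data and an assumed ENV variable):

  import type { Knex } from 'knex';

  export async function seed(knex: Knex): Promise<void> {
    // lookup values replicated to every environment
    await knex('plans')
      .insert([
        { id: 1, name: 'free' },
        { id: 2, name: 'pro' },
      ])
      .onConflict('id')
      .ignore();

    // throwaway test data for dev only
    if (process.env.ENV === 'dev') {
      await knex('users').insert({ id: 123, email: 'test@example.com' });
    }
  }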

PR-based environments

The idea here is that when you create a Pull Request, you will spin up a brand-new environment, which shouldn't be hard because you used IAC. The code for that PR is deployed to the environment and integration tests are run against those new endpoints. When the pull request is merged, the environment is destroyed.

In addition to testing the code in something as close to prod as possible, it's also a great way to share your work with others. Say you are building a new endpoint: now you can have folks go to some URL and try it out.

This sounds great, but it has nuances. Databases are tricky here. Plus, you may run into limits and other issues with certain cloud products when constantly creating and destroying resources. Also, avoid waste by setting up CI so that you don't spin up a PR environment for something like a readme change.

Testing

Integration tests

Unit tests are a given. Get those going. But then you should look to test your service as a whole. These tests are typically performed by running a series of HTTP requests and verifying their results.

Aim to cover as much as possible with integration tests. And you should be able to run integration tests against your local stack, as sketched below.
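A minimal integration test sketch using vitest and the built-in fetch; BASE_URL and the endpoint are illustrative:

  import { describe, expect, it } from 'vitest';

  const BASE_URL = process.env.BASE_URL ?? 'http://localhost:3000';

  describe('health endpoint', () => {
    it('returns 200 and a status payload', async () => {
      const res = await fetch(`${BASE_URL}/health`);
      expect(res.status).toBe(200);
      const body = await res.json();
      expect(body.status).toBe('ok');
    });
  });

Point BASE_URL at your local stack, a PR environment, or staging, and the same suite runs against all of them.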

Since you're using PR-based environments, you should run the integration test suite against every PR. Want bonus points? Set it up so that service code exercised during integration tests counts toward test coverage.

Invest time in building a nice integration test framework. It must be very easy to author new tests. The more test coverage, the better. Also, parallelize integration tests if possible to reduce wait time.

Deploying to staging and prod is only possible once tests have passed. Make sure there is a manual override in the event of an emergency. May you never have to use it.

Integ tests should keep good logs that are easy to look up. When an integ test fails, you should be able to quickly find the test and the line number it failed on. If you are doing browser automation, include screenshots. Screenshots can be saved in a bucket with lifecycle policies that delete old files so you don't stockpile a ton of them.

Canary Tests

You built a slick integration test framework. Let's use it to monitor service health. I refer to this as Canary Tests. It's just a subset of your integration tests that constantly runs. You don't need the entire integ test suite, but you need enough that you are confident it's exercising the main service code and all your dependencies (if a dependency is down, you wanna know about it).

Set up some system that will run these tests every, say, minute. Maybe you use a serverless function on some timer (remember that 15 minute timeout lol), or some long-running container. Whatever the case, just have it run nonstop.

When a test completes, it should emit a success/failure metric. You can do this by emitting metrics directly or through log-based metrics. Set up alarms (sketched after this list) that will go off if:

  1. Some number of tests fail in a given time window

  2. Metrics stop being emitted. Alarms can be set to go off due to lack of data. This means your canary test runner broke.
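A sketch of those alarms in CDK; the namespace, threshold, and period are illustrative:

  import { Duration, Stack } from 'aws-cdk-lib';
  import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

  export function addCanaryAlarm(stack: Stack) {
    const failures = new cloudwatch.Metric({
      namespace: 'MyService/Canary',
      metricName: 'Failures',
      statistic: 'sum',
      period: Duration.minutes(5),
    });

    new cloudwatch.Alarm(stack, 'CanaryFailureAlarm', {
      metric: failures,
      threshold: 3,              // some number of failures in the window
      evaluationPeriods: 1,
      // if the canary runner itself dies and stops emitting, alarm anyway
      treatMissingData: cloudwatch.TreatMissingData.BREACHING,
    });
  }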

Load Tests

Integration tests ensure your code works. Canary tests ensure your service is running. But wouldn't it be nice to know how much demand your service can handle? That's what load tests do.

Load tests are great for ensuring you can handle that big customer you are about to land. They're also useful for identifying bottlenecks in your service so you know what to focus on when you need to scale.

Some services, like Google Cloud Run, scale containers based on concurrent request count. Say you choose 100 for that number. That means your service should be able to handle 100 concurrent requests; after that, it'll scale up. If you choose a number that's too high, your service will fall over before it gets scaled out. If you choose one that's too low, you'll incur unnecessary costs. Load tests can help you determine this number.

Some people do load tests occasionally by hand. Not a fan. Instead, make load tests part of your CI/CD pipeline. After integ tests run in staging, run your load test. Load test results should include information like concurrent requests, faults, etc. You should be able to correlate that with other metrics to determine service health during the test.
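A rough sketch of a load test script you could run from a CI step, using autocannon; the URL and numbers are illustrative:

  import autocannon from 'autocannon';

  const result = await autocannon({
    url: process.env.TARGET_URL ?? 'https://staging.example.com/health',
    connections: 100,   // concurrent connections
    duration: 60,       // seconds
  });

  // correlate these with your service metrics during the test window
  console.log({
    requestsPerSec: result.requests.average,
    p99LatencyMs: result.latency.p99,
    non2xx: result.non2xx,
  });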

Keep it realistic, but if you want to see what it takes to take you down, knock yourself out. Just don't automate that, since your CI pipeline will constantly break and/or your costs will blow up.

Don't automate load tests in production. You could inadvertently take the service down and affect your customers. Production load tests are controversial, but if you do them, do it by hand at an odd time and always have your finger on the cancel button.

Test-driven Development

A few years ago, I worked really hard to build a testing system for our React/GraphQL frontend. It included a CLI that let you choose and run a test. Then your browser magically went through the flow to the feature you were building. We set up GraphQL mocks so that the database was in a certain "situation". Developers didn't need to get their databases into the right state. And it worked on everyone's machine. Since it could run headless, we ran it on CI and it contributed to test coverage.

This TDD was a game changer. Test coverage shot up. So did productivity. It was the best developer experience I've ever seen. I'm convinced there is no better workflow than TDD, as long as you have great tooling in place.

There are tools out there. Try them out, don't be afraid to customize things. Always aim for a world-class developer experience. Results will follow.

You can also do this for the backend. For that, you'll need to set up dependency mocks. This is quite an intricate thing and it's dinner time now, but look into it.

Alarms

Use IAC to set up alarms. You should have several severities. Sev1 should be for emergencies, the stuff that's worth getting woken up for. Sev2 should be a softer notification that can wait until the next day. You can use Slack for this but there are better tools like PagerDuty and some open source stuff that will give you proper notifications on your phone that can't be ignored.

A big mistake people make is they create alarms that constantly go off and/or are not actionable. Every alarm should require some action, even if it's just notifying a customer. Nobody likes being woken up for no reason. And if there are a gazillion notifications, the important ones will be lost.

Set up alarms on all the important things: 500s, latency, connection failures, dependency failures, etc. But make it easy to tune them via IAC. Review them often.

A nice feature of alarms is that in addition to going off when some threshold has been breached, they can also go off if no metrics are being emitted. If you are running canary tests (which you should be), then turn this on for any applicable alarms. This way if your canaries go down, you'll know about it. Use the "alarm on missing data" feature for anything else that should receive metrics at some interval.

Certificate Expiration Alarms

Having your service go down because of an expired certificate is embarrassing. Don't let that happen to you. You could set a reminder for a year from now, or pray that someone emails you about it. But there is a much better way.

Set it up so that every day or so a metric is emitted that indicates "days to expiration". Then add an alarm that will go off and alert you when that number gets below, say, 14. Be sure to set that alarm to go off on missing data, just in case the tool that emits the metric stops working. Do this, and as long as your alarm notifications are set up properly (Slack, email, phone, paging app, etc.), you don't have to worry.
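A sketch of that metric emitter as a scheduled function; the host and metric names are illustrative:

  import tls from 'node:tls';
  import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

  function getCertExpiry(host: string): Promise<Date> {
    return new Promise((resolve, reject) => {
      const socket = tls.connect({ host, port: 443, servername: host }, () => {
        const cert = socket.getPeerCertificate();
        socket.end();
        resolve(new Date(cert.valid_to));
      });
      socket.on('error', reject);
    });
  }

  // run this daily from a scheduled Lambda or cron job
  export async function handler() {
    const expiry = await getCertExpiry('api.example.com');
    const daysLeft = (expiry.getTime() - Date.now()) / (1000 * 60 * 60 * 24);
    await new CloudWatchClient({}).send(new PutMetricDataCommand({
      Namespace: 'Certificates',
      MetricData: [{ MetricName: 'DaysToExpiration', Value: daysLeft }],
    }));
  }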

If you set up an occasional operations meeting, it's worth including these metrics in your dashboard. Remember that it's not just SSL/TLS certs I'm talking about. Sometimes you need certificates for other things, like setting up an OpenTelemetry collector.

There are tools available for this. Or build your own if that's what you're into.

Throttling

If you use some kind of API gateway, you should already be protected against someone overwhelming you with requests to take you down (DDoS). But that doesn't necessarily stop an existing customer from sending a ton of requests. Too many requests from one customer could starve your resources, which could affect another customer. I've seen this happen, especially when customers start writing batch scripts.

Set up a system to prevent a customer from sending too many requests to a given endpoint, as sketched below.
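A sketch of per-customer throttling in Express using express-rate-limit; the limits and the customerId lookup are illustrative:

  import express from 'express';
  import rateLimit from 'express-rate-limit';

  const app = express();

  app.use('/api/', rateLimit({
    windowMs: 60 * 1000,   // 1 minute window
    max: 300,              // max requests per customer per window
    // throttle per customer rather than per IP; assumes auth middleware set customerId
    keyGenerator: (req: any) => req.customerId ?? req.ip,
  }));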

Easy Async Processing

Performance should influence everything you do. Faster requests equals success. A few milliseconds here, a few there...it adds up. I've witnessed big deals fall through because of performance alone. Set a bar, say, 50ms p99 latency, which means that 99% of the time, your service processes requests within 50ms. And that's a pretty generous number IMO.

That said, sometimes you have processing that needs to be done because of the request, but isn't important to the response. For example, say you want to send a webhook at the end of the request. Stuff like that should be done asynchronously.

Building a background job processor can get extremely complex. If it's a hassle to do, people will ultimately just stick that stuff in the synchronous call path. I'm not saying full-blown async workflows are unnecessary, but I highly recommend building a simple, generic asynchronous work processor.

This isn't hard to do. Make a static function, serialize a call to it, send a message with that serialized invocation, and have it processed by something like a cloud function.

Work backward from the developer experience. It should be as simple as:

function sendEmail(userID, email) { ... }

// in your handler code
ExecuteAsync(sendEmail, user.ID, email);

The actual implementation will depend on your language. Not that I'm recommending C#, but that language has a neat way to serialize code expressions, so I was able to do something pretty cool:

this.Defer(() => SendEmail(user.Id, email))

Now you have a dead simple way to defer processing to a background job that is fault tolerant and reliable. You may want to track each job in a database and log its UUID just in case you need to track it down later.
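Here's one way that could look on AWS. This is a rough sketch rather than the author's implementation; it assumes an SQS queue (JOBS_QUEUE_URL) and a Lambda consumer:

  import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

  const sqs = new SQSClient({});

  // background-safe functions, registered by name so the consumer can find them
  const jobs: Record<string, (...args: any[]) => Promise<void>> = {};
  export function registerJob<T extends (...args: any[]) => Promise<void>>(fn: T): T {
    jobs[fn.name] = fn;
    return fn;
  }

  // producer: serialize the call and enqueue it
  export async function ExecuteAsync(fn: (...args: any[]) => Promise<void>, ...args: unknown[]) {
    await sqs.send(new SendMessageCommand({
      QueueUrl: process.env.JOBS_QUEUE_URL,
      MessageBody: JSON.stringify({ jobName: fn.name, args }),
    }));
  }

  // consumer: a Lambda triggered by the queue looks up the function and runs it
  export async function handler(event: { Records: { body: string }[] }) {
    for (const record of event.Records) {
      const { jobName, args } = JSON.parse(record.body);
      await jobs[jobName](...args);
    }
  }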

Internal Libraries

At some point, you'll end up amassing a bunch of shared code. You may want to put all this in a single git repo. Often though, it would be easier if this code were a library just like any other dependency. Since you likely don't want to open source it, you'll need a way to publish your npm/cargo/gem/nuget/etc package.

Cloud providers have products that provide internal package distribution. I've found it to be a bit of a hassle to set up and migrate to, so consider doing this early on.

For JS/TS folks, set up your multi-package workspace early. You'll be happy you did because it's a pain to switch to later on.

Cloud Profiling

Cloud providers offer a product that can instrument your running service code. This data will tell you exactly what functions are being called, how often, and how long they take. It's easy to turn it on and it adds very little overhead.

Of course, this only helps if you use it. I've found that many people forget about it. If you think it would be valuable for you, set it up and remember that it's there when you need it.

Feature Flags

Feature flags are basically if statements that allow you to run code for certain users. If you are building a new feature, put the code behind a feature flag. This will allow you to run integ tests on it in production without affecting existing customers.

Set up your integration tests so they can run with or without a given feature flag. Have a simple "negative" test to ensure the feature-flagged code can't be reached by someone who doesn't have the flag.
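A sketch of that negative test (vitest); the endpoint, flag, and the tokenFor helper are hypothetical:

  import { describe, expect, it } from 'vitest';

  // hypothetical test helper that returns an auth token for a seeded test user
  declare function tokenFor(user: string): Promise<string>;

  const BASE_URL = process.env.BASE_URL ?? 'http://localhost:3000';

  describe('bulk export (behind the bulk-export flag)', () => {
    it('works for a user with the flag', async () => {
      const res = await fetch(`${BASE_URL}/exports`, {
        headers: { authorization: await tokenFor('user-with-flag') },
      });
      expect(res.status).toBe(200);
    });

    it('is not reachable for a user without the flag', async () => {
      const res = await fetch(`${BASE_URL}/exports`, {
        headers: { authorization: await tokenFor('user-without-flag') },
      });
      expect(res.status).toBe(404);
    });
  });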

I recommend using a product like AWS AppConfig that allows you to dynamically update which customers have access to a feature flag. This is great because you don't have to wait on deployments. Just make sure you have safeguards in place to prevent your app from failing if you somehow screw up the config data.

And please, please, please make it a required part of your development process to remove the feature flag conditions once everyone has the flag. It's super annoying to use a codebase that has a ton of outdated feature flag conditions. Plus, it'll allow you to remove that "negative" feature flag integration test.

Secrets

Secrets include database connection strings (which you can avoid in many cases and instead use IAM permissions), API keys for your dependencies, certs, etc.

Every cloud provider has a secret management product. Only the deployed code should have read access to those secrets. You can expose them as environment variables or read directly from the secrets manager API. Whatever you do, don't print them out in the logs 🤦🏾
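A sketch of reading a secret at startup with the AWS SDK; the secret name is illustrative:

  import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';

  const client = new SecretsManagerClient({});

  export async function getPaymentApiKey(): Promise<string> {
    const res = await client.send(
      new GetSecretValueCommand({ SecretId: 'my-service/prod/payment-api-key' }),
    );
    return res.SecretString!;   // use it, cache it if you like, but never log it
  }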

TODO: how to populate secrets (GitHub secrets, by hand, etc.)

Deployments

Let's piece together some of the things previously discussed and go through the deployment process I recommend. It is designed for automatic, safe, fast, audited deployments. Here's what happens after a PR is merged.

  1. Code is merged into main

  2. main is built, linted, and unit tested

  3. Staging deployment:

    1. Database migrations run in staging

    2. Code is deployed to staging

    3. Integration tests run

    4. Load tests run

  4. Bake time. This is simply a waiting period before promoting to prod. During this time, canary tests will exercise your code. Bake time is great for catching things like memory leaks that take time to show up. During bake time, you need to monitor for any alarms that go off. Only if no alarms go off will the code be promoted to prod.

  5. Prod deployment. You may want to enforce a time window for this. This avoids having code deploy to production in the middle of the night and potentially wake you up if it introduces problems:

    1. Database migrations run in prod

    2. Code is deployed to prod

    3. Integration tests run

  6. Rollback window. This is similar to bake time. For some predetermined duration, watch for alarms. If an alarm goes off, then the code should automatically roll back to its previous version. You need to decide if you want to also revert the migrations. I'm torn on this one.

  7. Deployment is complete

A few things to remember:

  • You should always be able to quickly determine which git commit is running in production. In your CI/CD, you can add a tag and/or include the commit hash in the revision name. This is crucial for troubleshooting. Error messages often include line numbers. Well, if the code in main is different from what's deployed, you may not get the correct line when you navigate to the file. If you know the commit, you can git checkout 4b32da9c and have the same code that's running in production.

  • Everything should be manually overridable. In an emergency, you want to be able to deploy code within a minute or two. Hopefully you never have to use it, but it should be there. Set it up so manual overrides notify you in Slack or whatever.

  • Make sure there is a way to view all deployments, both in progress and in the past. Always be able to know what git commit was running in staging and production at any given time.

  • Consider deploying to only a portion of traffic at first. Like, when you deploy to prod, only 10% of traffic gets routed to the new deployment for a while, then the remainder if no alarms go off. This is often called a "canary deployment". It minimizes the customer impact caused by a bad deployment. Tools like Google Cloud Run and AWS Lambda have features that make this easy. Just make sure you follow the next tip.

  • Do not assume that your code will be deployed atomically. In other words, plan for the new code and the old code to run simultaneously for a bit, even if you have a single host. This is why we don't do things like adding a new required database column. Think this through, especially when making data changes. When in doubt, break it into multiple smaller deployments. For adding a required database column: add it as nullable, start writing it, backfill it, then make it non-nullable.

One of my brilliant previous coworkers wrote a great blog post on this. Highly recommended: https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/

Linting

It kills me when I see a PR comment related to style. It's a waste of time. Linters allow you to set up rules that enforce certain code styles. Have fun deciding whether to use semi-colons or not. But just make a choice and be consistent. Style-related comments should be unnecessary.

Debugging

All major languages have step-through debuggers. Heck, there's one right in your browser's dev tools. The idea is that you can pause code execution at any given line, walk step-by-step through the code, view variables, move up and down the call stack, execute arbitrary code, etc. They are incredibly helpful and faster than techniques like printing out log statements.

Throughout my career, I've constantly been astonished by folks who code without the use of these amazing tools. Some poor souls simply don't know they exist. Some want to use a debugger, but the application isn't set up properly for it (some JS bundler issue, jvm args not set, docker port not exposed, etc).

Not having a step-through debugger is a deal breaker for me. I need it to always be available at my fingertips. Ideally, pressing a single key combination will start your application with a debugger attached and it's all integrated into your editor. Make sure your application is set up for debugging. Include instructions in the readme. Do what you can to make it easy for everyone to debug. And please teach newcomers all the fantastic features your particular debugger supports so they can be as effective as possible.

I've been going on about TDD, so be sure that it's possible to debug your application code with integration tests running. If you're running a straightforward web app, then this should be easily possible by pointing to localhost. I've found though that this can be quite difficult with serverless functions. There are tools out there that can help with this.

Analytics

Metrics are important for key operational things like latency and errors. But they don't give you much insight into how people are using your application. This is where analytics come in. If you properly instrument your code with analytics events, you can see how many people are using your feature, and of those, how many are completing it. I suggest choosing an analytics tool that has a funnels feature to easily understand how effective your feature is.

When I worked for Vydia, I was surprised to learn that in addition to coding and testing, engineers were responsible for creating a Mixpanel dashboard for each feature they developed. This ensured that all the proper events were being emitted, and was a great handoff between engineering and product. While it felt weird at first, I quickly realized how brilliant it was.

Build an analytics dashboard that helps you easily understand what features people are using, and how effective those features are. Funnels are incredibly valuable.

Garbage Below

dev process:

  1. Add feature flags

  2. Integ tests

  3. Canaries

  4. Feature analytics dashboard

  5. Remove feature flags

Funnels

Alarms

Testing

Tdd style frontend

Backend mocks

Certificate alarms

Metrics

Service latency volume and faults

Uptime percentage

Certificates

Auto status page

Pager duty

Lambda or cloud run style thing

Hot reload

Assets on cdn

Ability to run routes and or service code async

Service should be very easy, like a CDK class

Allow customization to services like buckets and queues

Service roles should have minimum scope but keep it flexible

Protect deletion for staging and prod

No shared data stores among services

Consider a db API especially for mocking

Docker and code spaces

Lambda should have solid local testing like devbridge

Dependabot

Tracing

Profiler

Mixpanel

Load test

Clerk

OpenAPI

API from the start - generate docs and clients

Error reporting

Dependency monitoring

Stackoverflow for teams

Appconfig

Throttling

Database Neon Postgres

Summary

Cloud providers are different. If a cloud provider offers a managed service for any of these, use it.

At some point