How can we capture the incoming requests to replay them in a shadow production environment or in a test environment?

How can we investigate request failures like 5XX in the production environment?

We had the Envoy Proxy in front of our micro-services, handling traffic routing, request introspection, rate-limiting, circuit breaking and much more. You can read more about this architecture here.

We leveraged the same Envoy Proxy to capture all the traffic hitting our micro-services to solve these problems and it helped us support many other use cases. Keep reading to discover more.

Envoy’s AccessLog Filter
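Envoy's access log filter can stream one log entry per request, including request and response headers, to an external gRPC Access Log Service (ALS) over the StreamAccessLogs API. A minimal sketch of wiring this up on the HTTP connection manager, assuming a hypothetical cluster named access_log_service that points at our collector:

```yaml
# Sketch: attach a gRPC access logger to the HTTP connection manager.
access_log:
- name: envoy.access_loggers.http_grpc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.grpc.v3.HttpGrpcAccessLogConfig
    common_config:
      log_name: capture_traffic             # assumed log name
      transport_api_version: V3
      grpc_service:
        envoy_grpc:
          cluster_name: access_log_service  # assumed cluster for our collector
```

With this in place, every request flowing through the listener produces an HTTP access log message on that gRPC stream.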

Now, we needed a gRPC service that would listen to these access log messages and save them in a storage location. We decided to write these messages to a Kafka stream. Kafka allowed us to have multiple consumers solving different problems: one consumer would replay the traffic in a Shadow Environment, another would dump it into the Data Lake, and others would consume it to perform asynchronous operations. The diagram below depicts the flow.

[Diagram: Envoy streams HTTP access log messages to a gRPC access log service, which writes them to Kafka; separate consumers replay traffic in the Shadow Environment, dump it into the Data Lake, and perform async operations.]
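A minimal sketch of such a service, assuming Go with the go-control-plane generated stubs and the segmentio/kafka-go client (the broker address, port and topic name raw-traffic are hypothetical):

```go
package main

import (
	"context"
	"log"
	"net"

	accesslogdata "github.com/envoyproxy/go-control-plane/envoy/data/accesslog/v3"
	accesslogsvc "github.com/envoyproxy/go-control-plane/envoy/service/accesslog/v3"
	"github.com/segmentio/kafka-go"
	"google.golang.org/grpc"
	"google.golang.org/protobuf/encoding/protojson"
)

// server implements Envoy's AccessLogService and forwards every HTTP log
// entry onto a Kafka topic, so multiple consumers (shadow replay, Data Lake
// sink, async processors) can fan out from the same stream.
type server struct {
	accesslogsvc.UnimplementedAccessLogServiceServer
	writer *kafka.Writer
}

func (s *server) StreamAccessLogs(stream accesslogsvc.AccessLogService_StreamAccessLogsServer) error {
	for {
		msg, err := stream.Recv()
		if err != nil {
			return err // Envoy closed the stream or it failed
		}
		for _, entry := range msg.GetHttpLogs().GetLogEntry() {
			s.publish(stream.Context(), entry)
		}
	}
}

func (s *server) publish(ctx context.Context, entry *accesslogdata.HTTPAccessLogEntry) {
	payload, err := protojson.Marshal(entry)
	if err != nil {
		log.Printf("marshal log entry: %v", err)
		return
	}
	if err := s.writer.WriteMessages(ctx, kafka.Message{Value: payload}); err != nil {
		log.Printf("write to kafka: %v", err)
	}
}

func main() {
	writer := &kafka.Writer{
		Addr:  kafka.TCP("kafka:9092"), // assumed broker address
		Topic: "raw-traffic",           // hypothetical topic name
	}
	lis, err := net.Listen("tcp", ":9001") // the port Envoy's ALS cluster points at
	if err != nil {
		log.Fatal(err)
	}
	grpcServer := grpc.NewServer()
	accesslogsvc.RegisterAccessLogServiceServer(grpcServer, &server{writer: writer})
	log.Fatal(grpcServer.Serve(lis))
}
```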

How did we capture the request and response payloads?

We needed an HTTP filter to capture the request and response payloads and add them as additional headers alongside the existing request and response headers.

The Lua HTTP filter (envoy.lua) allows us to embed Lua code into the Envoy Proxy; it runs during the request and response flow, inside Envoy's process address space.

In the Lua code, we simply copy the request/response body and add these payloads as additional request/response headers. It's just a memory copy, so this step introduced very little additional latency. However, it could increase the memory footprint, so we decided to add a size check limiting the request/response payloads to less than 8K each. All of our requests met this criterion, although we ended up dropping a few large response payloads before they could sink into our Data Lake. This was acceptable since the response body was not much of a concern for us.

However, there are two unwanted side effects of adding these additional request/response headers.

First, the additional request headers would be passed to the upstream micro-services. Second, the additional response headers would be passed to the clients.

How to suppress the additional payload headers beyond the Envoy ecosystem?

These payload headers need to be stripped before the request is forwarded to the upstream micro-service and before the response is returned to the client. With that in place, the headers stay within Envoy and still get passed to the gRPC access log service as expected.

Here is a sample snippet showing how to capture the request/response payloads and add them as headers.
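A minimal sketch of such a Lua filter, assuming hypothetical header names x-request-payload and x-response-payload and the 8K cap described above (the filter must sit before the router filter in the chain):

```yaml
http_filters:
- name: envoy.lua
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
    inline_code: |
      local MAX_PAYLOAD_BYTES = 8192

      function envoy_on_request(request_handle)
        -- body() buffers the full request body; copy it into a header so the
        -- gRPC access logger can pick it up later.
        local body = request_handle:body()
        if body ~= nil and body:length() > 0 and body:length() <= MAX_PAYLOAD_BYTES then
          request_handle:headers():add("x-request-payload", body:getBytes(0, body:length()))
        end
      end

      function envoy_on_response(response_handle)
        local body = response_handle:body()
        if body ~= nil and body:length() > 0 and body:length() <= MAX_PAYLOAD_BYTES then
          response_handle:headers():add("x-response-payload", body:getBytes(0, body:length()))
        end
      end
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```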

Http gRPC Access Log Configuration
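The payload headers added by the Lua filter are not logged by default; the HTTP gRPC access logger only includes custom headers that are explicitly listed. A sketch of that configuration, extending the access log entry shown earlier with the hypothetical payload header names:

```yaml
access_log:
- name: envoy.access_loggers.http_grpc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.grpc.v3.HttpGrpcAccessLogConfig
    common_config:
      log_name: capture_traffic
      transport_api_version: V3
      grpc_service:
        envoy_grpc:
          cluster_name: access_log_service
    # Only headers listed here are included in the log entries, so the payload
    # headers added by the Lua filter must be named explicitly.
    additional_request_headers_to_log:
    - x-request-payload
    additional_response_headers_to_log:
    - x-response-payload
```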

How to anonymize the sensitive PII (Personally Identifiable Information)?
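As noted under the Load Test use case below, we hash out the auth/access tokens so that raw credentials never reach the Data Lake. One possible place to do this is in the access log service sketched above, before each entry is written to Kafka. A minimal sketch with hypothetical header names (these would also need to be listed under additional_request_headers_to_log to appear in the log entry at all):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"

	accesslogdata "github.com/envoyproxy/go-control-plane/envoy/data/accesslog/v3"
)

// anonymize replaces token-like request header values with a SHA-256 digest,
// so the captured traffic can be stored and replayed without exposing raw
// credentials. The header names are hypothetical examples; call this on each
// entry before publishing it to Kafka.
func anonymize(entry *accesslogdata.HTTPAccessLogEntry) {
	headers := entry.GetRequest().GetRequestHeaders()
	for _, name := range []string{"authorization", "x-access-token"} {
		if value, ok := headers[name]; ok {
			digest := sha256.Sum256([]byte(value))
			headers[name] = hex.EncodeToString(digest[:])
		}
	}
}
```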

Use Cases

Replay Traffic in Shadow Production Environment

Load Test in Performance Environment

Since we hash out the auth/access tokens, running tests with these captured requests works well for read APIs like Search and Browse. However, it would still be possible to smartly replace the tokens with test users' tokens in the test environment to replicate something close to the production traffic pattern.
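A minimal sketch of such a replayer, assuming Go with the kafka-go client, the JSON-encoded log entries produced by the access log service above, and a hypothetical shadow host and test token:

```go
package main

import (
	"bytes"
	"context"
	"log"
	"net/http"

	accesslogdata "github.com/envoyproxy/go-control-plane/envoy/data/accesslog/v3"
	"github.com/segmentio/kafka-go"
	"google.golang.org/protobuf/encoding/protojson"
)

const (
	shadowBase = "https://shadow.example.internal" // hypothetical shadow environment host
	testToken  = "Bearer TEST_USER_TOKEN"          // hypothetical test user's token
)

// replay re-issues a captured request against the shadow environment,
// swapping the hashed production token for a test user's token.
func replay(entry *accesslogdata.HTTPAccessLogEntry) {
	req := entry.GetRequest()
	// the payload header added by the Lua filter (hypothetical name)
	body := req.GetRequestHeaders()["x-request-payload"]
	httpReq, err := http.NewRequest(req.GetRequestMethod().String(), shadowBase+req.GetPath(), bytes.NewBufferString(body))
	if err != nil {
		return
	}
	httpReq.Header.Set("Authorization", testToken)
	if _, err := http.DefaultClient.Do(httpReq); err != nil {
		log.Printf("replay failed: %v", err)
	}
}

func main() {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka:9092"},
		Topic:   "raw-traffic",     // topic written by the capture service
		GroupID: "shadow-replayer", // one consumer group per use case
	})
	for {
		msg, err := reader.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		var entry accesslogdata.HTTPAccessLogEntry
		if err := protojson.Unmarshal(msg.Value, &entry); err != nil {
			continue // skip entries we cannot decode
		}
		replay(&entry)
	}
}
```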

Investigate Request Failures

Improve Bot Detection Algorithms

Offload Async Operations

In conclusion, production traffic shadowing has many advantages, and accomplishing it through Envoy Proxy is seamless.
