How can we capture the incoming requests to replay them in a shadow production environment or in a test environment?

How can we investigate request failures, such as 5XX errors, in the production environment?

We had Envoy Proxy in front of our microservices, handling traffic routing, request introspection, rate limiting, circuit breaking, and much more. You can read more about this architecture here.

We leveraged the same Envoy Proxy to capture all the traffic hitting our microservices and solve these problems, and it ended up supporting many other use cases as well. Keep reading to discover more.

Envoy Proxy provides the Access Log filter, which lets us select request and response headers and log them to output devices such as stdout or a log file. It can also send these access log messages to a configured service as protobuf over gRPC, via the envoy.http_grpc_access_log configuration.

Now we needed a gRPC service that would listen for these messages and save them to a storage location. We decided to write these messages to a Kafka stream. Kafka allowed us to have multiple consumers solving different problems: one consumer would replay the traffic in a shadow environment, another would dump it into the Data Lake, and others would consume it to perform asynchronous operations. The diagram below depicts the flow.

Envoy HTTP filters are allowed to augment the request and response headers.

We needed an HTTP filter to capture the request and response payloads and add them as additional headers alongside the existing request and response headers.

The Lua HTTP filter (envoy.lua) allows us to embed Lua code in the Envoy Proxy; the code runs during the request and response flow, inside Envoy's process address space.

In the Lua code, we simply copy the request/response body and add these payloads as additional request/response headers. It is just a memory copy, so this step introduces little additional latency. However, it can increase the memory footprint, so we added a size check that limits request and response payloads to less than 8K each. All of our requests met this criterion, although we ended up dropping a few large response payloads from sinking into our Data Lake. This was acceptable, since the response body was not much of a concern for us.

However, adding these extra request/response headers has two unwanted side effects.

First, the additional request headers would be passed on to the upstream microservices. Second, the additional response headers would be passed on to the clients.

How do we suppress these additional payload headers beyond the Envoy ecosystem?

Envoy provides a solution to this: we can simply prefix the header names with ":" and these headers are treated specially. They are suppressed, so they never reach the upstream microservices or the clients. Notice the special header names ":req-body" and ":resp-body" in the snippet below.

These headers stay within Envoy and still get passed to the gRPC access log service, as expected.

Here is a sample snippet showing how to capture the request/response payloads and add them as headers.
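The following is a minimal sketch rather than the exact filter we ran; it assumes Envoy's Lua body and header APIs (body(), getBytes(), headers():add()), and the exact configuration field names vary by Envoy version. The 8K limit and the ":req-body"/":resp-body" names follow the description above.

```yaml
http_filters:
- name: envoy.lua
  config:
    inline_code: |
      -- Sketch: copy the buffered request body into a ":req-body" header,
      -- skipping payloads larger than 8K to bound the memory overhead.
      function envoy_on_request(request_handle)
        local body = request_handle:body()
        if body ~= nil and body:length() > 0 and body:length() <= 8192 then
          request_handle:headers():add(":req-body", body:getBytes(0, body:length()))
        end
      end

      -- Same idea for the response body, exposed as ":resp-body".
      function envoy_on_response(response_handle)
        local body = response_handle:body()
        if body ~= nil and body:length() > 0 and body:length() <= 8192 then
          response_handle:headers():add(":resp-body", body:getBytes(0, body:length()))
        end
      end
- name: envoy.router
```

Calling body() buffers the payload, and the header add is just a memory copy, which is why the latency impact stays low.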

Here is a sample Envoy configuration snippet that sends the selected request/response headers to the gRPC cluster.
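A minimal sketch of that configuration, placed under the HTTP connection manager, might look like the following; the log_name and the traffic_sink_service cluster name are illustrative placeholders, while the payload header names match the Lua filter above.

```yaml
access_log:
- name: envoy.http_grpc_access_log
  config:
    common_config:
      log_name: traffic_capture              # illustrative log name
      grpc_service:
        envoy_grpc:
          cluster_name: traffic_sink_service # assumed cluster pointing at the Traffic Sink Service
    # Forward the payload headers added by the Lua filter, in addition to the
    # standard request/response properties carried in each access log message.
    additional_request_headers_to_log:
    - ":req-body"
    additional_response_headers_to_log:
    - ":resp-body"
```

The Traffic Sink Service then implements Envoy's AccessLogService gRPC API and receives these entries as a stream of protobuf messages.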

On receiving the request and response payloads along with the other headers, the Traffic Sink Service simply writes the events to the Kafka sink. The anonymizer service consumes from this firehose and anonymizes PII such as password hashes and access/auth tokens before sinking the records into the Data Lake.

Use Cases

Having this infrastructure in place opens up interesting opportunities to solve many use cases.

We could have another service consume the ongoing traffic from the Kafka sink and replay it in the shadow production environment. This would also allow us to run different versions of the algorithms or ML models in the shadow environment and run live nDCG evaluations against them before rolling them out into the production environment.

Organizations usually maintain test suites for running load tests. These suites may not closely match the production traffic pattern, and they often go stale. Having all the production requests saved in the Data Lake along with the event time allows us to replay traffic, with appropriate rate limiter tuning, to perform load testing in the performance environment.

Since the auth/access tokens are anonymized, running tests with these captured requests works well for read APIs like Search and Browse. However, it would still be possible to smartly replace the tokens with test users' tokens in the test environment to replicate a traffic pattern close to production.

The Data Lake holds the full request details and the response metadata, including the response status code. It is easy to pull out the failed requests and replay them in the dev environment to reproduce issues seamlessly.

Detecting bot traffic is a cat-and-mouse game; bad bots often find the gaps and make their way through. The ability to replay the traffic pattern exactly as it occurred in the production environment allows us to improve the bot detection algorithms easily.

It is also possible to offload asynchronous operations from the main thread of the microservices to this infrastructure. Operations like recently viewed items and add to favorites could be served asynchronously.

In conclusion, production traffic shadowing has many advantages, and accomplishing it through Envoy Proxy is seamless.
