Rate limits & scaling

fulfillmenttools APIs have a rate limit of ~1000 requests per second

This limit applies across the entire API and can be raised on demand. We also reserve the right to impose a limit lower than 1000 requests per second in the future.

Consider a system that provides APIs as the primary way to interact with it: API calls have to be issued from the client side to create, update, or read data on the platform. When the number of requests rises, the underlying infrastructure either needs to block calls (e.g. impose rate limits so that a fixed, minimal set of instances suffices) or needs to scale with the load. fulfillmenttools opted for the latter: the platform scales with the load imposed by the clients using the API.

That means that, in general, API usage can scale with our customers' businesses. To provide the necessary service level, the platform transparently scales the needed services horizontally, which means new containers that can answer requests have to be started. This is where the HTTP status code 429 (Too Many Requests) comes into play, potentially well below 1000 requests per second, as described in the following section.

Scaling behavior under load

When entering a scale-up phase, the API may respond to some requests with HTTP status code 429 (Too Many Requests). This does not necessarily mean that you have reached a rate limit; it means that the current call could not be processed due to a temporary lack of resources.

When you receive this response, a new instance is already starting. However, you need to re-issue the request to have it processed.
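Because a 429 during a scale-up phase only signals a temporary resource shortage, clients should re-issue the call after a short delay. The following sketch shows one way to do this with exponential backoff and jitter; the endpoint URL, token, and retry parameters are illustrative assumptions, not fulfillmenttools defaults.

```typescript
// A minimal retry sketch (TypeScript, Node 18+ with global fetch).
// The URL, token, and retry parameters below are illustrative assumptions.

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function requestWithRetry(
  url: string,
  init: RequestInit = {},
  maxAttempts = 5
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(url, init);

    // Any status other than 429 is treated as a final result in this sketch.
    if (response.status !== 429) {
      return response;
    }

    // 429 during scale-up: a new instance is already starting, so wait
    // briefly and re-issue the call. Exponential backoff with jitter
    // avoids synchronized retry bursts from many clients.
    const backoffMs = Math.min(250 * 2 ** attempt, 5_000);
    await sleep(backoffMs + Math.random() * 100);
  }
  throw new Error(`Still receiving 429 after ${maxAttempts} attempts: ${url}`);
}

// Hypothetical usage; replace the URL and token with your tenant's values.
// const res = await requestWithRetry(
//   "https://<tenant>.api.fulfillmenttools.com/api/orders",
//   { headers: { Authorization: "Bearer <token>" } }
// );
```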

[Drawing: one instance during a low load phase; additional instances starting during scale-up and high load; instances shut down again during scale-down]

Scaling up

When the load is low, the provided resources are minimal. In the example above, one instance is enough to handle all calls during the depicted "low load phase".

At the beginning of a "high load phase", the system also enters the "scale-up phase": on the server side, one or more additional instances start to provide the resources needed to handle the current load and the load anticipated in the near future. This happens fully automatically.

During this time, the additional resources become available and take over part of the load. In the example, two instances handle incoming requests from this point on.

High load phase

During periods of high traffic, the provided resources handle the incoming calls. If the load rises further, additional instances are provided, and the behavior is similar to the one described for entering a high load phase. However, the percentage of affected calls keeps decreasing as the ratio of operationally available instances rises over time.

Scaling down

When API usage drops again and the load decreases, unneeded instances are shut down to save resources. This happens completely transparently to the clients issuing requests. In the example above, once the load is gone, one instance is again enough to handle it.

When does the system scale?

There is no definitive answer to this question. It depends on multiple parameters, such as the complexity of the call, the CPU and memory it requires, the number and type of parallel calls that need to be processed, and the number of currently available instances.
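Because the exact scaling point cannot be predicted, it can help to smooth request bursts on the client side so the platform has time to scale up. The sketch below shows a small counting semaphore that caps the number of in-flight requests; the class name and the concurrency value are assumptions for illustration, not platform requirements.

```typescript
// A minimal counting semaphore for client-side request throttling (TypeScript).
// Capping parallel calls smooths bursts while autoscaling catches up.
// The permit count used below is an arbitrary example value.

class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // No permit free: wait until a finishing task hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the permit directly to the next waiting task
    } else {
      this.available++;
    }
  }

  // Runs the given async task once a permit is available.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Hypothetical usage: at most 10 requests in flight at a time.
// const gate = new Semaphore(10);
// const responses = await Promise.all(
//   urls.map((url) => gate.run(() => fetch(url)))
// );
```

Combined with the retry approach shown earlier, this keeps clients responsive while the platform provisions additional instances.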
