Scaling CouchDB to 1 Million Requests Per Minute (Part 1)

I recently had the chance to work through reliability issues with CouchDB. Where I work, it’s built into our core product, the POS system itself. Under the hood, it’s basically a version of CouchDB rewritten in Dart, and it’s open source.

I want to share some lessons I’ve learned while trying to scale CouchDB to reach 1 million requests per minute.

Overcoming `EMFILE` Error

We deployed CouchDB on EKS, which makes it easy to scale the nodes up or down with the workload. For a database, that usually means scaling up as the company grows.

But day-0 configurations often don’t hold up. Cracks appear, and in our case the crack was the EMFILE error.

[Figure: CouchDB CPU]

When the issue surfaced, CouchDB started to lock up and latency climbed. At the time, the node hosting it was an m7g.12xlarge with 48 vCPUs, so how could the graph report over 60 vCPUs? At this point the database consumers were waiting for responses from CouchDB. Rather than waiting for it to recover, our Kubernetes liveness probe restarted the affected pods. The degradation lasted around 10-15 minutes.

Other correlated metrics were the OS file descriptor count and the number of open databases: both graphs show a cutoff during the event.

[Figure: CouchDB OS File Descriptor and Database Opened Metrics]

What really nailed down the investigation were the following logs:

{{case_clause,{error,emfile}},[{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,108}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,239}]}]}

[error] 2026-01-02T12:11:26.035407Z couchdb@XXX <XXX.XXX.XXX> -------- Could not open file ./data/shards/XXX-XXX/XXXX.couch: too many open files

EMFILE together with “too many open files” was enough to point us to the answer. When we checked our deployment configuration, we found this:

couchdbConfig:
  couchdb:
    max_dbs_open: 100000

In CouchDB, documents are stored in named databases. After opening a database to read or write a document, CouchDB can keep it open so subsequent operations don’t need further internal lookups, which puts less strain on the database. max_dbs_open caps how many databases it keeps open at once.

Here the value was set to 100k. But when we ran the following:

-> ulimit -n
65536

Here is a breakdown of what that means:

ulimit (User Limit): This is a shell built-in command used to view or restrict the system resources available to the current shell and any processes started by it.
-n (Number): This specific flag targets the limit for open file descriptors.
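
One caveat: ulimit -n reports the limit of the shell it runs in. On Linux, the limit applied to an already running process can also be read from /proc/<pid>/limits. Here is a minimal Python sketch of that check. It is an illustration rather than the tooling we used, and the function names parse_max_open_files and read_limits are made up for this example.

```python
from typing import Optional, Tuple


def parse_max_open_files(limits_text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    None stands for "unlimited". Raises ValueError if the row is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def read_limits(pid: str = "self") -> Tuple[Optional[int], Optional[int]]:
    """Read the open-file limits of a process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_max_open_files(fh.read())
```

Calling read_limits with the PID of the database process shows the soft and hard limits that process actually runs with.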

The issue was that CouchDB operated as if it could hold 100k databases open at once. In Linux, everything is a file, so that setting assumes the process can use 100k file descriptors and then some, while the actual limit was 65,536.

One thing to know about CouchDB is that it’s written in Erlang, which implements the actor model under the hood. Every request entering CouchDB gets its own “actor” (an Erlang process), and each actor has its own “mailbox” that it listens to and executes from. It’s the concurrency paradigm native to Erlang.
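
The mismatch is easy to express as arithmetic. The sketch below is a hypothetical pre-deploy check in Python, not something from our pipeline. The reserved parameter is an assumed allowance for sockets, logs and other descriptors, and its default of 1024 is arbitrary.

```python
import resource


def fd_headroom(max_dbs_open: int, soft_limit: int, reserved: int = 1024) -> int:
    """Descriptors left over after max_dbs_open plus a reserve.

    A negative result means the config promises more open databases
    than the process is allowed to hold.
    """
    return soft_limit - (max_dbs_open + reserved)


if __name__ == "__main__":
    # Soft limit of the current process; the equivalent of `ulimit -n`.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        print("no descriptor limit set")
    else:
        print(fd_headroom(max_dbs_open=100_000, soft_limit=soft))
```

With our numbers, fd_headroom(100_000, 65_536) is negative, which is exactly the situation we were in.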

The CPU graph used the following query:

rate(container_cpu_usage_seconds_total)

It takes a counter (total CPU seconds used) and calculates its rate of change over a small window of time (e.g., 1 minute). In a real query, that window is written as a range selector such as [1m].

So when the process hit the 65536 limit:
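
To make the unit concrete, here is a simplified Python sketch of that calculation. It only divides the counter delta by the time delta. The real rate() function also handles counter resets and adjusts for the window boundaries, which this sketch deliberately leaves out. The name cpu_rate and the sample values are made up.

```python
from typing import List, Tuple


def cpu_rate(samples: List[Tuple[float, float]]) -> float:
    """Average CPU cores used between the first and last sample.

    Each sample is (timestamp_seconds, cpu_seconds_total).
    CPU-seconds per wall-clock second equals cores in use.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# 2880 CPU-seconds consumed in 60 wall-clock seconds: 48 cores fully busy.
window = [(0.0, 1000.0), (30.0, 2440.0), (60.0, 3880.0)]
cores = cpu_rate(window)
```

On a 48 vCPU node, a value of 48 means every core was busy for the whole window, so anything above that is a measurement artifact rather than real usage.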

The actor receives the EMFILE error. Because actors don’t handle catastrophic system errors, the actor immediately dies.
The Supervisor actor sees its worker just died. Its programmed job is to ensure the work gets done, so it instantly spawns a replacement actor to try the task again.
The new actor immediately tries to open the socket. The ulimit is still maxed out. The kernel rejects it. The new actor dies.

The CouchDB Erlang VM is now spawning, crashing, and garbage-collecting thousands of actors per millisecond.
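
The loop above can be modeled with a toy simulation. This is Python, not Erlang/OTP, and it is not CouchDB code. FdTable, supervise and max_restarts are hypothetical names, and max_restarts exists only so that the example terminates.

```python
import errno


class FdTable:
    """Toy stand-in for the kernel's per-process descriptor accounting."""

    def __init__(self, limit: int, in_use: int) -> None:
        self.limit = limit
        self.in_use = in_use

    def open(self) -> None:
        # The kernel refuses new descriptors once the limit is reached.
        if self.in_use >= self.limit:
            raise OSError(errno.EMFILE, "Too many open files")
        self.in_use += 1


def supervise(table: FdTable, max_restarts: int) -> int:
    """Restart a worker until it succeeds or max_restarts is reached.

    Returns the number of crashes caused by EMFILE.
    """
    crashes = 0
    while crashes < max_restarts:
        try:
            table.open()       # the worker's first action
            return crashes     # worker started fine
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                raise
            crashes += 1       # worker died, the supervisor tries again
    return crashes


full = FdTable(limit=65536, in_use=65536)
wasted = supervise(full, max_restarts=1000)
```

Nothing in the loop frees a descriptor, so with a full table every restart fails the same way, and all the work spent on spawning and cleaning up is wasted.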

Our explanation for the impossible reading: because the spike was so sudden and aggressive, the rate() function extrapolated that extremely steep slope. The result went beyond the physical 48 vCPUs and came out at over 60 vCPUs.

Fix

So to prevent this from happening again:

We raised the ulimit -n threshold to 100k via the container args:

ulimit -n 100000 && exec /docker-entrypoint.sh couchdb

We lowered max_dbs_open from the previously set value of 100,000.
We added a monitoring dashboard and automated alerts for file descriptor utilization against the newly set ulimit -n.
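
For the alerting piece, the core of the check is a single ratio. Below is a minimal Python sketch of the idea. It is not our actual alert rule, the 0.8 threshold is an assumed value, and count_open_fds relies on the Linux /proc filesystem.

```python
import os


def fd_utilization(open_fds: int, soft_limit: int) -> float:
    """Fraction of the descriptor limit currently in use."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return open_fds / soft_limit


def should_alert(open_fds: int, soft_limit: int, threshold: float = 0.8) -> bool:
    """True once utilization reaches the (assumed) alert threshold."""
    return fd_utilization(open_fds, soft_limit) >= threshold


def count_open_fds(pid: str = "self") -> int:
    """Count descriptors a process holds by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Alerting on the ratio instead of a fixed count means the alert keeps working if the ulimit is changed again later.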

Conclusion

It took me a while to understand how to maintain CouchDB and make it reliable.

I’ve been burned by incorrect analysis and by investigations that led nowhere. The one unlock was realizing how useful AI is for learning how Erlang works internally, including walking me through the codebase so I could understand the different code paths.

I hope this helps. This is only part 1 so I’ll definitely post more. Part 2 will be about how I used AI to export additional CouchDB metrics that weren’t available in other existing OSS solutions.
