Engineering Note · LinkLift

Why cache-aside is the right shape for a URL shortener

LinkLift is a URL shortener I built end-to-end with Redis, PostgreSQL, and an EC2 host. This note is about why I picked cache-aside, what that bought me, and what the load numbers are actually saying.

  • Throughput: 5,998 req/s
  • p99 latency: 50 ms
  • Concurrency: 100 connections

The workload, in one paragraph

A URL shortener is a read-skewed key-value lookup. The hot path is GET /:slug, which has to resolve a 6-8 character slug to a long URL and send a 302 before the visitor notices the redirect. Writes are rare, small, and more forgiving. That imbalance is the whole reason cache-aside fits.

Cache-aside, concretely

On a redirect request, the app checks Redis first. On a hit, it returns the destination URL from memory and never touches Postgres. On a miss, it falls back to Postgres, writes the resolved entry into Redis with a TTL, and then redirects. New links are written to Postgres as the source of truth; the cache can either be invalidated directly or left to expire.
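
A minimal sketch of that read path, assuming an Express handler with node-redis and pg; the `links(slug, destination)` table, the `slug:` key prefix, and the one-hour TTL are stand-ins rather than the exact production values:

```ts
import express from "express";
import { createClient } from "redis";
import { Pool } from "pg";

const app = express();
const redis = createClient({ url: process.env.REDIS_URL });
const pg = new Pool({ connectionString: process.env.DATABASE_URL });
await redis.connect();

// Cache-aside resolve: Redis first, Postgres only on a miss, then backfill the cache.
async function resolveSlug(slug: string): Promise<string | null> {
  const cached = await redis.get(`slug:${slug}`);
  if (cached !== null) return cached; // hit: Postgres never sees this request

  const { rows } = await pg.query(
    "SELECT destination FROM links WHERE slug = $1",
    [slug]
  );
  if (rows.length === 0) return null; // unknown slug -> 404 upstream

  const destination: string = rows[0].destination;
  await redis.set(`slug:${slug}`, destination, { EX: 3600 }); // backfill with a TTL
  return destination;
}

app.get("/:slug", async (req, res) => {
  const dest = await resolveSlug(req.params.slug);
  return dest ? res.redirect(302, dest) : res.sendStatus(404);
});
```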

The useful part is that Postgres only sees misses. Once the cache is warm, that is a small slice of total traffic. Redis handles the repeated lookups at memory speed, which is what makes the 50 ms p99 number unsurprising instead of magical.

Why not just pool Postgres connections?

For low-volume traffic, that would be fine. The reason to add Redis is the failure mode during a spike. With Postgres only, every redirect is a query, every query needs a connection, and a traffic burst walks the pool toward its ceiling. Past that point, latency is mostly queueing.
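
For a sense of where that ceiling sits, this is roughly what the Postgres-only variant looks like; the pool size and timeout here are made-up illustrative values, not LinkLift's config:

```ts
import { Pool } from "pg";

// Postgres-only variant: every redirect borrows a connection from this pool.
// Once `max` connections are busy, new requests wait in the pool's internal queue,
// and after `connectionTimeoutMillis` they fail outright. Values are illustrative.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,                       // hard ceiling on concurrent queries
  connectionTimeoutMillis: 2000, // how long a request waits for a free connection
});
```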

With cache-aside, the same spike mostly lands on Redis. The database absorbs misses, which are tied to the number of new or uncached slugs rather than total request volume. It also gives me a clean place for rate limiting that would still work if the app moved past one Node process.

Rate limiting and analytics

Rate limiting is 10 requests per minute per IP, implemented in Redis with an INCR + EXPIRE pattern keyed by IP. Keeping that state out of process memory means the limit stays consistent if the app is scaled horizontally.
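
As a sketch, reusing the Redis client from the read-path example; the helper name and key prefix are mine:

```ts
// Fixed-window limit: the first INCR in a window creates the key, EXPIRE gives it
// a 60-second lifetime, and anything past the 10th request in that window is rejected.
async function allowRequest(ip: string): Promise<boolean> {
  const key = `ratelimit:${ip}`;
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, 60); // start the one-minute window on first hit
  }
  return count <= 10;
}
```

One wrinkle with the plain INCR + EXPIRE pair is that a crash between the two calls can leave a key with no TTL; wrapping both in a MULTI or a small Lua script closes that gap.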

Click analytics capture user-agent, IP, and timestamp per redirect, but they run async. The user gets redirected first; the analytics write happens after. Losing a click row during a crash is acceptable. Making a visitor wait for analytics to commit is not.
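
The write itself is a fire-and-forget insert along these lines, reusing the pg pool from the read-path sketch and with illustrative table and column names:

```ts
import type { Request } from "express";

// Fire-and-forget analytics: the 302 has already been sent, so a failed insert
// is logged and dropped instead of being surfaced to the visitor.
function recordClick(slug: string, req: Request): void {
  pg.query(
    "INSERT INTO clicks (slug, user_agent, ip, clicked_at) VALUES ($1, $2, $3, NOW())",
    [slug, req.get("user-agent") ?? "", req.ip ?? ""]
  ).catch((err) => console.error("click insert failed", err));
}
```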

Deployment shape

The deployment is intentionally boring: one EC2 host, Docker Compose for app + Postgres + Redis, pm2 supervising the Node process, and a startup script that runs migrations before the app starts accepting traffic. The React + Tailwind frontend is served from the same host. The interesting part of this project is the read path, not the infrastructure.

What I would do next

  • Move click analytics to a queue or Redis stream and batch inserts into Postgres. Async writes are fine now; under sustained load, they would probably be the next bottleneck.
  • Add cache stampede protection, probably a small single-flight lock or probabilistic early refresh, so one hot slug expiring does not briefly hammer Postgres (the lock variant is sketched after this list).
  • Add lightweight redirect health checks and alerts so failures are visible before they show up as broken links.
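
For the single-flight idea, the shape I have in mind is a short Redis lock around the miss path, so only one request rebuilds an expired slug; this is a sketch of the direction, not something LinkLift does today:

```ts
// Single-flight rebuild: SET NX takes a short-lived lock so only one request hits
// Postgres when a hot slug expires; everyone else briefly waits and re-checks Redis.
async function resolveWithSingleFlight(slug: string): Promise<string | null> {
  const cached = await redis.get(`slug:${slug}`);
  if (cached !== null) return cached;

  const gotLock = await redis.set(`lock:${slug}`, "1", { NX: true, EX: 5 });
  if (gotLock) {
    try {
      return await resolveSlug(slug); // miss path from the cache-aside sketch above
    } finally {
      await redis.del(`lock:${slug}`);
    }
  }

  // Someone else holds the lock: wait briefly, then re-check the cache once.
  await new Promise((resolve) => setTimeout(resolve, 50));
  return (await redis.get(`slug:${slug}`)) ?? resolveSlug(slug);
}
```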

Numbers above were measured against the deployed instance with 100 concurrent connections.