Collects writes in memory, then merges them to disk in sorted batches - built for heavy writing.
writes -> in-memory memtable; flush to sorted disk files; reads check memtable then files (newest first)
Frequently asked questions
Is an LSM tree like jotting sticky notes then filing them later?
That is the idea precisely. You scribble notes on sticky pads all day - fast to write - then file them neatly into folders each evening. An LSM tree buffers writes in memory (the quick sticky-note step) and later flushes and merges them into sorted files on disk (the evening filing, called compaction). This makes writing extremely fast, which is why LSM trees power write-heavy systems like logging and time-series databases.
Why are LSM trees so much faster at writing than B-trees?
A B-tree updates data in place, which means each write may seek to a specific spot on disk and modify it - lots of slow random I/O. An LSM tree instead buffers writes in memory and later flushes them as whole sorted files written sequentially. Sequential writes are dramatically faster than random ones on both spinning disks and SSDs. So an LSM tree turns many small random updates into a few large sequential ones, which is why it dominates write-heavy systems like logging and time-series databases.
If writes go to memory, how does an LSM tree avoid losing data on a crash?
Before a write is acknowledged, it is also appended to a write-ahead log on disk - a simple sequential append, so it stays fast. The in-memory memtable holds the data for quick access, but the log is the durable record. If the system crashes before the memtable is flushed, it replays the log on restart to rebuild the lost in-memory state. This combination gives both speed (memory + sequential log) and durability (the log survives crashes).
Why might a read be slower in an LSM tree than a B-tree?
Because the data for a key could be in the memtable or in any of several disk files, the newest copy winning. A read may therefore have to check multiple places instead of one. To keep this fast, each disk file has a Bloom filter that quickly answers 'this key is definitely not in this file,' letting the read skip most files. Even so, reads do more work than a B-tree's single path, which is the fundamental trade: LSM trees optimise writes at some cost to reads.
What is compaction and why is it necessary?
Compaction is the background process that merges the accumulating sorted files into fewer, larger, sorted files. Without it, reads would have to check ever more files, and outdated or deleted entries would pile up wasting space. Compaction merges files, discards superseded values and tombstones (deletion markers), and keeps the number of levels bounded. It is the 'filing the sticky notes each evening' step - essential housekeeping that trades some background I/O for sustained read performance and reclaimed space.
Which real databases use LSM trees, and why?
Cassandra, RocksDB, LevelDB, HBase, ScyllaDB, and the storage engines behind many time-series and messaging systems all use LSM trees. They are chosen wherever write throughput is paramount: ingesting sensor or metrics data, logging, message queues, and high-volume key-value workloads. The Bloom-filter calculator in this section shows the exact mechanism LSM engines use to skip disk files on reads, so the two structures are closely linked in practice.