Ethereum internals: Geth storage model
An Ethereum blockchain client stores two sets of information: Block data and State data.
Block data refers to the primitive objects of the system:
- Header: Metadata for the block.
- Body: The set of transactions for the block (in Ethereum, the uncle headers are also part of the body).
- Receipts: The execution result of the block, one receipt per transaction.
State data is the set of account and storage records of the chain. The state changes when a transaction from a block is executed by the EVM. Internally, the state is modeled as a Merkle Patricia trie.
In this post, I am going to focus on how the Block data is stored in Go-ethereum.
Embedded KV-database
Go-ethereum uses an embedded key-value (KV) database as its main storage engine. An embedded database is a database management system that runs inside the application itself, most often as an imported library rather than as a separate external component. Historically, go-ethereum has used leveldb, but it is in the process of switching to pebbledb.
A key-value database stores data as a set of records, each identified by a unique key. It provides primitives to insert, query, and list ranges of records based on that key. Unlike traditional relational databases, it has no schema and cannot run complex queries over the data the way SQL does.
To work around this, developers can use the same KV primitives to build ad-hoc indexes by hand. For example, go-ethereum stores block headers keyed by hash, but it also keeps number→hash entries so that a block can be resolved by its number (e.g. for the eth_getBlockByNumber endpoint).
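The idea of a hand-built index can be sketched with a toy in-memory store. This is only an illustration: the kvStore type, the "h" and "n" key prefixes, and the helper names are hypothetical and are not go-ethereum's actual code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
	"strings"
)

// kvStore is a toy in-memory stand-in for an embedded KV database.
type kvStore struct {
	data map[string][]byte
}

func newKVStore() *kvStore {
	return &kvStore{data: make(map[string][]byte)}
}

// Put inserts or overwrites a record.
func (s *kvStore) Put(key, value []byte) {
	s.data[string(key)] = value
}

// Get queries a single record by its exact key.
func (s *kvStore) Get(key []byte) ([]byte, bool) {
	value, ok := s.data[string(key)]
	return value, ok
}

// Scan lists, in sorted order, every key that starts with prefix.
func (s *kvStore) Scan(prefix []byte) []string {
	var keys []string
	for key := range s.data {
		if strings.HasPrefix(key, string(prefix)) {
			keys = append(keys, key)
		}
	}
	sort.Strings(keys)
	return keys
}

// hashKey builds the primary key: "h" + block hash (hypothetical layout).
func hashKey(hash []byte) []byte {
	return append([]byte("h"), hash...)
}

// numberKey builds the index key: "n" + 8-byte big-endian block number.
func numberKey(number uint64) []byte {
	key := make([]byte, 9)
	key[0] = 'n'
	binary.BigEndian.PutUint64(key[1:], number)
	return key
}

// putHeader writes the header record plus its number→hash index entry.
func putHeader(s *kvStore, number uint64, hash, header []byte) {
	s.Put(hashKey(hash), header)
	s.Put(numberKey(number), hash)
}

// headerByNumber resolves number → hash → header with two reads.
func headerByNumber(s *kvStore, number uint64) ([]byte, bool) {
	hash, ok := s.Get(numberKey(number))
	if !ok {
		return nil, false
	}
	return s.Get(hashKey(hash))
}

func main() {
	s := newKVStore()
	putHeader(s, 1, []byte{0xaa}, []byte("header-1"))
	putHeader(s, 2, []byte{0xbb}, []byte("header-2"))

	header, ok := headerByNumber(s, 2)
	fmt.Println(string(header), ok)

	// List the index entries with a prefix scan.
	fmt.Println(len(s.Scan([]byte("n"))), "index entries")
}
```

The cost of the index is visible in putHeader: every write has to maintain both entries, and every lookup by number pays for two reads instead of one.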
Indexes
All the objects in leveldb are stored in the same logical bucket of data. Since we need to store different types of information, we need some mechanism to tell each set of data apart. Go-ethereum uses prefixes and suffixes to namespace the keys.
List of raw data stores:
<header_prefix><number><hash> : <header>
- Stores the block header in RLP format.
<block_body_prefix><number><hash> : <body>
- Stores the body (transactions and uncles) for the block in RLP format.
<block_receipts_prefix><number><hash> : <receipts>
- Stores the list of receipts for the block in RLP format.
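As a rough sketch, keys of this shape can be assembled by concatenating the prefix, the block number, and the block hash. The one-byte prefix values and the 8-byte big-endian number encoding below are assumptions modelled on go-ethereum's rawdb schema; check them against the version you are running.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// One-byte namespaces (assumed values, modelled on go-ethereum's rawdb schema).
var (
	headerPrefix        = []byte("h")
	blockBodyPrefix     = []byte("b")
	blockReceiptsPrefix = []byte("r")
)

// encodeNumber encodes a block number as 8 big-endian bytes, so that keys
// under the same prefix sort by block height.
func encodeNumber(number uint64) []byte {
	enc := make([]byte, 8)
	binary.BigEndian.PutUint64(enc, number)
	return enc
}

// blockKey builds <prefix><number><hash>.
func blockKey(prefix []byte, number uint64, hash [32]byte) []byte {
	key := make([]byte, 0, len(prefix)+8+len(hash))
	key = append(key, prefix...)
	key = append(key, encodeNumber(number)...)
	return append(key, hash[:]...)
}

func main() {
	var hash [32]byte
	hash[0] = 0xab // placeholder hash, for illustration only

	for _, prefix := range [][]byte{headerPrefix, blockBodyPrefix, blockReceiptsPrefix} {
		fmt.Printf("%x\n", blockKey(prefix, 54, hash))
	}
}
```

Encoding the number with a fixed width and in big-endian order is what makes range scans over a prefix return blocks in ascending height.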
List of the indexes:
<header_prefix><number><hash><difficulty_suffix> : <difficulty>
- Maps a block (number, hash) to the total PoW difficulty up to that point.
<header_prefix><number><header_suffix> : <hash>
- Maps each block number in the canonical chain to its hash.
<header_number_prefix><hash> : <number>
- Reverse index from block hash to number.
- A query to this index is required to retrieve the contents of a block by its hash.
<txn_index_prefix><tx hash> : <block number>
- Maps the transaction hash to the block in which the transaction was mined.
- This index must be rewritten and updated for every reorg of the chain.
- Retrieving a transaction by its hash takes three database reads: tx hash → block number, block number → block hash, (block number, block hash) → body.
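The three-read lookup can be sketched against a toy in-memory map. The prefix and suffix bytes ("l", "h", "n", "b") are assumptions modelled on go-ethereum's rawdb schema, and storing the block number as a fixed 8-byte value is a simplification made for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// db is a toy in-memory stand-in for the KV database.
type db map[string][]byte

var errNotFound = errors.New("not found")

// concat joins the parts of a key into a fresh byte slice.
func concat(parts ...[]byte) []byte {
	var key []byte
	for _, part := range parts {
		key = append(key, part...)
	}
	return key
}

// encodeNumber encodes a block number as 8 big-endian bytes.
func encodeNumber(number uint64) []byte {
	enc := make([]byte, 8)
	binary.BigEndian.PutUint64(enc, number)
	return enc
}

// txLookupKey builds <txn_index_prefix><tx hash>.
func txLookupKey(txHash []byte) []byte {
	return concat([]byte("l"), txHash)
}

// canonicalHashKey builds <header_prefix><number><header_suffix>.
func canonicalHashKey(number uint64) []byte {
	return concat([]byte("h"), encodeNumber(number), []byte("n"))
}

// bodyKey builds <block_body_prefix><number><hash>.
func bodyKey(number uint64, hash []byte) []byte {
	return concat([]byte("b"), encodeNumber(number), hash)
}

// bodyByTxHash finds the block body that contains a transaction.
func bodyByTxHash(d db, txHash []byte) ([]byte, error) {
	// Read 1: tx hash → block number.
	rawNumber, ok := d[string(txLookupKey(txHash))]
	if !ok || len(rawNumber) != 8 {
		return nil, errNotFound
	}
	number := binary.BigEndian.Uint64(rawNumber)

	// Read 2: block number → canonical block hash.
	hash, ok := d[string(canonicalHashKey(number))]
	if !ok {
		return nil, errNotFound
	}

	// Read 3: (block number, block hash) → body.
	body, ok := d[string(bodyKey(number, hash))]
	if !ok {
		return nil, errNotFound
	}
	return body, nil
}

func main() {
	d := db{}
	txHash, blockHash := []byte{0x01}, []byte{0xaa}

	d[string(bodyKey(54, blockHash))] = []byte("rlp-encoded-body")
	d[string(canonicalHashKey(54))] = blockHash
	d[string(txLookupKey(txHash))] = encodeNumber(54)

	body, err := bodyByTxHash(d, txHash)
	fmt.Println(string(body), err)
}
```

The caller still has to decode the body and pick out the matching transaction. Because the lookup goes through the canonical number→hash entry, a reorg that replaces block 54 changes which body this function returns.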
As expected, the entries that take up most of the space are the ones that store the raw RLP data of the blocks.
The ancient datastore is where these objects are moved once they are considered final.
Ancient
Since v1.9.0, Go-ethereum has included a “freezer” database. Once blocks fall behind the chain head by more than a threshold (90,000 blocks by default), their header, body, and receipts are moved from the leveldb database into an append-only database. The indexes stay in the quick-access storage.
Values in this ancient database cannot be updated or deleted. Note that before the Merge and the move to PoS, this threshold effectively capped the depth of the block reorgs that the client could handle.
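The threshold rule boils down to a little arithmetic on block numbers. The sketch below uses the default value quoted above. The function names and the exact boundary (whether the block at head minus threshold is itself frozen) are assumptions for illustration, not go-ethereum's implementation.

```go
package main

import "fmt"

// freezeThreshold is the default number of recent blocks kept in the KV store.
const freezeThreshold = 90_000

// freezeLimit returns the first block number that must stay in the KV store.
// Every block below it is old enough to be moved to the freezer.
func freezeLimit(head uint64) uint64 {
	if head < freezeThreshold {
		return 0 // the chain is still too short to freeze anything
	}
	return head - freezeThreshold
}

// canFreeze reports whether a block is old enough to be moved.
func canFreeze(head, number uint64) bool {
	return number < freezeLimit(head)
}

func main() {
	head := uint64(1_000_000)
	fmt.Println(freezeLimit(head))        // 910000
	fmt.Println(canFreeze(head, 909_999)) // true
	fmt.Println(canFreeze(head, 950_000)) // false
}
```

Since frozen values cannot be rewritten, a reorg whose common ancestor lies below freezeLimit(head) would have to change data that is already immutable.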
Some of the benefits of moving finalized data to the “freezer” are:
- It reduces the size of the leveldb database, which improves database performance and compactions.
- Ancient data has lower operational requirements than leveldb, so it can be served from cheaper hard drives.
Ancient format
Each object type (Header, Body, and Receipts) is stored in its own table. A table consists of one index file and multiple data files. The data stored in a table may be compressed with Snappy.
The index file stores sequential entries of 6 bytes each. Each entry represents a block of the canonical chain, so the index entry for block 54 is located at byte offset 54*6 = 324 of the index file.
Each entry stores a tuple (file number, offset) that points to where in the data files the content for that entry is stored. Since an entry only stores an offset, it is necessary to read the next entry as well to work out the length of the data (by subtracting the two offsets).
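A minimal sketch of this lookup follows. It assumes that the 6 bytes split into a 2-byte file number followed by a 4-byte offset, both big-endian, and that an item whose next entry points at a different file starts at offset 0 of that file. These details, and the names indexEntry and locate, are assumptions for illustration and not the library's API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// indexEntrySize is the size in bytes of one index entry.
const indexEntrySize = 6

// indexEntry is one (file number, offset) tuple of the index file.
type indexEntry struct {
	fileNum uint16 // assumed: first 2 bytes, big-endian
	offset  uint32 // assumed: last 4 bytes, big-endian
}

// encode serializes an entry into its 6-byte form.
func (e indexEntry) encode() []byte {
	b := make([]byte, indexEntrySize)
	binary.BigEndian.PutUint16(b[:2], e.fileNum)
	binary.BigEndian.PutUint32(b[2:], e.offset)
	return b
}

// decodeEntry parses a 6-byte index entry.
func decodeEntry(b []byte) indexEntry {
	return indexEntry{
		fileNum: binary.BigEndian.Uint16(b[:2]),
		offset:  binary.BigEndian.Uint32(b[2:6]),
	}
}

// locate returns the data file, start offset, and length of an item.
// It reads the item's own entry and the next one to derive the length.
func locate(index []byte, item uint64) (fileNum uint16, start, length uint32, err error) {
	begin := item * indexEntrySize
	if begin+2*indexEntrySize > uint64(len(index)) {
		return 0, 0, 0, errors.New("item out of range")
	}
	cur := decodeEntry(index[begin : begin+indexEntrySize])
	next := decodeEntry(index[begin+indexEntrySize : begin+2*indexEntrySize])

	if cur.fileNum != next.fileNum {
		// Assumed: the item did not fit in the current data file, so it
		// was written at the start of the next one.
		return next.fileNum, 0, next.offset, nil
	}
	return cur.fileNum, cur.offset, next.offset - cur.offset, nil
}

func main() {
	// A hand-made index: three items, the last one in a second data file.
	var index []byte
	for _, e := range []indexEntry{{0, 0}, {0, 100}, {0, 250}, {1, 40}} {
		index = append(index, e.encode()...)
	}

	for item := uint64(0); item < 3; item++ {
		fileNum, start, length, err := locate(index, item)
		fmt.Println(item, fileNum, start, length, err)
	}
}
```

Fixed-size entries are what make the lookup constant-time: the position of an entry is a multiplication, and the item length falls out of two adjacent entries without storing it explicitly.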
Example
This is an example of how to iterate over the Block data in Go-ethereum using the geth-data-layer library:
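package main

import (
	"fmt"

	gethdatalayer "github.com/umbracle/geth-data-layer"
)

func main() {
	// Open the chaindata directory of a go-ethereum node.
	// The second return values are ignored here for brevity.
	store, _ := gethdatalayer.NewStore("..../chaindata")

	// Walk the blocks in order; uncomment Seek to start at a given block.
	iter := store.Iterator()
	// iter.Seek(1000000)
	for iter.Next() {
		val, _ := iter.Value()
		fmt.Println(val.Number)
	}
}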