Go Application Benchmarking with GitHub PR Integration
Last week, I refactored the B-Tree serialization layer in VeryLightSQL, my toy database project. The new code was cleaner, more idiomatic, and eliminated all uses of Go’s unsafe package.
After I opened the PR, GitHub Actions posted a comment:
Performance Alert
BenchmarkInsert/Sequential: 357.24x regression
BenchmarkCursor/Advance_50rows: 8,044.58x regression
A dramatic performance regression. The code was functionally correct and every test passed, but it would have made my database essentially unusable.
This is why I benchmark everything.
Benchmarks Are a Learning Tool
I want to start with something that doesn’t get said enough: benchmarks teach you more about your code than almost anything else.
When you write a benchmark, you’re forced to think about:
- What operations are actually hot paths?
- How does performance scale with input size?
- Where are the hidden allocations?
When you run benchmarks regularly, you learn:
- Which abstractions have real costs
- How the Go runtime behaves under different conditions
- Why certain patterns exist in production code
I’ve learned more about Go’s memory model, garbage collector behavior, and standard library internals from benchmarking than from any documentation. Numbers don’t lie; they reveal what’s actually happening underneath your abstractions.
Setting Up Go Benchmarks
Go’s built-in benchmarking is straightforward. Here’s a basic example from VeryLightSQL:
func BenchmarkInsert(b *testing.B) {
    b.Run("Sequential", func(b *testing.B) {
        for range b.N {
            // Exclude per-iteration setup from the measurement.
            b.StopTimer()
            table, cleanup := setupBenchmarkTable(b)
            b.StartTimer()
            for j := range 100 {
                table.Insert(createRow(int32(j)))
            }
            // Exclude teardown as well.
            b.StopTimer()
            cleanup()
        }
    })
}
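For context, the setup helper hands back a table plus a cleanup function. Roughly, it looks like this (a hypothetical sketch; `Table`, `Row`, and `OpenTable` are stand-ins, not VeryLightSQL’s actual code):

// Hypothetical sketch of the helpers used above. The shape is what matters:
// open a table backed by a temporary file and return a teardown function.
func setupBenchmarkTable(b *testing.B) (*Table, func()) {
    b.Helper()
    path := filepath.Join(b.TempDir(), "bench.db") // b.TempDir is cleaned up automatically
    table, err := OpenTable(path)
    if err != nil {
        b.Fatal(err)
    }
    return table, func() { table.Close() }
}

func createRow(id int32) Row {
    return Row{ID: id, Username: fmt.Sprintf("user%d", id)}
}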
Key patterns I follow:
- Use sub-benchmarks (`b.Run`) to test variations
- Stop the timer for setup/teardown operations
- Test realistic scenarios, not micro-operations
- Include multiple sizes to understand scaling behavior
Run benchmarks with:
go test -bench=. -benchmem
The -benchmem flag is crucial; it shows allocations per operation, which often matters more than raw speed.
Posting Benchmark Results on GitHub PRs
The real power comes from automated benchmark comparison. I use `benchmark-action/github-action-benchmark` in my GitHub Actions workflow:
name: Benchmark

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.23'

      - name: Run benchmarks
        run: go test -bench=. -benchmem -run=^$ | tee benchmark-results.txt

      - name: Compare benchmarks
        uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'go'
          output-file-path: benchmark-results.txt
          github-token: ${{ secrets.GITHUB_TOKEN }}
          alert-threshold: '150%'
          comment-on-alert: true
          fail-on-alert: false
This setup:
- Runs benchmarks on every PR
- Compares against the `main` branch baseline
- Posts a comment if any benchmark regresses by more than 50%
- Shows the exact regression ratio for each affected benchmark
Here’s what the alert looks like on a PR:
⚠️ **Performance Alert** ⚠️
Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold `1.50`.
| Benchmark | Previous | Current | Ratio |
|-----------|----------|---------|-------|
| BenchmarkInsert/Sequential | 51,441 ns/op | 18,376,824 ns/op | 357.24 |
This caught my drastic regression before it hit main. Without automated comparison, I might have assumed “the tests pass, ship it.”
Learning from Streame’s Benchmarking Setup
I use a more sophisticated setup in Streame, my stream processing library. There, I generate CPU and memory profiles as SVG visualizations:
#!/bin/bash
# benchmarks/take_benchmarks.sh

BENCHMARKS=$(go test ./... -list Benchmark | grep Benchmark)

for BENCHMARK in $BENCHMARKS; do
  TIMESTAMP=$(date +"%Y-%m-%d_%H-%M")

  # Run with CPU profiling
  go test ./benchmarks -bench="^${BENCHMARK}$" \
    -cpuprofile="${TIMESTAMP}_${BENCHMARK}_cpu.prof" \
    -benchtime=5s

  # Memory profiling follows the same pattern (sketch; the real script may differ)
  go test ./benchmarks -bench="^${BENCHMARK}$" \
    -memprofile="${TIMESTAMP}_${BENCHMARK}_mem.prof" \
    -benchtime=5s

  # Generate SVG visualizations
  go tool pprof -svg "${TIMESTAMP}_${BENCHMARK}_cpu.prof" \
    > "${TIMESTAMP}_${BENCHMARK}_cpu.svg"
  go tool pprof -svg "${TIMESTAMP}_${BENCHMARK}_mem.prof" \
    > "${TIMESTAMP}_${BENCHMARK}_mem.svg"
done
The SVG reports get uploaded as GitHub Actions artifacts, so I can visually inspect where time is being spent:
- name: Upload benchmark reports
  uses: actions/upload-artifact@v4
  with:
    name: benchmark-reports
    path: benchmarks/*.svg
    retention-days: 30
This helps answer “why is it slow?” not just “is it slow?”
My Benchmark Design Philosophy
After maintaining benchmarks across multiple projects, here are patterns that work:
1. Benchmark Real Operations, Not Micro-Operations
Bad:
func BenchmarkMapAccess(b *testing.B) {
    m := make(map[string]int)
    m["key"] = 1
    for range b.N {
        _ = m["key"]
    }
}
Better:
func BenchmarkFindKey(b *testing.B) {
    table := setupTableWith200Rows()
    b.ResetTimer()
    for i := range b.N {
        table.findKey(uint32(i % 200))
    }
}
The second benchmark tests what users actually do. Micro-benchmarks are useful for understanding primitives, but integration-level benchmarks catch real regressions.
2. Test Multiple Scales
func BenchmarkSelectAll(b *testing.B) {
    for _, rowCount := range []int{50, 100, 200, 500} {
        b.Run(fmt.Sprintf("Rows_%d", rowCount), func(b *testing.B) {
            table := setupTableWithRows(rowCount)
            b.ResetTimer()
            for range b.N {
                _ = table.SelectAll()
            }
        })
    }
}
This reveals O(n) vs O(n²) behaviors that single-size benchmarks miss.
3. Include Allocation Metrics
Always run with -benchmem. A function might be fast but allocate heavily, causing GC pressure in production:
BenchmarkInsert/Sequential-10 10 10,587,679 ns/op 4,521,088 B/op 52,341 allocs/op
Those 52,341 allocations per operation will hurt in a long-running service.
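If you would rather not rely on remembering the flag, a benchmark can also opt in to allocation reporting itself with the standard `b.ReportAllocs()` call. A minimal sketch, reusing the helpers from the earlier insert benchmark (the function name here is my own):

func BenchmarkInsertAllocs(b *testing.B) {
    table, cleanup := setupBenchmarkTable(b)
    defer cleanup()
    b.ReportAllocs() // report B/op and allocs/op even without -benchmem
    b.ResetTimer()   // don't count the setup above
    for i := range b.N {
        table.Insert(createRow(int32(i)))
    }
}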
On a related note for benchmarking and the Go GC, Twitch’s blog post on memory ballast is worth a read.
4. Benchmark Your Hot Paths
Identify what operations happen most frequently. For a database:
- Key lookups (every query)
- Cursor advancement (every row read)
- Node serialization (every disk access)
For a web service:
- Request parsing
- Authentication checks
- Response serialization
Focus benchmark effort where performance matters most.
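For the web-service case, the same ideas apply. Here is a minimal sketch that exercises request handling and response serialization end to end; the handler is a hypothetical stand-in, but `httptest` and everything else is standard library (imports assumed: encoding/json, net/http, net/http/httptest, testing):

func BenchmarkHandleGetUser(b *testing.B) {
    // Hypothetical handler standing in for parse -> authenticate -> serialize.
    handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        id := r.URL.Query().Get("id")
        _ = json.NewEncoder(w).Encode(map[string]string{"id": id})
    })

    req := httptest.NewRequest(http.MethodGet, "/users?id=42", nil)
    b.ReportAllocs()
    b.ResetTimer()
    for range b.N {
        rec := httptest.NewRecorder()
        handler.ServeHTTP(rec, req)
    }
}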
The Value of Benchmarking
Setting up benchmarks takes time. Is it worth it?
For VeryLightSQL, the benchmark suite caught:
- A 357x regression in insert operations
- An 8,000x regression in cursor operations
- Memory allocation increases of 37x
Any one of these shipping to production would have been a disaster. The benchmark setup took maybe 2-3 hours. The alternative (debugging performance issues in production or, worse, not noticing them at all) would have cost far more.
More importantly, benchmarks grow more valuable over time. Once the setup is done, every future change is automatically checked for performance. The upfront work keeps paying off.
Getting Started
If you’re not benchmarking yet, start simple:
- Write one benchmark for your most critical operation
- Run it locally before and after changes
- Add it to CI with basic comparison
- Expand coverage as you identify more hot paths
You don’t need a perfect setup from day one. Even running `go test -bench=.` manually before merging catches obvious regressions.
The goal isn’t perfect benchmarks; it’s building the habit of measuring performance. Once you start seeing the numbers, you’ll never want to ship without them.
See the benchmark setup in action: VeryLightSQL benchmarks and Streame benchmarks.