Go optimization guide

Practical optimization workflow

Go optimization work is most reliable when each change has one measurable target, one code change, and one rollback path. Use profiling to pick the target first, then keep the change set small and verify with benchmarks and production-safe rollout checks.

Profiling-first loop

Use this loop for every optimization: collect profiles from realistic load, apply one change, then compare before and after. This avoids shipping “faster-looking” refactors that only move cost from CPU to memory or from throughput to tail latency.

curl -sS 'http://127.0.0.1:6060/debug/pprof/profile?seconds=30' -o cpu.pprof
curl -sS 'http://127.0.0.1:6060/debug/pprof/allocs' -o allocs.pprof
go tool pprof cpu.pprof

Concrete code examples

The sections below are intentionally grep-friendly. Each one shows code that often appears in reviews as a performance anti-pattern, followed by a safer replacement.

Preallocation in hot loops

When row counts are known or bounded, preallocating slices and maps usually removes a large amount of allocation churn.

// Bad: repeated growth and reallocation.
func collectBad(ids []int) map[int]struct{} {
	m := map[int]struct{}{}
	for _, id := range ids {
		m[id] = struct{}{}
	}
	return m
}

When a function must allocate its own output, reserve capacity once.

// Better: reserve capacity once.
func collectBetter(ids []int) map[int]struct{} {
	m := make(map[int]struct{}, len(ids))
	for _, id := range ids {
		m[id] = struct{}{}
	}
	return m
}

In tight loops, caller-managed reuse is usually the strongest pattern.

// Best when feasible: caller owns allocation and reuse.
func collectInto(dst map[int]struct{}, ids []int) map[int]struct{} {
	clear(dst)
	for _, id := range ids {
		dst[id] = struct{}{}
	}
	return dst
}

String building without fmt on hot paths

fmt.Sprintf is great for readability, but repeated formatting in request-critical loops often adds avoidable CPU and allocations.

Before replacing fmt with a faster formatter, confirm the value is needed at all. In many paths the best optimization is to delete unused string construction completely.

If the next layer can consume structured values, pass typed fields (for example user and count) through the call boundary instead of formatting text.

// Bad: formatting work on every call.
func lineBad(user string, n int) string {
	return fmt.Sprintf("user=%s count=%d", user, n)
}

When the formatted string is still required, a builder-based implementation is often cheaper on hot paths.

// Better: explicit builder and strconv.
func lineBetter(user string, n int) string {
	var b strings.Builder
	b.Grow(len(user) + 24)
	b.WriteString("user=")
	b.WriteString(user)
	b.WriteString(" count=")
	b.WriteString(strconv.Itoa(n))
	return b.String()
}

Another common win is to move reusable formatting outside loops or helper calls. If the message is static, create it once and reuse it instead of formatting each failure path.

// Bad: allocates and formats a new error each time.
func parseBad(tokens []string) error {
	for _, t := range tokens {
		if t == "" {
			return fmt.Errorf("bad token")
		}
	}
	return nil
}

For static messages, define the error once and reuse it.

This keeps allocation and formatting out of failure-heavy loops and makes the reuse intent explicit in reviews.

var ErrBadToken = errors.New("bad token")

// Better: reuse a sentinel error when context is static.
func parseBetter(tokens []string) error {
	for _, t := range tokens {
		if t == "" {
			return ErrBadToken
		}
	}
	return nil
}

Avoid interface boxing on hot paths

Unnecessary any and interface boxing can add allocation and dispatch overhead in tight loops.

// Bad: boxes to any on return path.
func sumBad(xs []int) any {
	var s int
	for _, x := range xs {
		s += x
	}
	return any(s)
}

Keep hot paths strongly typed unless dynamic behavior is required.

func sumBetter(xs []int) int {
	var s int
	for _, x := range xs {
		s += x
	}
	return s
}

Reuse http.Client and http.Transport

A shared client and transport keep connection reuse effective and reduce handshake and socket churn.

// Bad: new client/transport for each call.
func fetchBad(url string) (*http.Response, error) {
	c := &http.Client{Timeout: 2 * time.Second}
	return c.Get(url)
}

Prefer one shared transport and one shared client for the process lifetime.

// Better: one shared transport/client with tuned pool settings.
var sharedTransport = func() *http.Transport {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxIdleConns = 200
	t.MaxIdleConnsPerHost = 100
	t.MaxConnsPerHost = 100
	return t
}()

var sharedClient = &http.Client{
	Timeout:   2 * time.Second,
	Transport: sharedTransport,
}

func fetchBetter(url string) (*http.Response, error) {
	return sharedClient.Get(url)
}

Reuse and tune http.Transport explicitly

Transport reuse is mandatory for connection pooling behavior. Per-request transport construction defeats keep-alive and raises dial and TLS overhead.

// Bad: allocates a fresh transport repeatedly.
func newTransportBad() *http.Transport { return &http.Transport{} }

Clone and tune a shared transport once for process lifetime.

var tunedTransport = func() *http.Transport {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxIdleConns = 256
	t.MaxIdleConnsPerHost = 128
	t.MaxConnsPerHost = 128
	return t
}()

Always drain and close response bodies

Closing a response body without fully draining it can prevent the underlying connection from being reused for keep-alive.

resp, err := sharedClient.Get(url)
if err != nil {
	return err
}
defer resp.Body.Close()

if _, err := io.Copy(io.Discard, resp.Body); err != nil {
	return err
}

Bounded goroutines with backpressure

Unbounded goroutine creation often hides queueing problems until memory and tail latency spike.

// Bad: unbounded worker creation.
for _, job := range jobs {
	go process(job)
}

Use fixed workers and a bounded queue to cap concurrency and apply backpressure.

// Better: fixed workers + bounded queue.
jobsCh := make(chan Job, 256)
var wg sync.WaitGroup
for i := 0; i < 16; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		for job := range jobsCh {
			process(job)
		}
	}()
}

// Enqueue with backpressure; always close and wait so workers exit.
var overloaded bool
for _, job := range jobs {
	select {
	case jobsCh <- job:
		continue
	default:
		overloaded = true
	}
	break
}
close(jobsCh)
wg.Wait()
if overloaded {
	return errors.New("queue full")
}

Avoid unbounded queue growth

Queueing without explicit limits converts load spikes into memory growth and tail-latency failures.

// Bad: producer can enqueue indefinitely.
go func() {
	for {
		jobs <- Job{}
	}
}()

Use bounded queues and explicit overload behavior.

select {
case jobs <- Job{}:
default:
	return errors.New("overloaded")
}

sync.Pool usage with reset discipline

sync.Pool is useful for temporary objects under load, but pooled buffers must be reset before reuse.

var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// Bad: old content and capacity behavior leak across uses.
func writeBad(w io.Writer, s string) {
	b := bufPool.Get().(*bytes.Buffer)
	b.WriteString(s)
	_, _ = w.Write(b.Bytes())
	bufPool.Put(b)
}

Reset pooled buffers before reuse.

// Better: reset before use.
func writeBetter(w io.Writer, s string) {
	b := bufPool.Get().(*bytes.Buffer)
	b.Reset()
	b.WriteString(s)
	_, _ = w.Write(b.Bytes())
	bufPool.Put(b)
}

Typed JSON decode instead of map[string]any

Typed decode avoids repeated dynamic type assertions and typically reduces allocation pressure in high-volume paths.

// Bad: dynamic map decoding.
func decodeBad(r io.Reader) (map[string]any, error) {
	var v map[string]any
	err := json.NewDecoder(r).Decode(&v)
	return v, err
}

Prefer typed decode for stable request and response contracts.

type payload struct {
	User  string `json:"user"`
	Count int    `json:"count"`
}

// Better: typed struct decode.
func decodeBetter(r io.Reader) (payload, error) {
	var p payload
	dec := json.NewDecoder(r)
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return payload{}, err
	}
	return p, nil
}

If profiles still show JSON as a dominant hotspot after typed decoding, benchmark a faster JSON library with your real payloads. Keep the swap behind a small package boundary so you can revert quickly and keep behavior deterministic.

// Keep call sites stable behind a small adapter.
type JSONCodec interface {
	Unmarshal([]byte, any) error
	Marshal(any) ([]byte, error)
}

// Default implementation can wrap encoding/json.
type StdJSONCodec struct{}

func (StdJSONCodec) Unmarshal(b []byte, v any) error { return json.Unmarshal(b, v) }
func (StdJSONCodec) Marshal(v any) ([]byte, error)   { return json.Marshal(v) }

When evaluating alternatives, benchmark both speed and compatibility behavior, especially number handling, unknown fields, and HTML escaping defaults.

Keep reflection out of hot paths

Reflection is useful for tooling and generic wiring, but repeated runtime field lookup in request paths is expensive.

// Bad: field lookup by name for each call.
func getFieldBad(v any, name string) (any, bool) {
	rv := reflect.ValueOf(v)
	f := rv.FieldByName(name)
	if !f.IsValid() {
		return nil, false
	}
	return f.Interface(), true
}

Prefer typed dispatch or generated mapping in hot paths.

type userRow struct{ ID int; Name string }

func getFieldBetter(u userRow, name string) (any, bool) {
	switch name {
	case "ID":
		return u.ID, true
	case "Name":
		return u.Name, true
	default:
		return nil, false
	}
}

Avoid reflection-heavy ORM mapping on hot endpoints

Row-by-row reflection mappers can dominate CPU on large list and report endpoints.

// Bad: reflective FieldByName lookup and Set for every row.
func mapRowBad(dst reflect.Value, row map[string]any) {
	for k, v := range row {
		dst.FieldByName(k).Set(reflect.ValueOf(v))
	}
}

Prefer typed scan/mapping for high-volume paths.

type accountRow struct {
	ID   string
	Name string
}

Avoid holding locks during I/O

Contention climbs quickly when locks cover network or disk operations.

var mu sync.Mutex
var cfg Config

// Bad: lock held during network I/O.
func callBad() error {
	mu.Lock()
	defer mu.Unlock()
	_, err := sharedClient.Get(cfg.URL)
	return err
}

Take a local snapshot, then release the lock before external I/O.

// Better: copy needed state, then unlock before I/O.
func callBetter() error {
	mu.Lock()
	url := cfg.URL
	mu.Unlock()
	_, err := sharedClient.Get(url)
	return err
}

Keep logging lazy on request paths

Compute expensive log values only when the selected log level will emit the record.

// Bad: expensive value always computed.
logger.Debug("request", "summary", summarizeLargeObject(obj))

Use lazy value construction so work happens only when the log line is emitted.

type lazySummary struct{ v LargeObject }

func (l lazySummary) LogValue() slog.Value {
	return slog.StringValue(summarizeLargeObject(l.v))
}

// Better: value computed only when emitted.
logger.Debug("request", "summary", lazySummary{v: obj})

Avoid expensive caller or stack capture in hot logs

Source and stack metadata is useful, but collecting it on high-frequency paths can become a measurable CPU tax.

// Bad: computes expensive diagnostics regardless of level policy.
logger.Debug("request", "stack", debug.Stack())

Gate expensive diagnostics behind level checks or slower paths.

if logger.Enabled(context.Background(), slog.LevelDebug) {
	logger.Debug("request", "stack", string(debug.Stack()))
}

Keep log volume bounded with levels and sampling

Per-request info-level logs can become dominant CPU and I/O cost at scale.

// Bad: logs every request with expensive attributes.
slog.Info("request", "url", r.URL.String(), "headers", r.Header)

Prefer structured fields with strict level policy and sampling at hot edges.

// r.URL is already a *url.URL; it is stringified only if the record is emitted.
slog.Debug("request", "url", r.URL)

Avoid defer resource cleanup across long loops

defer is usually fine, but deferring close in a long loop delays cleanup until function return.

// Bad: files remain open until the end.
for _, name := range files {
	f, err := os.Open(name)
	if err != nil {
		return err
	}
	defer f.Close()
}

Use a per-iteration scope so each resource closes promptly.

// Better: close per-iteration.
for _, name := range files {
	f, err := os.Open(name)
	if err != nil {
		return err
	}
	func() {
		defer f.Close()
		_ = useFile(f)
	}()
}

Prefer values over slices of pointers when mutation is not needed

Slices of pointers add indirection and can reduce cache locality on tight iteration.

type item struct{ X, Y int }

// Bad: allocates one object per element.
func makePtrsBad(n int) []*item {
	out := make([]*item, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, &item{X: i})
	}
	return out
}

Prefer contiguous value slices when ownership allows it.

func makeValsBetter(n int) []item {
	out := make([]item, n)
	for i := 0; i < n; i++ {
		out[i] = item{X: i}
	}
	return out
}

Prefer streaming over read-all copy chains

Repeated full-buffer reads and conversions create avoidable allocations on request paths.

// Bad: full body + extra copy to string.
b, _ := io.ReadAll(r.Body)
s := string(b)
_ = s

Stream decode or transform whenever full materialization is unnecessary.

var p payload
if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
	return err
}

Validate escape behavior for hot packages

Unexpected heap escapes can increase allocation and GC pressure.

go build -gcflags='all=-m' ./...

Use escape analysis as a diagnostics step and then validate with alloc profiles.

Avoid context.WithValue as an option bag

Large config values in context hide dependencies and can increase memory retention.

// Bad: opaque option transport via context.
ctx = context.WithValue(ctx, cfgKey, hugeConfig)

Prefer explicit parameters or a typed request struct.

type requestCtx struct {
	Ctx context.Context
	Cfg *Config
}

GC latency from large live heap

GC pain is often a live-heap retention problem, not only an allocation-rate problem. Unbounded caches and long-lived references keep memory alive and increase GC work.

// Bad: unbounded cache growth.
var blobCache = map[string][]byte{}

func loadBad(k string) []byte {
	if v, ok := blobCache[k]; ok {
		return v
	}
	v := expensiveFetch(k)
	blobCache[k] = v
	return v
}

Prefer bounded retention so stale objects can be collected.

type boundedCache struct {
	max int
	q   []string
	m   map[string][]byte
}

func (c *boundedCache) put(k string, v []byte) {
	if _, ok := c.m[k]; !ok && len(c.q) == c.max {
		evict := c.q[0]
		c.q = c.q[1:]
		delete(c.m, evict)
	}
	if _, ok := c.m[k]; !ok {
		c.q = append(c.q, k)
	}
	c.m[k] = v
}

Reduce syscall volume with buffered I/O

Frequent small reads and writes increase syscall overhead and can dominate throughput-sensitive paths.

// Bad: one write syscall per line.
func writeLinesBad(f *os.File, lines []string) error {
	for _, s := range lines {
		if _, err := f.WriteString(s + "\n"); err != nil {
			return err
		}
	}
	return nil
}

Batch writes with buffering to reduce kernel crossings.

func writeLinesBetter(f *os.File, lines []string) error {
	w := bufio.NewWriter(f)
	for _, s := range lines {
		if _, err := w.WriteString(s); err != nil {
			return err
		}
		if err := w.WriteByte('\n'); err != nil {
			return err
		}
	}
	return w.Flush()
}

Fix algorithmic complexity before micro-optimizing

An O(n²) hot path will dominate runtime regardless of low-level tuning.

// Bad: O(n²) duplicate check.
func hasDupBad(xs []string) bool {
	for i := range xs {
		for j := i + 1; j < len(xs); j++ {
			if xs[i] == xs[j] {
				return true
			}
		}
	}
	return false
}

Use a better data structure first.

func hasDupBetter(xs []string) bool {
	seen := make(map[string]struct{}, len(xs))
	for _, x := range xs {
		if _, ok := seen[x]; ok {
			return true
		}
		seen[x] = struct{}{}
	}
	return false
}

Improve cache locality and reduce false sharing

Memory layout can limit throughput even when locks and allocations look fine.

// Bad: adjacent atomics can contend on the same cache line.
type counter struct{ n uint64 }

var ctrs = make([]counter, 64)

func incBad(i int) { atomic.AddUint64(&ctrs[i].n, 1) }

Pad independent hot counters when profiling points to false sharing.

type paddedCounter struct {
	n uint64
	_ [56]byte // 64-byte cache line on common amd64 systems
}

var ctrsPadded = make([]paddedCounter, 64)

func incBetter(i int) { atomic.AddUint64(&ctrsPadded[i].n, 1) }

Avoid large value copies on hot paths

Passing large structs by value can add hidden copy cost in frequently called code.

type big struct {
	a [1024]byte
	b [1024]byte
}

// Bad: value copy each call.
func scoreBad(x big) int { return int(x.a[0]) + int(x.b[0]) }

Use pointer parameters where measurement shows copy cost is meaningful.

func scoreBetter(x *big) int { return int(x.a[0]) + int(x.b[0]) }

Prevent goroutine leaks with cancellation

Background goroutines need a clear stop path. Without cancellation, blocked receives and sends can leak indefinitely.

// Bad: no cancellation path.
func watchBad(ch <-chan string) {
	go func() {
		msg := <-ch
		_ = msg
	}()
}

Use context or done channels to guarantee shutdown behavior.

func watchBetter(ctx context.Context, ch <-chan string) {
	go func() {
		select {
		case msg := <-ch:
			_ = msg
		case <-ctx.Done():
			return
		}
	}()
}

Avoid channel busy loops and partial deadlocks

select with a default branch can create CPU spin loops when no work is available.

// Bad: spins at 100% CPU when channel is empty.
func runBad(ch <-chan Job) {
	for {
		select {
		case j := <-ch:
			process(j)
		default:
		}
	}
}

Prefer blocking receives with explicit shutdown conditions.

func runBetter(ch <-chan Job, done <-chan struct{}) {
	for {
		select {
		case j, ok := <-ch:
			if !ok {
				return
			}
			process(j)
		case <-done:
			return
		}
	}
}

Keep startup paths light

Heavy init work can slow cold start and autoscaling responsiveness.

// Bad: expensive eager startup.
var reBad = regexp.MustCompile(veryLargePattern)

func init() {
	loadBigDictionary()
}

Move non-critical setup to lazy or background initialization.

var (
	reOnce sync.Once
	reGood *regexp.Regexp
)

func getRegexp() *regexp.Regexp {
	reOnce.Do(func() { reGood = regexp.MustCompile(veryLargePattern) })
	return reGood
}

Use string primitives before regex in hot paths

Regex engines are powerful but expensive compared to direct string operations for simple prefix, suffix, or containment checks.

// Bad: regex evaluation for a simple literal prefix check.
if re.MatchString(s) {
	handle()
}

Prefer strings helpers when they express the same rule.

if strings.HasPrefix(s, "acct:") {
	handle()
}

Build and runtime defaults

These defaults keep artifacts reproducible and predictable across developer machines and CI. Apply them as release defaults, then use explicit debug overrides for investigation builds.

-trimpath for reproducible paths

Use -trimpath to remove machine-local paths from build artifacts. This reduces environment-specific differences between developer and CI outputs.

go build -trimpath ./cmd/myservice

-buildvcs=false for deterministic build metadata

Use -buildvcs=false when you want deterministic binaries across detached checkouts and CI metadata variations.

go build -buildvcs=false ./cmd/myservice

-ldflags='-s -w' for release binary size

Use -ldflags='-s -w' for release builds to strip symbol and DWARF data and reduce artifact size.

go build -ldflags='-s -w' ./cmd/myservice

Keep optimized and debug builds separate

Keep release defaults optimized and reproducible. Use a separate debug target for investigation flags such as disabled inlining and extra compiler diagnostics.

make build
make build-debug DEBUG_GCFLAGS='all=-N -l -m'

-pgo for profile-guided optimization

Enable -pgo only with representative production-like profiles and keep a non-PGO fallback build available.

go build -pgo=default.pgo ./cmd/myservice

GOAMD64 for CPU baseline tuning

GOAMD64 can improve throughput on newer hardware, but it also raises minimum CPU requirements. Match the level to your oldest deployment target.

GOAMD64=v3 go build ./cmd/myservice

CGO_ENABLED=0 with netgo,osusergo for portable static behavior

Use pure-Go DNS and user lookup behavior when portability and minimal runtime dependencies matter more than libc-specific resolver behavior.

CGO_ENABLED=0 go build -tags='netgo,osusergo' ./cmd/myservice

Tune GOGC for GC CPU vs memory tradeoff

GOGC changes GC frequency. Lower values reduce peak heap at the cost of more GC CPU; higher values trade memory for fewer collections.

GOGC=50 ./myservice
GOGC=150 ./myservice

-mod=readonly for dependency stability

Keep dependency updates explicit. Use -mod=readonly in normal build and test flows, then run go mod tidy only when intentionally updating dependencies.

go test -mod=readonly ./...
go mod tidy
go test -mod=readonly ./...

GOMEMLIMIT and GOMAXPROCS for container runtime behavior

Validate runtime memory and CPU settings in staging with realistic load. Go 1.25+ adjusts GOMAXPROCS from container CPU limits by default, but services should still verify effective runtime values and tail-latency behavior.

GOMEMLIMIT=1GiB ./myservice
GOMAXPROCS=4 ./myservice

Benchmark and comparison workflow

Use repeated benchmark runs and compare with benchstat rather than relying on a single benchmark output.

# Baseline run (before the change):
go test ./... -run=^$ -bench=BenchmarkHotPath -benchmem -count=10 > old.txt
# Candidate run (after applying the change):
go test ./... -run=^$ -bench=BenchmarkHotPath -benchmem -count=10 > new.txt
benchstat old.txt new.txt

Trace for blocking, network, and syscall analysis

Use trace when CPU or alloc profiles do not fully explain tail latency, blocking behavior, or scheduler delays.

curl -sS 'http://127.0.0.1:6060/debug/pprof/trace?seconds=5' -o trace.out
go tool trace trace.out
go tool trace -pprof=sync trace.out > sync.pprof
go tool trace -pprof=net trace.out > net.pprof
go tool trace -pprof=syscall trace.out > syscall.pprof

Runtime metrics and CI guardrails

Track runtime metrics continuously and gate regressions in CI. At minimum, track goroutine count, heap growth, allocation rate, and GC CPU pressure, then pair those with benchmark trend checks.

Use canary rollout plus profile comparison for high-impact performance changes.

# Example: minimal Prometheus alert ideas for Go runtime behavior.
groups:
  - name: go-runtime
    rules:
      - alert: GoGoroutinesHigh
        expr: go_goroutines > 2000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High goroutine count"
      - alert: GoHeapGrowing
        expr: increase(go_memstats_heap_inuse_bytes[15m]) > 200000000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Heap in-use growth trend"
# Example: CI perf gate step (baseline artifact + new run).
go test ./... -run=^$ -bench='BenchmarkHotPath$' -benchmem -count=10 > new.txt
benchstat old.txt new.txt
# Example: canary verification loop.
curl -sS 'http://127.0.0.1:6060/debug/pprof/profile?seconds=20' -o canary-cpu.pprof
curl -sS 'http://127.0.0.1:6060/debug/pprof/heap?gc=1' -o canary-heap.pprof
go tool pprof -top canary-cpu.pprof

Continuous profiling in production

Point-in-time profiling is useful for incidents, but recurring regressions are easier to catch with continuous profiling.

Use a pprof-compatible pipeline (for example Pyroscope or Parca) and verify two properties before broad rollout: overhead on representative load and strict access control for collected profiling data.

// Example: keep net/http/pprof on an internal-only admin listener.
// Handlers below come from the net/http/pprof package (import "net/http/pprof").
func startDebugServer() {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	go func() {
		_ = http.ListenAndServe("127.0.0.1:6060", mux)
	}()
}
# Example: periodic profile capture job.
while true; do
  ts=$(date +%Y%m%d-%H%M%S)
  curl -sS "http://127.0.0.1:6060/debug/pprof/profile?seconds=20" -o "cpu-$ts.pprof"
  sleep 300
done

Avoid unnecessary cgo in request-critical paths

cgo can be the right choice for specific capabilities, but boundary crossings and operational complexity can outweigh benefits on hot paths.

Prefer pure Go implementations unless profiling and benchmark data justify cgo.

Avoid unsafe micro-optimizations without evidence

unsafe may reduce copies in narrow cases, but it increases correctness and maintenance risk.

// Bad: unsafe conversion couples code to runtime representation details.
func bytesToStringBad(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

Prefer safe conversions unless performance evidence and safety constraints are both explicit.

func bytesToStringBetter(b []byte) string { return string(b) }

Validate monetary CLI inputs without per-item big.Rat churn

Command validation can become a measurable cost when large .bus batches are preflighted before dispatch. In bus/internal/dispatch, the journal and bank validators currently parse monetary strings with fresh big.Rat values on each posting or --set amount=... field. That preserves exactness, but it also introduces high allocation rates when repeated across hundreds of commands.

Benchmarks in bus/internal/dispatch/run_bench_test.go show this pattern clearly on a representative development machine:

  • BenchmarkValidateJournalAddSingle about 859 ns/op with 28 allocs/op
  • BenchmarkValidateBankAddTransactionsSingle about 314 ns/op with 7 allocs/op
  • BenchmarkValidateBusfileCommandsJournalAdd about 214641 ns/op with 7168 allocs/op for 256 commands, showing that batch-level validation scales the per-command overhead linearly

Use this benchmark loop to verify baseline and any optimization candidate before changing parser behavior:

go test ./internal/dispatch -run '^$' \
  -bench 'BenchmarkValidate(JournalAddSingle|BankAddTransactionsSingle|BusfileCommandsJournalAdd|BusfileCommandsBankAddTransactions)$' \
  -benchmem

The safer optimization direction is to keep exact decimal semantics while reducing temporary object churn, for example by using a lower-allocation decimal parser for validation-only checks and reusing parse buffers where possible. Keep error messages and accepted input formats stable, and rerun the same benchmarks plus command-level tests after each parser change.

Treat .bus preflight parsing as a measurable hot path

For larger batch runs, parsing and tokenization can be a first-order cost before command execution even starts. In bus/internal/dispatch, collectBusfileCommands scans and tokenizes each logical line, and benchmark results show notable allocation pressure on this path.

Current benchmarks in bus/internal/dispatch/run_bench_test.go show:

  • BenchmarkTokenizeBusLineJournalAdd around 590 ns/op with 15 allocs/op
  • BenchmarkCollectBusfileCommands around 491824 ns/op with 385356 B/op and 8709 allocs/op for a 512-line busfile

This suggests optimization work should target parser memory churn first (token buffers, token slice growth, and repeated temporary string construction), not only downstream dispatch.

Use this benchmark command when iterating on parser changes:

go test ./internal/dispatch -run '^$' \
  -bench 'Benchmark(TokenizeBusLineJournalAdd|CollectBusfileCommands)$' \
  -benchmem

Maintain exact parser behavior while optimizing: quoted/escaped token handling, disallowed token diagnostics, include-cycle detection, and line-numbered syntax error reporting must remain unchanged. Pair parser benchmarks with existing busfile syntax tests on each change.

Avoid redundant filesystem checks in command discovery scans

CLI command discovery often starts with ReadDir and then filters entries by naming convention. A common performance trap is calling os.Stat again for candidates that were already surfaced as directory entries, especially when PATH directories are large.

In bus/internal/dispatch, BenchmarkListSubcommandsDensePath currently measures around 1323191 ns/op with 302920 B/op and 3458 allocs/op for a directory containing 256 command-like files and 1024 non-command files. This is a useful signal that subcommand listing can become a visible cost in help/usage paths for larger developer environments.

Use this benchmark to validate discovery-path optimizations:

go test ./internal/dispatch -run '^$' \
  -bench 'BenchmarkListSubcommandsDensePath$' \
  -benchmem

Preferred optimization direction is to reduce duplicate metadata probes and short-lived allocations while preserving behavior: PATH left-to-right precedence, deterministic dedupe, and lexicographic output sorting must remain identical.

Avoid unconditional overlay existence probes on TxFS read paths

In bus/internal/txfs, both FS.Open and FS.Stat currently probe overlay/<rel> with os.Stat before falling back to the base workspace path. For unchanged files that only exist in the base tree, this adds an extra filesystem metadata syscall and path work on every read/stat call.

Benchmarks in bus/internal/txfs/txfs_bench_test.go on Apple M2 Max (darwin/arm64) show the current shape:

  • BenchmarkOpenReadExistingPath/file.txt about 100569 ns/op, 840 B/op, 8 allocs/op
  • BenchmarkOpenReadExistingPath/deep/nested/tree/path/file.txt about 100664 ns/op, 1000 B/op, 8 allocs/op
  • BenchmarkStatExistingPath/file.txt about 3216 ns/op, 928 B/op, 7 allocs/op
  • BenchmarkStatExistingPath/deep/nested/tree/path/file.txt about 3410 ns/op, 1104 B/op, 7 allocs/op

Use this benchmark loop when iterating on read-path fast paths:

go test ./internal/txfs -run '^$' \
  -bench 'Benchmark(OpenReadExistingPath|StatExistingPath)$' \
  -benchmem

Preferred optimization direction is to avoid unconditional overlay metadata probes for clearly unchanged paths (for example, by using change/tombstone tracking as a fast-path gate) while preserving behavior:

  • overlay content must still take precedence over base files,
  • tombstoned files/trees must continue to return os.ErrNotExist,
  • no stale reads are allowed after overlay writes/renames/removes.

Avoid full PATH rescans per command during busfile preflight

In bus/internal/dispatch, preflightDispatchTargets currently calls lookPathEnv("bus-"+target, env) for each command. Each lookup re-reads and re-splits PATH, then linearly probes directories again. For batch runs with many unique targets and wider PATH values, this creates avoidable O(commands*path_dirs) scan work plus repeated allocation churn.

Benchmarks in bus/internal/dispatch/run_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkPreflightDispatchTargetsRepeatedLookups about 477254 ns/op, 126983 B/op, 1280 allocs/op
  • BenchmarkPreflightDispatchTargetsUniqueLookups about 494364 ns/op, 127030 B/op, 1280 allocs/op
  • BenchmarkPreflightDispatchTargetsUniqueLookupsWidePath about 3466793 ns/op, 1203652 B/op, 8388 allocs/op

Use this benchmark loop when iterating on preflight path-resolution changes:

go test ./internal/dispatch -run '^$' \
  -bench 'BenchmarkPreflightDispatchTargets(Repeated|Unique)Lookups(WidePath)?$' \
  -benchmem

Preferred optimization direction is to prepare path lookup state once per preflight (for example, a one-time executable index or cached PATH split/search structure) and reuse it across commands. Preserve behavior while optimizing:

  • in-process runner precedence and shell-lookup gating semantics must remain unchanged,
  • PATH left-to-right resolution behavior must stay identical,
  • missing-target diagnostics and exit codes must remain stable.

Avoid full change-map sort and per-file sync-heavy commit loops in TxFS

In bus/internal/txfs, FS.Commit builds a full key slice from fs.changes, sorts it, and then applies each replace/delete entry with filesystem operations that include file syncs. For large replace sets, this compounds sort work with high syscall churn and makes commit latency scale steeply with changed-file count.

Benchmarks in bus/internal/txfs/txfs_bench_test.go on Apple M2 Max (darwin/arm64) show the current shape:

  • BenchmarkCommitReplaceManyFiles/files_10 about 45771068 ns/op, 355190 B/op, 234 allocs/op
  • BenchmarkCommitReplaceManyFiles/files_100 about 682742000 ns/op, 3564589 B/op, 2477 allocs/op
  • BenchmarkCommitReplaceManyFiles/files_500 about 3482517125 ns/op, 18012672 B/op, 14030 allocs/op

Use this benchmark loop when iterating on commit-path changes:

go test ./internal/txfs -run '^$' \
  -bench 'BenchmarkCommitReplaceManyFiles$' \
  -benchmem

Preferred optimization direction is to reduce repeated commit bookkeeping and syscall overhead while preserving transactional safety. Guardrails:

  • keep deterministic and safe application of deletes/replaces (no partial write visibility),
  • preserve atomic replace behavior for files (tmp+rename or equivalent safety),
  • do not regress tombstone semantics or overlay-to-root precedence,
  • retain crash-recovery expectations for fs provider transaction flow.

Avoid full-content compares for unchanged-file detection in workspace merge paths

In bus/internal/dispatch, mergeWorkspaceChangesToTxFS walks both trees and calls filesEqual for same-path files. filesEqual opens both files and performs full byte-by-byte reads for equal-content files. For unchanged trees, this means merge cost scales with total bytes scanned even when no writes are produced.

Benchmarks in bus/internal/dispatch/run_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkMergeWorkspaceChangesToTxFSUnchangedTree/files_64_size_4096 about 24599246 ns/op, 422585 B/op, 2794 allocs/op
  • BenchmarkMergeWorkspaceChangesToTxFSUnchangedTree/files_128_size_65536 about 194114535 ns/op, 852587 B/op, 5606 allocs/op
  • Supporting primitive cost: BenchmarkFilesEqualSameSizeEqualContent about 237542 ns/op, 1072 B/op, 10 allocs/op

Use this benchmark loop when iterating on unchanged-file detection changes:

go test ./internal/dispatch -run '^$' \
  -bench 'Benchmark(MergeWorkspaceChangesToTxFSUnchangedTree|FilesEqualSameSize(EqualContent|DifferentContent))$' \
  -benchmem

Preferred optimization direction is to short-circuit unchanged files before full-content scan when safe (for example, stable metadata gate and optional digest path) while preserving transactional correctness. Guardrails:

  • never skip real content differences (no false-equal outcomes),
  • maintain exact replace/delete behavior in the TxFS overlay,
  • keep deterministic behavior across filesystems and platforms,
  • preserve existing error handling and non-regular file filtering semantics.

Avoid full workspace clone and full-tree snapshot/merge for each fs-transaction command

In bus/internal/dispatch, runModuleViaTempWorkspaceAndMerge currently performs a full workspace copy into a temp directory, captures a full snapshot, runs the module, then walks/merges the full tree back into TxFS. This happens per command in provider=fs mode and scales with workspace size even when nothing changes or only one file changes.

Benchmarks in bus/internal/dispatch/run_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkRunModuleViaTempWorkspaceAndMerge/unchanged_files_32_size_4096 about 153254095 ns/op, 1411168 B/op, 2749 allocs/op
  • BenchmarkRunModuleViaTempWorkspaceAndMerge/mutate_one_file_files_32_size_4096 about 163503684 ns/op, 1447438 B/op, 2782 allocs/op
  • BenchmarkRunModuleViaTempWorkspaceAndMerge/unchanged_files_64_size_16384 about 421904750 ns/op, 2831296 B/op, 5493 allocs/op
  • BenchmarkRunModuleViaTempWorkspaceAndMerge/mutate_one_file_files_64_size_16384 about 460369958 ns/op, 2866765 B/op, 5527 allocs/op

Use this benchmark loop when iterating on fs-transaction staging changes:

go test ./internal/dispatch -run '^$' \
  -bench 'BenchmarkRunModuleViaTempWorkspaceAndMerge$' \
  -benchmem

Preferred optimization direction is change-scoped staging/merge instead of whole-tree clone/diff per command, while preserving behavior:

  • preserve exact command-side filesystem view semantics for in-process fs transactions,
  • keep .git and .bus/tx ignore semantics unchanged,
  • retain deterministic merge/delete behavior and transactional failure handling,
  • do not regress final committed workspace results or diagnostics.

Avoid repeated OpenFile setup work on already-materialized TxFS paths

In bus/internal/txfs, repeated FS.OpenFile calls on the same path still execute path normalization/index bookkeeping and overlay-directory setup logic each time. In append-heavy command loops this overhead is measurable even after the file is already materialized in the overlay.

Benchmarks in bus/internal/txfs/txfs_bench_test.go on Apple M2 Max (darwin/arm64) show the current steady-state shape:

  • BenchmarkOpenFileRepeatedExistingPath/file.txt about 82299 ns/op, 520 B/op, 7 allocs/op
  • BenchmarkOpenFileRepeatedExistingPath/deep/nested/tree/path/file.txt about 85468 ns/op, 792 B/op, 7 allocs/op

Use this benchmark loop when iterating on TxFS write-path fast paths:

go test ./internal/txfs -run '^$' \
  -bench 'BenchmarkOpenFileRepeatedExistingPath$' \
  -benchmem

Preferred optimization direction is a lower-overhead fast path for already-materialized files (for example, minimizing repeated per-call normalization/index churn) while preserving correctness:

  • keep tombstone semantics and delete precedence unchanged,
  • keep overlay-write precedence and path safety checks intact,
  • preserve behavior for deep paths and missing parent directories,
  • avoid stale or bypassed change tracking in commit/rollback flows.

Avoid recursive key re-sorting and custom re-marshaling in OpenAPI JSON generation

In bus-api/internal/server/server.go, openAPIDocBytes currently calls sortMapKeysRecursive and then marshals a custom sortedMap tree where each MarshalJSON step calls json.Marshal again per key/value entry. This introduces extra full-tree allocations and repeated encoding work.

Benchmarks in bus-api/internal/server/perf_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkOpenAPIDoc_EncodeWithRecursiveSort about 3985584 ns/op, 2828501 B/op, 23927 allocs/op
  • BenchmarkOpenAPIDoc_EncodeDirectMap_NoRecursiveSort about 1228124 ns/op, 1418766 B/op, 11455 allocs/op

Use this benchmark loop when iterating on OpenAPI serialization changes:

go test ./internal/server -run '^$' \
  -bench 'BenchmarkOpenAPIDoc_' \
  -benchmem

Preferred optimization direction is to rely on deterministic map encoding from encoding/json and/or cache rendered OpenAPI bytes per server/options tuple instead of re-sorting/re-encoding on each call. Guardrails:

  • preserve deterministic OpenAPI output across repeated runs (NFR-API-001),
  • keep JSON schema/path content unchanged (no dropped keys or operations),
  • preserve pretty-printed output shape expected by CLI and HTTP clients,
  • do not introduce shared mutable cache state races across requests.
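The render-once shape can be sketched with sync.Once, which also avoids shared-mutable-cache races by construction. docCache is an illustrative name; in real use the cache key would be the server/options tuple. Note that encoding/json already emits map keys in sorted order, which is what makes the recursive pre-sort pass redundant for determinism.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// docCache renders the document once and reuses the bytes on every
// later call. sync.Once makes concurrent first calls race-free.
type docCache struct {
	once  sync.Once
	bytes []byte
	err   error
}

func (c *docCache) render(doc map[string]any) ([]byte, error) {
	c.once.Do(func() {
		// encoding/json sorts map keys, so output is deterministic
		// without a recursive sortedMap wrapper.
		c.bytes, c.err = json.MarshalIndent(doc, "", "  ")
	})
	return c.bytes, c.err
}

func main() {
	c := &docCache{}
	doc := map[string]any{"openapi": "3.0.0", "info": map[string]any{"title": "bus-api"}}
	b, err := c.render(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```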

Avoid per-event JSON unmarshal+marshal round-trips in SSE emission

In bus-api/internal/server/server.go, handleEvents currently unmarshals each event payload into MutationEvent and then marshals it again before writing the SSE frame. For events already emitted as JSON payloads by the same process, this doubles encode/decode work per message.

Benchmarks in bus-api/internal/server/perf_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkSSEWrite_CurrentRoundTrip about 1728 ns/op, 849 B/op, 21 allocs/op
  • BenchmarkSSEWrite_RawPayload_WithValidation about 406.1 ns/op, 0 B/op, 0 allocs/op

Use this benchmark loop when iterating on SSE write-path changes:

go test ./internal/server -run '^$' \
  -bench 'BenchmarkSSEWrite_' \
  -benchmem

Preferred optimization direction is framing validated raw payload bytes directly (data: <json>\n\n) rather than decoding/encoding every event. Guardrails:

  • preserve SSE framing exactly (data: line + blank line terminator),
  • preserve JSON payload semantics delivered to clients,
  • keep behavior for malformed payloads deterministic (drop/skip or error path as designed),
  • avoid blocking regressions in long-lived stream handling.

Avoid full schema JSON unmarshal when only primary key fields are needed

In bus-api/internal/server/handlers.go, handleResourceRow PATCH/DELETE decodes whole schema JSON on each request to read primaryKey.values and build the key map for row mutation calls. Decoding the full schema object (including large fields) on every mutation request adds avoidable CPU and allocation cost on hot write paths.

Benchmarks in bus-api/internal/server/perf_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkPKLookup_FullSchemaUnmarshalPerRequest about 67033 ns/op, 60360 B/op, 1432 allocs/op
  • BenchmarkPKLookup_PrimaryKeyOnlyUnmarshalPerRequest about 24441 ns/op, 816 B/op, 16 allocs/op
  • BenchmarkPKLookup_CachedPrimaryKeyFields_NoUnmarshal about 88.79 ns/op, 336 B/op, 2 allocs/op

Use this benchmark loop when iterating on row-mutation key mapping changes:

go test ./internal/server -run '^$' \
  -bench 'BenchmarkPKLookup_' \
  -benchmem

Preferred optimization direction is decoding only primaryKey.values (instead of full schema) or caching per-resource PK fields with invalidation on schema mutations. Guardrails:

  • preserve primary key field order exactly when mapping URL row keys to field names,
  • preserve error behavior when schema JSON is absent/invalid or key lengths differ,
  • avoid stale PK mappings after schema updates (cache invalidation on schema-changing operations),
  • keep read-only and dry-run mutation semantics unchanged.
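The partial-decode variant can be sketched as below: a struct that names only primaryKey.values, so json.Unmarshal skips everything else in the schema document. primaryKeyFields is an illustrative helper name; the field order in the decoded slice is preserved exactly as it appears in the JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pkOnly decodes just the primary key field list from schema JSON,
// leaving the rest of the (potentially large) document untouched.
type pkOnly struct {
	PrimaryKey struct {
		Values []string `json:"values"`
	} `json:"primaryKey"`
}

func primaryKeyFields(schemaJSON []byte) ([]string, error) {
	var p pkOnly
	if err := json.Unmarshal(schemaJSON, &p); err != nil {
		return nil, err
	}
	return p.PrimaryKey.Values, nil
}

func main() {
	schema := []byte(`{"primaryKey":{"values":["id","region"]},"fields":[{"name":"id"}]}`)
	fields, err := primaryKeyFields(schema)
	if err != nil {
		panic(err)
	}
	fmt.Println(fields)
}
```

The cached variant layers on top: store the decoded slice per resource and invalidate it on schema-changing operations, which is where the stale-mapping guardrail above applies.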

Avoid rebuilding header index maps for every primary-key row filter

In bus-api/internal/server/backend.go, filterRowsByKey currently rebuilds a map[string]int from the header row on each call before scanning rows. On read-heavy paths where primary key width is small and stable, this repeated map construction adds avoidable allocations and CPU per request.

Benchmarks in bus-api/internal/server/perf_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkFilterRowsByKey_CurrentHeaderMap about 43415 ns/op, 1432 B/op, 6 allocs/op
  • BenchmarkFilterRowsByKey_LinearHeaderScan about 11572 ns/op, 48 B/op, 1 allocs/op

Use this benchmark loop when iterating on row-filter key-index lookup changes:

go test ./internal/server -run '^$' \
  -bench 'BenchmarkFilterRowsByKey_' \
  -benchmem

Preferred optimization direction is to avoid map allocation in the hot path (for example, resolving PK header indices via direct scan for small PK sets or reusing cached indices when safe). Guardrails:

  • preserve exact PK matching semantics and row-selection determinism,
  • preserve behavior when rows are short or fields are missing,
  • preserve header-first output shape (records[0] retained),
  • avoid introducing stale index reuse when schema/header shape changes.
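The map-free variant can be sketched as below (headerIndex and this filterRowsByKey body are illustrative, not the backend.go code): for the small, stable PK widths described above, a direct scan of the header row costs less than building a map[string]int per call, and the header-first output shape is kept.

```go
package main

import "fmt"

// headerIndex resolves a column by direct scan, avoiding the per-call
// map allocation; linear scan wins for small header/key sets.
func headerIndex(header []string, name string) int {
	for i, h := range header {
		if h == name {
			return i
		}
	}
	return -1
}

// filterRowsByKey keeps records[0] (the header) and rows matching every
// key field; missing or short fields never match.
func filterRowsByKey(records [][]string, key map[string]string) [][]string {
	if len(records) == 0 {
		return records
	}
	header := records[0]
	out := [][]string{header}
	for _, row := range records[1:] {
		match := true
		for name, want := range key {
			idx := headerIndex(header, name)
			if idx < 0 || idx >= len(row) || row[idx] != want {
				match = false
				break
			}
		}
		if match {
			out = append(out, row)
		}
	}
	return out
}

func main() {
	records := [][]string{{"id", "name"}, {"1", "alice"}, {"2", "bob"}}
	fmt.Println(filterRowsByKey(records, map[string]string{"id": "2"}))
}
```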

Avoid rebuilding column header lookup maps for each projection request

In bus-api/internal/server/backend.go, selectColumns rebuilds a map[string]int header index and resolves column names on every call. For repeated column= query shapes, this lookup work is repeated even when the header and selected columns are unchanged.

Benchmarks in bus-api/internal/server/perf_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkSelectColumns_CurrentHeaderMapPerCall about 223915 ns/op, 896522 B/op, 4107 allocs/op
  • BenchmarkSelectColumns_PrecomputedIndices about 205672 ns/op, 893121 B/op, 4098 allocs/op

Use this benchmark loop when iterating on column-projection lookup changes:

go test ./internal/server -run '^$' \
  -bench 'BenchmarkSelectColumns_' \
  -benchmem

Preferred optimization direction is to precompute/reuse column index slices for repeated projection shapes rather than rebuilding name-index maps every request. Guardrails:

  • preserve unknown-column behavior (nil/error path) exactly,
  • preserve projected column order as requested by the client,
  • preserve fill-with-empty-string behavior for short rows,
  • avoid stale index reuse when header/schema changes.
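The precomputed-indices shape can be sketched as below (resolveIndices and project are illustrative names): resolve name-to-position once, then apply the index slice per row, preserving the requested column order, the unknown-column failure path, and empty-string fill for short rows. Reuse across requests would key the cached slice on the (header, columns) shape.

```go
package main

import "fmt"

// resolveIndices maps requested columns to header positions once, in
// client-requested order; any unknown column fails the whole request.
func resolveIndices(header, cols []string) ([]int, bool) {
	idx := make([]int, len(cols))
	for i, c := range cols {
		idx[i] = -1
		for j, h := range header {
			if h == c {
				idx[i] = j
				break
			}
		}
		if idx[i] < 0 {
			return nil, false // preserve unknown-column error path
		}
	}
	return idx, true
}

// project applies precomputed indices to one row, filling short rows
// with the empty string.
func project(row []string, idx []int) []string {
	out := make([]string, len(idx))
	for i, j := range idx {
		if j < len(row) {
			out[i] = row[j]
		}
	}
	return out
}

func main() {
	header := []string{"id", "name", "email"}
	idx, ok := resolveIndices(header, []string{"email", "id"})
	if ok {
		fmt.Println(project([]string{"1", "a", "a@x"}, idx))
	}
}
```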

Avoid full-slice path splitting for resource route dispatch

In bus-api/internal/server/handlers.go, handleResourceScoped currently uses strings.Split on the full resource subpath to get the resource name and first segment. This allocates a full segment slice per request even though dispatch mostly needs only the first and second segments.

Benchmarks in bus-api/internal/server/perf_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkResourceScopedPathParse_CurrentSplit about 46.02 ns/op, 48 B/op, 1 allocs/op
  • BenchmarkResourceScopedPathParse_CutBased about 8.265 ns/op, 0 B/op, 0 allocs/op

Use this benchmark loop when iterating on resource-route parsing changes:

go test ./internal/server -run '^$' \
  -bench 'BenchmarkResourceScopedPathParse_' \
  -benchmem

Preferred optimization direction is cut/index-based parsing (strings.Cut or equivalent) that extracts only required segments without allocating full split slices. Guardrails:

  • preserve exact routing semantics for /resources/{name} and nested paths,
  • preserve 404 behavior for empty/malformed segment shapes,
  • preserve URL unescape and row-key dispatch behavior for nested handlers,
  • keep deterministic method dispatch and error responses unchanged.
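The cut-based parse can be sketched as below (splitFirstTwo is an illustrative helper, not the handlers.go code): two strings.Cut calls extract the resource name and the next segment with zero allocations, where strings.Split allocated a slice for every segment of the subpath.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFirstTwo extracts the resource name and the following segment
// without allocating a full split slice.
func splitFirstTwo(subpath string) (name, next string) {
	name, rest, _ := strings.Cut(subpath, "/")
	next, _, _ = strings.Cut(rest, "/")
	return name, next
}

func main() {
	fmt.Println(splitFirstTwo("users/rows/42"))
}
```

Empty and single-segment shapes fall out naturally: both returns are "" or the lone segment, so the 404 checks on empty names stay expressible exactly as before.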

Avoid full-slice path splitting in module adapter route dispatch

In bus-api/internal/backends/stub.go, module route handlers currently use strings.Split for request path parsing in dispatch methods such as handleWorkflowPath, handleEntityPath, handleSemanticPath, and handleResourcePath. These handlers only need the first few segments, so splitting the whole path allocates unnecessary slices on every request.

Benchmarks in bus-api/internal/backends/perf_bench_test.go on Apple M2 Max (darwin/arm64) show this shape:

  • BenchmarkModuleAdapterSemanticPathParse_CurrentSplit about 42.02 ns/op, 48 B/op, 1 allocs/op
  • BenchmarkModuleAdapterSemanticPathParse_CutBased about 12.89 ns/op, 0 B/op, 0 allocs/op

Use this benchmark loop when iterating on module adapter route parsing changes:

go test ./internal/backends -run '^$' \
  -bench 'BenchmarkModuleAdapterSemanticPathParse_' \
  -benchmem

Preferred optimization direction is to parse only needed leading segments with strings.Cut/index-based extraction instead of full Split. Guardrails:

  • preserve exact method dispatch and 404/405 behavior for module endpoints,
  • preserve percent-decoding behavior for aliases/workflow names/row keys,
  • preserve deterministic error payload shapes and status codes,
  • preserve path handling for both canonical and edge-case segment shapes (empty, trailing slash, nested tails).
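For handlers that need a few leading segments rather than exactly two, the same idea generalizes to a fixed-capacity extractor (leadingSegments is an illustrative name): the caller supplies a small array, so dispatch reads only the segments it needs with zero heap allocation.

```go
package main

import (
	"fmt"
	"strings"
)

// leadingSegments fills up to len(dst) leading path segments via
// strings.Cut and reports how many were found; no slice is allocated.
func leadingSegments(path string, dst []string) int {
	path = strings.TrimPrefix(path, "/")
	n := 0
	for n < len(dst) && path != "" {
		seg, rest, _ := strings.Cut(path, "/")
		dst[n] = seg
		n++
		path = rest
	}
	return n
}

func main() {
	var segs [4]string
	n := leadingSegments("/workflow/build/run", segs[:])
	fmt.Println(segs[:n])
}
```

Note that consecutive slashes produce empty segments here just as strings.Split would, which keeps edge-case segment shapes (empty, trailing slash, nested tails) behaving as they did before.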