etcd is a very complex yet amazing piece of software which accomplishes the seemingly dumb task of keeping key-value pairs, both in memory and on disk. In this post, I’m tackling the disk persistence layer.

etcd goes to great lengths to make sure your data is perfectly consistent and readily available when needed, and if configured in cluster mode it makes sure that each node has a copy of said data thanks to the Raft consensus protocol.

While this idea works great for standard key-value storage applications, I believe there could be a better way of storing sensitive data using the framework that etcd already has in place for on-disk retention.

This intuition drove me to a very cumbersome, yet interesting way to tackle this problem, data delocalization: instead of storing all your data in the same bucket, move all the sensitive stuff (Kubernetes secrets, TLS private keys…) to a separate physical layer backed by some sort of trusted environment.

This of course isn’t anything new in computer science, but I found it interesting to apply the idea to etcd.

Technology stack

To spice up the challenge I decided to play a little more with OP-TEE, an operating system for the ARM TrustZone environment: no x86 involved (yet).

OP-TEE is a fully-featured open-source Trusted Execution Environment (TEE) for the ARM platform, with a nice little community made up of experts and users.

Even though etcd doesn’t list ARM and ARM64 in its compatibility list, it can be forcefully cross-compiled for them.

etcd checks which architecture it is running on and, on an unsupported one, refuses to start unless the ETCD_UNSUPPORTED_ARCH environment variable is set.

It looks dirty, but it works.
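The gate can be pictured with a short Go sketch. This is not etcd's actual source: the function name archAllowed and the supportedArchs list are assumptions made up for illustration, and only the ETCD_UNSUPPORTED_ARCH variable comes from etcd itself. The sketch also assumes that the override is honored only when it names the architecture the binary is really running on.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// supportedArchs is a hypothetical allow-list, used only for this illustration.
var supportedArchs = []string{"amd64"}

// archAllowed reports whether a process built for goarch may start.
// override holds the value of ETCD_UNSUPPORTED_ARCH (empty when unset).
func archAllowed(goarch, override string) bool {
	for _, a := range supportedArchs {
		if a == goarch {
			return true
		}
	}
	// Assumption: the override must name the architecture we are running on.
	return override != "" && override == goarch
}

func main() {
	override := os.Getenv("ETCD_UNSUPPORTED_ARCH")
	if archAllowed(runtime.GOARCH, override) {
		fmt.Printf("starting on %q\n", runtime.GOARCH)
		return
	}
	fmt.Printf("would refuse to start on %q without ETCD_UNSUPPORTED_ARCH=%s\n",
		runtime.GOARCH, runtime.GOARCH)
}
```

On the two boards used here, that means exporting ETCD_UNSUPPORTED_ARCH=arm or ETCD_UNSUPPORTED_ARCH=arm64 before launching the daemon.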

The guinea pigs employed during these experiments are:

  • USB Armory MkII: i.MX6ULZ SoC, 512MB RAM, ARM
  • 96Boards Poplar: HiSilicon Hi3798CV200 SoC, 1GB RAM, ARM64

Both boards ran the same version of OP-TEE (bootstrapped either via ARM Trusted Firmware-A or U-Boot), same Linux distro (Debian 10) but different Linux kernel versions.

I followed a bottom-up approach, starting from the fundamentals and working my way up to the user-facing objects.

Trusted Environment Overview

In this context the trusted environment is provided by a TEE, OP-TEE, present on the system running etcd.

OP-TEE by itself does not provide storage facilities or any kind of high-level functions, but rather implements the GlobalPlatform TEE standard API, on which developers build their own specific interfaces.

Those interfaces are materialized by means of Trusted Applications, bits of software that run in the TEE.

OP-TEE executes Trusted Applications (TAs) in an area of memory dubbed TZ_DRAM, which is completely invisible to the operating system: not even the kernel can read/write/execute it.

Better-equipped SoCs, like the i.MX6ULZ, provide memory firewalls embedded in the silicon which help lock down memory access even further.

Our TEE OS makes use of those firewalls if present.

The clever people working on OP-TEE decided to reduce technical complexity and leave boring stuff like disk driver/filesystem implementation to the non-trusted OS.

They wrote a Linux kernel driver which acts as a bridge between the two worlds:

  • OP-TEE loads TAs from the Linux non-trusted environment into its memory and then executes them
  • whenever a TA wants to load or store a file, its content crosses the environments and gets read/written on disk by Linux

In a fully-secure environment OP-TEE can be configured to use eMMC RPMB storage to further protect disk access against replay attacks, and to verify the signatures of TA binaries.

Diving into TAs storage facilities

Trusted Applications write to disk through the Linux driver available in the non-trusted environment, but to guarantee privacy and consistency between a TA and its backing storage, OP-TEE adopts a simple idea: encrypt everything, always.

For each TA, a unique encryption key is derived from a Hardware Unique Key (HUK), which is unique to each SoC/device.

This means the data stored by each single TA is safe against:

  • a rogue TA which wants to eavesdrop on another TA’s storage
  • an external user-side component trying to decrypt its storage

OP-TEE encrypts your data before passing it to the non-trusted environment, so all you could spy on by watching e.g. the storage bus is just random-looking bytes.
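To make the per-TA key idea concrete, here is a minimal Go sketch, assuming an HMAC-SHA256 based derivation keyed by the device secret. OP-TEE's real key hierarchy has more steps and runs entirely inside the secure world; deriveTAKey, the placeholder secret and the TA identifiers below are all hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// deriveTAKey derives a 32-byte per-TA key from a device-unique secret and
// a TA identifier using HMAC-SHA256. Illustration only, not OP-TEE code.
func deriveTAKey(huk []byte, taID string) []byte {
	mac := hmac.New(sha256.New, huk)
	mac.Write([]byte(taID))
	return mac.Sum(nil)
}

func main() {
	// Hypothetical placeholder: a real HUK never leaves the secure world.
	huk := []byte("device-unique-secret-placeholder")

	a := deriveTAKey(huk, "ta-uuid-aaaa")
	b := deriveTAKey(huk, "ta-uuid-bbbb")

	fmt.Printf("TA A key: %x\n", a)
	fmt.Printf("TA B key: %x\n", b)
	fmt.Println("keys differ:", !hmac.Equal(a, b))
}
```

Because the TA identifier is part of the derivation input, two TAs on the same device end up with unrelated keys, and the same TA on a different device (different HUK) gets yet another one.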

From planning to action

Taking those assumptions as solid enough, I started working on a storage prototype: secs[1].

This prototype implements the bare minimum operations needed to read/write data.

The following components have been developed for the prototype:

  • a TA implementing secure storage facilities
  • a Go library

The TA and the accompanying userland library are written in C, which made this project even more interesting!

Communication between the trusted and non-trusted worlds happens in an RPC-like pattern, except that the boundaries to cross aren’t geographical but in-silicon ones.

This is the userland library interface:

TEEC_Result query(const char* key, uint32_t* val_sz);
TEEC_Result get(const char* key, void* dest, uint32_t dest_sz);
TEEC_Result set(const char* key, const char* val);

Under the hood, each key gets hashed, and the resulting digest is what the requested operation then works with:

TEE_Result res = TEE_ERROR_GENERIC;

// sha256 output is 256 bits
size_t key_hash_size = 32;
char *key_hash =
    TEE_Malloc(key_hash_size * sizeof(char), TEE_MALLOC_FILL_ZERO);
if (!key_hash) {
	EMSG("could not allocate enough memory for digest output");
	return TEE_ERROR_OUT_OF_MEMORY;
}

res = do_sha(key, key_hash);
if (res != TEE_SUCCESS) {
	EMSG("could not do sha: 0x%x", res);
	TEE_Free(key_hash);
	return res;
}

// use key!

TEE_Free(key_hash);
return res;
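For readers more at home in Go than in TA-side C, the hashing step can be sketched as follows. The function name objectID is hypothetical, and the post does not show how the TA uses the digest afterwards, so this only illustrates turning an arbitrary-length key into a fixed-size SHA-256 identifier.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// objectID maps an arbitrary-length key to a fixed-size, hex-encoded
// SHA-256 digest (32 bytes, hence 64 hex characters).
func objectID(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

func main() {
	for _, k := range []string{"hello", "a/much/longer/key/with/slashes"} {
		fmt.Printf("%-32s -> %s\n", k, objectID(k))
	}
}
```

Whatever the key looks like, the identifier always has the same length, which keeps buffer sizing on the C side simple.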

Storing a key-value pair is trivial:

char *value = "world!";
char *key = "hello";

TEEC_Result res = set(key, value);
if (res != TEEC_SUCCESS) {
	// bad stuff happened!
	return 1;
}

Fetching data from the secure store isn’t quite as trivial, but it is still easy.

Since we’re dealing with C and manual memory management, we must know the size of our payload beforehand, so that we can allocate a buffer big enough to hold the data.

To do so we must query the store TA for our key metadata, and then ask it nicely to retrieve the associated content for us:

char *key = "hello";
uint32_t data_sz = 0;

TEEC_Result res = query(key, &data_sz);
if (res != TEEC_SUCCESS) {
	// bad stuff happened!
	return 1;
}

// we now know how big our buffer must be; one extra zeroed byte keeps
// the buffer NUL-terminated for printf even if the payload is not
char *value = calloc((size_t)data_sz + 1, sizeof(char));
if (!value) {
	// more bad stuff!
	return 1;
}

res = get(key, value, data_sz);
if (res != TEEC_SUCCESS) {
	// once more, bad stuff!
	free(value);
	return 1;
}

printf("%s: %s\n", key, value);
free(value);

The Go interface is quite a bit easier to use for its consumer, and wraps all the complexity behind just two exported functions, Get and Set:

func query(key string) (C.uint, error) {
	ck := C.CString(key)
	defer C.free(unsafe.Pointer(ck))

	var ci C.uint
	ret := C.query(ck, &ci)
	if ret != 0 {
		return 0, fmt.Errorf("TA error: 0x%x", ret)
	}

	return ci, nil
}

func Get(key string) (string, error) {
	ck := C.CString(key)
	defer C.free(unsafe.Pointer(ck))

	l, err := query(key)
	if err != nil {
		return "", err
	}

	if l == 0 {
		// zero-length payload: &dest[0] below would panic on an empty slice
		return "", nil
	}

	dest := make([]byte, uint32(l))

	ret := C.get(ck, unsafe.Pointer(&dest[0]), l)
	if ret != 0 {
		return "", fmt.Errorf("TA error: 0x%x", ret)
	}

	return string(dest), nil
}

func Set(key, value string) error {
	ck1 := C.CString(key)
	defer C.free(unsafe.Pointer(ck1))

	ck2 := C.CString(value)
	defer C.free(unsafe.Pointer(ck2))

	ret := C.set(ck1, ck2)
	if ret != 0 {
		return fmt.Errorf("TA error: 0x%x", ret)
	}

	return nil
}
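From the consumer's point of view, a round trip looks roughly like the sketch below. Since the real library needs cgo and an OP-TEE device, the example swaps in an in-memory stand-in: the Store interface, memStore and roundTrip are hypothetical names introduced only so the snippet runs anywhere.

```go
package main

import (
	"errors"
	"fmt"
)

// Store mirrors the two exported functions of the library
// (hypothetical interface, not part of secs).
type Store interface {
	Set(key, value string) error
	Get(key string) (string, error)
}

// memStore is an in-memory stand-in so the example runs without OP-TEE.
type memStore map[string]string

var errNotFound = errors.New("key not found")

func (m memStore) Set(key, value string) error {
	m[key] = value
	return nil
}

func (m memStore) Get(key string) (string, error) {
	v, ok := m[key]
	if !ok {
		return "", errNotFound
	}
	return v, nil
}

// roundTrip stores a value and reads it back through the Store interface.
func roundTrip(s Store, key, value string) (string, error) {
	if err := s.Set(key, value); err != nil {
		return "", fmt.Errorf("set %q: %w", key, err)
	}
	return s.Get(key)
}

func main() {
	got, err := roundTrip(memStore{}, "hello", "world!")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("hello: %s\n", got)
}
```

The size query and buffer allocation from the C example never surface here, which is the whole point of the wrapper.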

Since this is a cgo-enabled library, and because the C parts must link against the OP-TEE bridging library, another small bit of generated Go code is needed to glue everything together:

package secs
 
// #cgo LDFLAGS: -L./libs -lsecs -lteec -L/home/gsora/Documents/optee/optee_client/out/export/usr/lib
// #cgo CFLAGS: -Wall -I./ta/include -I/home/gsora/Documents/optee/optee_client/out/export/usr/include -I./include 
// #include "secs.h" 
import "C"

The hardcoded paths in LDFLAGS and CFLAGS are determined by your OP-TEE SDK build artifacts; the Makefile generates those lines based on your environment configuration.

Welding secs onto etcd

Now that the secure storage has been defined, it’s time to get our hands dirty in the etcd codebase.

I had to somehow find a way to hook into the GET/SET operations, and to differentiate between secure and non-secure calls.

The first thing I did was understand the communication mechanism behind the scenes.

In a nutshell, etcd is just a gRPC service consuming requests which comply with a well-defined interface, so the obvious thing to do was add some sort of “marking” on the messages.

RangeRequest is the request sent by a client when doing a GET operation:

message RangeRequest {
  // other stuff...

  // key is the first key for the range. If range_end is not given, the request only looks up key.
  bytes key = 1;

  // other stuff...

  // secure when set does lookup on the secure storage
  bool secure = 14;
}

while PutRequest is sent for PUT operations:

message PutRequest {
  // other stuff...

  // key is the key, in bytes, to put into the key-value store.
  bytes key = 1;
  // value is the value, in bytes, to associate with the key in the key-value store.
  bytes value = 2;

  // other stuff...

  // If secure is true, key and value will be stored on the safe device.
  bool secure = 7;
}

Then it was time to actually track down where those messages are handled.

After a somewhat long CTRL+LeftClick adventure, I located the center of the message handling operations in the etcd/etcdserver/apply.go file, which defines the applierV3 interface:

// applierV3 is the interface for processing V3 raft messages
type applierV3 interface {
	// stuff...

	Put(txn mvcc.TxnWrite, p *pb.PutRequest) (*pb.PutResponse, *traceutil.Trace, error)
	Range(ctx context.Context, txn mvcc.TxnRead, r *pb.RangeRequest) (*pb.RangeResponse, error)
	DeleteRange(txn mvcc.TxnWrite, dr *pb.DeleteRangeRequest) (*pb.DeleteRangeResponse, error)

	// more stuff...
}

This interface is then implemented by the applierV3backend.

Since etcd is a well-written piece of software, the mvcc package makes sure to guarantee concurrent and consistent data access under all circumstances.

Essentially, for each request message handled an mvcc.Txn gets created, and then passed to the mvcc disk retention backend to be persisted on disk.

By implementing the mvcc.TxnRead and mvcc.TxnWrite interfaces on a new custom type, etcd was now able to use secs:

func (tw *tzStoreTxnWrite) Put(key, value []byte, lease lease.LeaseID) int64 {
	// set key in tz
	err := secs.Set(string(key), string(value))
	if err != nil { // haha trustzone errors go brrrrr
		panic(fmt.Errorf("cannot write on tz: %w", err))
	}
	return 1
}

func (tr *tzStoreTxnRead) Range(key, end []byte, ro RangeOptions) (r *RangeResult, err error) {
	// here, return the key lookup for key
	value, err := secs.Get(string(key))
	if err != nil {
		// a failed read is reported to the caller rather than crashing the server
		return nil, fmt.Errorf("cannot read from tz: %w", err)
	}

	r = &RangeResult{
		Count: 1,
		KVs: []mvccpb.KeyValue{
			{
				Key:            key,
				CreateRevision: 1,
				ModRevision:    2,
				Version:        1,
				Value:          []byte(value),
				Lease:          0,
			},
		},
		Rev: 1,
	}

	return r, nil
}

After that, it was just a matter of telling applierV3backend when to switch to the secure storage instead of the main one:

func (a *applierV3backend) Put(txn mvcc.TxnWrite, p *pb.PutRequest) (resp *pb.PutResponse, trace *traceutil.Trace, err error) {
	resp = &pb.PutResponse{}

	// stuff...
	val, leaseID := p.Value, lease.LeaseID(p.Lease)
	if txn == nil {
		// stuff...
		switch p.Secure {
		case true:
			store := &mvcc.TzStore{}
			txn = store.Write(trace)
		default:
			txn = a.s.KV().Write(trace)
		}
		defer txn.End()
	}
	
	// stuff...
	return resp, trace, nil
}

func (a *applierV3backend) Range(ctx context.Context, txn mvcc.TxnRead, r *pb.RangeRequest) (*pb.RangeResponse, error) {
	resp := &pb.RangeResponse{}

	// stuff...
	if txn == nil {
		switch r.Secure {
		case true:
			store := &mvcc.TzStore{}
			txn = store.Read(trace)
		default:
			txn = a.s.KV().Read(trace)
		}
		defer txn.End()
	}

	// stuff...
	return resp, nil
}
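Stripped of etcd's types, the switch boils down to picking a backend per request based on a boolean. The sketch below shows that routing pattern in isolation; backend, mapBackend and router are hypothetical stand-ins, not etcd or mvcc types.

```go
package main

import "fmt"

// backend is a simplified stand-in for a key-value transaction target.
type backend interface {
	Put(key, value string)
	Get(key string) (string, bool)
}

// mapBackend keeps everything in a Go map; name is only a label.
type mapBackend struct {
	name string
	data map[string]string
}

func newMapBackend(name string) *mapBackend {
	return &mapBackend{name: name, data: map[string]string{}}
}

func (b *mapBackend) Put(key, value string) { b.data[key] = value }

func (b *mapBackend) Get(key string) (string, bool) {
	v, ok := b.data[key]
	return v, ok
}

// router picks a backend per request based on the secure flag.
type router struct {
	plain, secure backend
}

func (r *router) pick(secure bool) backend {
	if secure {
		return r.secure
	}
	return r.plain
}

func (r *router) Put(key, value string, secure bool) { r.pick(secure).Put(key, value) }

func (r *router) Get(key string, secure bool) (string, bool) { return r.pick(secure).Get(key) }

func main() {
	r := &router{plain: newMapBackend("main"), secure: newMapBackend("tz")}
	r.Put("tls-key", "s3cr3t", true)

	_, inPlain := r.Get("tls-key", false)
	v, inSecure := r.Get("tls-key", true)
	fmt.Println("plain:", inPlain, "secure:", inSecure, "value:", v)
}
```

One consequence of this design is that the two stores are separate namespaces: a key written with the flag set is invisible to a read that omits it.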

After a couple of changes here and there in the client code and in etcdctl, I was done!

Thanks to the clear separation between business logic and the distributed clustering protocol Raft, those changes translate seamlessly into a CPU architecture-hybrid paradigm.

I built a two-node etcd cluster running those changes between the Armory MkII (running ARM code) and the Poplar board (running ARM64 code) with zero issues, and since the message distribution is completely transparent to the storage layer, everything just works.

Wrapping up

There are of course performance penalties, since for each secure GET/PUT call the CPU switches between the trusted and non-trusted environments.

On a single-node instance running on the Armory MkII:

  • OP-TEE debug mode:
    • PUT: 479.425426ms
    • GET: 598.26284ms
  • OP-TEE release mode:
    • PUT: 282.127417ms
    • GET: 98.775804ms

At the time of writing, hashing and encryption operations are not hardware accelerated on the i.MX6ULZ SoC, although a recent pull request enabled this feature. More benchmarks coming soon.
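The post does not say how these numbers were collected, so the following is only one possible way to take comparable measurements from Go: time a batch of calls and divide by the count. averageLatency is a hypothetical helper, and the map stands in for the real store.

```go
package main

import (
	"fmt"
	"time"
)

// averageLatency runs op n times and returns the mean wall-clock duration
// of a single call. It returns 0 when n is not positive.
func averageLatency(n int, op func()) time.Duration {
	if n <= 0 {
		return 0
	}
	start := time.Now()
	for i := 0; i < n; i++ {
		op()
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	store := map[string]string{} // stand-in for the real secure store

	put := averageLatency(100, func() { store["hello"] = "world!" })
	get := averageLatency(100, func() { _ = store["hello"] })

	fmt.Println("PUT:", put, "GET:", get)
}
```

Averaging over many calls smooths out one-off costs such as the first session setup with the TA, which would otherwise dominate a single measurement.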

This project has been a fun ride.

I love working in uncommon environments, and putting new ideas to the test in a short amount of time even more.

I believe this work is a testament to the flexibility of the ARM platform and the interesting outcomes of TEE programming in common contexts: plain old hardware security modules are secure but mostly dumb, easily-programmable secure environments are the future!


  1. secure storage