Kubernetes : Torus, a cloud native distributed file system

This article is part 2 of the "Kubernetes deep dive" series:

  1. Kubernetes on AWS with kube-aws
  2. Kubernetes : Torus, a cloud native distributed file system
  3. Kubernetes : Ingress Controller with Træfɪk and Let's Encrypt
  4. Kubernetes : Træfɪk and Let's Encrypt at scale

Still in the Kubernetes series, this week let's take a look at Torus, a cloud native distributed file system developed by CoreOS that provides persistent storage to Kubernetes pods.

What is Torus?

In the container world, persistent storage is still a major issue: stateless applications are easily moved to container-based environments and usually do not need persistent data. This is not the case for stateful applications such as databases. For Docker, some plugins are already available:

  • GlusterFS
  • NFS
  • Convoy (abstracts GlusterFS and NFS backends to Docker)

For Kubernetes volumes, several backends are supported:

  • iSCSI
  • GlusterFS
  • rbd (Ceph)
  • NFS
  • Volumes based on Cloud Provider (GCE PD, AWS EBS, Azure Volume)

With the exception of cloud provider backends, most are difficult to deploy (like Ceph) or to scale in a container-based infrastructure composed of small, expendable instances.

In June 2016, CoreOS announced a new distributed file system: Torus, whose goal is to serve as cloud native distributed storage.

Torus

In the spirit of microservices and GIFEE (Google Infrastructure For Everyone Else), Torus focuses on storage: metadata and consensus are handled by etcd, a highly scalable key/value store used by Kubernetes and Docker Swarm. Torus supports the features of a modern file system, such as replication and load balancing of data.

What are the differences with GlusterFS or NFS? For now, Torus only provides block storage, via the NBD protocol (Network Block Device). By design, it will be possible to expose other storage types such as object storage or file systems. Torus is written in Go and statically linked, which makes it easily portable into containers. Torus is split into 3 binaries:

  • torusd : Torus daemon
  • torusctl : CLI to manage Torus
  • torusblk : CLI to manage Torus Block Devices

Kubernetes integration with Flex Volume

Torus is independent from Kubernetes. It can be installed as a standalone application and expose volumes via NBD. These volumes can then be used like any other block device.
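
As an illustration, a standalone workflow could look roughly like the sketch below, assuming a torusd daemon is already running, etcd is reachable on its default endpoint, and the torusblk nbd subcommand behaves as described in the Torus README (the exact syntax may vary between versions):

# create a 1 GiB block volume in the Torus pool
torusctl volume create-block demo 1GiB

# attach it locally through NBD (requires the nbd kernel module);
# torusblk keeps running in the foreground while the device is attached
sudo torusblk nbd demo /dev/nbd0

# in another shell, the device behaves like any other block device
sudo mkfs.ext4 /dev/nbd0
sudo mount /dev/nbd0 /mnt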

In addition to built-in volumes, Kubernetes has a FlexVolume plugin that allows custom storage solutions to implement storage drivers without touching the core of Kubernetes. The torusblk binary is compatible with the FlexVolume spec, so we can mount Torus storage directly into pods.

Demo on a Kubernetes cluster with etherpad-lite

This test takes place on a Kubernetes 1.3 cluster with 3 CoreOS nodes. We are going to deploy Torus as a container, directly on Kubernetes. To work properly, Torus needs:

  • NBD kernel module loaded
  • The torusblk binary in the kubelet plugin directory
  • Etcd v3
  • Free storage on the nodes

Plugin installation

Torus binaries are available here.

First let's check Kubelet's config:

# /etc/systemd/system/kubelet.service
[Service]
ExecStartPre=/usr/bin/mkdir -p /etc/kubernetes/manifests

Environment=KUBELET_VERSION=v1.3.0_coreos.0
ExecStart=/usr/lib/coreos/kubelet-wrapper \
  --api-servers=http://127.0.0.1:8080 \
  --network-plugin-dir=/etc/kubernetes/cni/net.d \
  --network-plugin= \
  --register-schedulable=false \
  --allow-privileged=true \
  --config=/etc/kubernetes/manifests \
  --hostname-override=192.168.122.110 \
  --cluster-dns=10.3.0.10 \
  --cluster-domain=cluster.local \
  --volume-plugin-dir="/etc/kubernetes/kubelet-plugins/volume/exec/"
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target

We need to add --volume-plugin-dir if it is not already present. The default is --volume-plugin-dir="/usr/libexec/kubernetes/kubelet-plugins/volume/exec/", but /usr is read-only on CoreOS.
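
A quick way to double-check both points on a node, with nothing but standard tooling (a simple sketch, no Torus involved yet):

# confirm the kubelet flag is present (systemctl cat also shows drop-ins)
systemctl cat kubelet.service | grep -- --volume-plugin-dir

# confirm that /usr is indeed mounted read-only on CoreOS
mount | grep " /usr "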

Then we need to create the directory and drop the torusblk binary, renamed torus, into /etc/kubernetes/kubelet-plugins/volume/exec/coreos.com~torus/ on each node:

ansible -i ../ansible-coreos/inventory "worker*" -m file -b -a "dest=/etc/kubernetes/kubelet-plugins/volume/exec/coreos.com~torus mode=700 state=directory"
ansible -i ../ansible-coreos/inventory "worker*" -m copy -b -a "src=/home/klefevre/go/src/github.com/coreos/torus/bin/torusblk dest=/etc/kubernetes/kubelet-plugins/volume/exec/coreos.com~torus/torus mode=700"

Torus also needs the nbd kernel module. It can be loaded manually:

ansible -i ../ansible-coreos/inventory -m shell -b -a "modprobe nbd" "worker*"

Or with cloud-init:

#cloud-config
write_files:
  - path: /etc/modules-load.d/nbd.conf
    content: nbd
coreos:
  units:
    - name: systemd-modules-load.service
      command: restart
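
Either way, it is worth confirming that the module is actually loaded on every worker before going further, for example by reusing the same Ansible inventory:

ansible -i ../ansible-coreos/inventory "worker*" -m shell -b -a "lsmod | grep nbd"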

Finally, we need to restart the kubelet service on each node and we are done with the prerequisites. The remaining steps take place on Kubernetes:

ansible -i inventory -m shell -b -a "systemctl restart kubelet" "worker*"

Etcdv3 deployment

To work properly, Torus needs the latest etcd v3. It can be deployed on Kubernetes and published with a service:

# etcdv3-daemonset.yml
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: etcdv3
  name: etcdv3
spec:
  type: NodePort
  clusterIP: 10.3.0.100
  ports:
    - port: 2379
      name: etcdv3-client
      targetPort: etcdv3-client
  selector:
    daemon: etcdv3
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: etcdv3
  labels:
    app: etcdv3
spec:
  template:
    metadata:
      name: etcdv3
      labels:
        daemon: etcdv3
    spec:
      containers:
      - name: etcdv3
        image: quay.io/coreos/etcd:latest
        imagePullPolicy: Always
        ports:
        - name: etcdv3-peers
          containerPort: 2380
        - name: etcdv3-client
          containerPort: 2379
        volumeMounts:
        - name: etcdv3-data
          mountPath: /var/lib/etcd
        - name: ca-certificates
          mountPath: /etc/ssl/certs
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: ETCD_DATA_DIR
          value: /var/lib/etcd
        - name: ETCD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ETCD_INITIAL_ADVERTISE_PEER_URLS
          value: http://$(POD_IP):2380
        - name: ETCD_ADVERTISE_CLIENT_URLS
          value: http://$(POD_IP):2379
        - name: ETCD_LISTEN_CLIENT_URLS
          value: http://0.0.0.0:2379
        - name: ETCD_LISTEN_PEER_URLS
          value: http://$(POD_IP):2380
        - name: ETCD_DISCOVERY
          value: https://discovery.etcd.io/         #-> Get discovery URL before launching
      volumes:
      - name: etcdv3-data
        hostPath:
          path: /srv/etcdv3
      - name: ca-certificates
        hostPath:
          path: /usr/share/ca-certificates/
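
The ETCD_DISCOVERY value is intentionally left incomplete: generate a fresh discovery URL for a 3-node cluster and paste it into the manifest before creating it, for example with:

curl -w "\n" "https://discovery.etcd.io/new?size=3"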

Using a DaemonSet ensures that there is one etcd instance per node. The etcd cluster state can be checked with etcdctl and the Kubernetes service IP:

kubectl create -f etcdv3-daemonset.yml

kubectl get pods --selector="daemon=etcdv3"
NAME           READY     STATUS    RESTARTS   AGE
etcdv3-4jncl   1/1       Running   1          1d
etcdv3-7r46s   1/1       Running   1          1d
etcdv3-o0n86   1/1       Running   1          1d
core@coreos00 ~ $ etcdctl --endpoints=http://10.3.0.100:2379 cluster-health
member 3f87be33d946732d is healthy: got healthy result from http://10.2.78.2:2379
member a19d841707579e60 is healthy: got healthy result from http://10.2.36.3:2379
member d5479de5c3342460 is healthy: got healthy result from http://10.2.55.3:2379
cluster is healthy

core@coreos00 ~ $ etcdctl --endpoints=http://10.3.0.100:2379 member list
3f87be33d946732d: name=etcdv3-7r46s peerURLs=http://10.2.78.2:2380 clientURLs=http://10.2.78.2:2379 isLeader=true
a19d841707579e60: name=etcdv3-4jncl peerURLs=http://10.2.36.3:2380 clientURLs=http://10.2.36.3:2379 isLeader=false
d5479de5c3342460: name=etcdv3-o0n86 peerURLs=http://10.2.55.2:2380 clientURLs=http://10.2.55.3:2379 isLeader=false
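
Note that cluster-health and member list above go through the etcd v2 API, while Torus uses the v3 API. One way to make sure the v3 API answers as well is to use the etcdctl binary shipped inside one of the etcd pods (pod name taken from the listing above):

kubectl exec etcdv3-4jncl -- env ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 member list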

Torus deployment

Torus is also deployed on top of Kubernetes, again with a DaemonSet, so there is one Torus instance on each node. Each instance uses the node's local storage (through a host volume) to populate the Torus storage pool.

# torus-daemonset.yml
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: torus
  labels:
    app: torus
spec:
  template:
    metadata:
      name: torus
      labels:
        daemon: torus
    spec:
      containers:
      - name: torus
        image: quay.io/coreos/torus:latest
        imagePullPolicy: Always
        ports:
        - name: torus-peer
          containerPort: 40000
        - name: torus-http
          containerPort: 4321
        env:
        - name: ETCD_HOST
          value: 10.3.0.100                         #-> Etcd service clusterIP
        - name: STORAGE_SIZE
          value: 5GiB                               #-> Storage pool maximum size
        - name: LISTEN_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: AUTO_JOIN
          value: "1"
        - name: DEBUG_INIT
          value: "1"
        - name: DROP_MOUNT_BIN
          value: "0"
        volumeMounts:
          - name: torus-data
            mountPath: /data
      volumes:
        - name: torus-data
          hostPath:
            path: /srv/torus                        #-> HostPath with torus block device

kubectl create -f torus-daemonset.yml

kubectl get pods --selector="daemon=torus"
NAME          READY     STATUS    RESTARTS   AGE
torus-56p79   1/1       Running   0          1m
torus-87f9t   1/1       Running   0          1m
torus-wc54r   1/1       Running   0          1m
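
With AUTO_JOIN set, each daemon should register itself against etcd on startup; a quick look at one pod's logs (pod name taken from the listing above) helps catch configuration problems early:

kubectl logs torus-56p79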

On a machine running kubectl, we can control the Torus cluster with the binaries we saw previously. To do so, we use the port-forward feature of kubectl to access an etcd pod:

kubectl port-forward etcdv3-4jncl 2379

That makes it possible to control the Torus cluster locally:

torusctl list-peers
Handling connection for 2379
ADDRESS                 UUID                                  SIZE     USED  MEMBER  UPDATED        REB/REP DATA
http://10.2.55.2:40000  dce8cf72-45f5-11e6-9426-02420a023702  5.0 GiB  0 B   OK      4 seconds ago  0 B/sec
http://10.2.78.3:40000  dd53432a-45f5-11e6-8fec-02420a024e03  5.0 GiB  0 B   OK      4 seconds ago  0 B/sec
http://10.2.36.4:40000  dd800d8c-45f5-11e6-b812-02420a022404  5.0 GiB  0 B   OK      2 seconds ago  0 B/sec
Balanced: true Usage:  0.00%

So in our case, we have 3 instances, each with a 5 GiB pool, as specified in the Torus manifest. Let's create a 1 GiB volume for our etherpad app:

torusctl volume create-block pad 1GiB
torusctl volume list
Handling connection for 2379
VOLUME NAME  SIZE     TYPE
pad          1.0 GiB  block

The volume pad is now available as a Kubernetes volume.

Etherpad deployment

Like the others, etherpad is deployed on top of Kubernetes with a deployment and a service:

# deployment-etherpad.yml
---
apiVersion: v1
kind: Service
metadata:
    name: etherpad
    labels:
        app: etherpad
spec:
    selector:
        app: etherpad
    ports:
        - port: 80
          targetPort: etherpad-port
    type: NodePort
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: etherpad
  labels:
    app: etherpad
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: etherpad
    spec:
      containers:
      - name: etherpad
        image: "osones/etherpad:alpine"
        ports:
        - name: etherpad-port
          containerPort: 9001
        volumeMounts:
        - name: etherpad
          mountPath: /etherpad/var
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
      volumes:
        - name: etherpad
          flexVolume:
            driver: "coreos.com/torus"              #-> FlexVolume driver
            fsType: "ext4"
            options:
              volume: "pad"                         #-> Name of the previously created Torus volume
              etcd: "10.3.0.100:2379"               #-> Etcd service clusterIP

To check the deployment:

kubectl create -f ymlfiles/deployment-etherpad.yml
deployment "etherpad" created
service "etherpad" created

kubectl describe service etherpad
Name:                   etherpad
Namespace:              default
Labels:                 app=etherpad
Selector:               app=etherpad
Type:                   NodePort
IP:                     10.3.0.157
Port:                   <unset> 80/TCP
NodePort:               <unset> 31594/TCP
Endpoints:              10.2.55.4:9001
Session Affinity:       None
No events.

To simplify the demo, the etherpad service is published as a NodePort, which means we can reach the service on any node at port 31594.
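
A quick way to confirm this from any machine that can reach the nodes (node IP taken from the cluster above):

curl -s -o /dev/null -w "%{http_code}\n" http://192.168.122.110:31594/
# any HTTP status code printed (typically 200) means the NodePort routes to the etherpad pod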

To verify data persistence, we are going to create a pad named "Torus":

[Screenshot: the "Torus" pad created on the first etherpad pod]

To test, we are going to mark a node as unschedulable (cordon in Kubernetes parlance), then destroy the etherpad pod, which will automatically be rescheduled on another node:

kubectl describe pods etherpad-2266423034-6wk4u
Name:           etherpad-2266423034-6wk4u
Namespace:      default
Node:           192.168.122.111/192.168.122.111
Start Time:     Sat, 09 Jul 2016 19:11:23 +0200
Labels:         app=etherpad
                pod-template-hash=2266423034
Status:         Running
IP:             10.2.55.4
Controllers:    ReplicaSet/etherpad-2266423034
...

kubectl cordon 192.168.122.111
node "192.168.122.111" cordoned

kubectl get nodes
NAME              STATUS                     AGE
192.168.122.110   Ready                      4d
192.168.122.111   Ready,SchedulingDisabled   4d
192.168.122.112   Ready                      4d

Scheduling is disabled on node 192.168.122.111. Let's destroy the etherpad pod:

kubectl get pods --selector="app=etherpad"
NAME                        READY     STATUS    RESTARTS   AGE
etherpad-2266423034-6wk4u   1/1       Running   0          14m

kubectl delete pods etherpad-2266423034-6wk4u
pod "etherpad-2266423034-6wk4u" deleted

Check the new pod :

kubectl get pods --selector="app=etherpad"
NAME                        READY     STATUS    RESTARTS   AGE
etherpad-2266423034-urb2f   1/1       Running   0          1m

kubectl describe pods --selector="app=etherpad"
Name:           etherpad-2266423034-urb2f
Namespace:      default
Node:           192.168.122.112/192.168.122.112
Start Time:     Sat, 09 Jul 2016 19:25:46 +0200
Labels:         app=etherpad
                pod-template-hash=2266423034
Status:         Running
IP:             10.2.36.5
Controllers:    ReplicaSet/etherpad-2266423034

[Screenshot: the "Torus" pad, with its content intact, served by the rescheduled etherpad pod]

OK, these are just screen captures, so I invite you to trust me on this one, or test it yourself :)
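
For a check that does not rely on screenshots, etherpad-lite also exposes an export endpoint (assuming the default etherpad-lite routes); fetching the pad as plain text before and after the rescheduling should return the same content:

curl http://192.168.122.110:31594/p/Torus/export/txt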

Conclusion

Torus is a young product that fits right into the philosophy behind other OSS products launched by CoreOS.

Even if block storage is of little interest on a cloud provider supported by Kubernetes (Kubernetes already supports PD on GCE, EBS on AWS, Cinder on OpenStack), if you are running on-premises or on an unsupported cloud provider, Torus can be used to aggregate the storage of multiple instances at no extra cost.

Finally, compared to GlusterFS and/or NFS, Torus is not at the same level yet, as only block storage is available for now.

Let's see how the product evolves, and whether it meets the same kind of enthusiasm as other CoreOS products.

Kevin Lefevre
