
Infrastructure

Introduction

Welcome to my infrastructure repository. The goal of this repo is to document and make freely available the configuration and deployment tooling used to support most of my personal computer infrastructure, from networking to servers and applications.

This is my journey of constant experimentation in learning new tools and technologies to manage the lifecycle of infrastructure and applications in a way that is as low-maintenance as possible while letting me iterate quickly and safely. The landscape is constantly evolving and I try to stay up to date with best practices as I see fit.

The Landscape

I use a few different tools and methodologies to achieve the goals outlined above; here are some important ones.

🔧 Tools

| Tool | Description |
| --- | --- |
| Nix | Declarative package manager providing the development environment for this repo. |
| NixOS | Declarative Nix-based Linux distribution I run kubernetes and other services on. |
| VyOS | Declaratively-configured network OS for all of my routing/firewalling needs. |
| TrueNAS Core | Rock-solid storage operating system for my NASes, powered by OpenZFS. |
| Terraform | Infrastructure-as-code tool I use to manage all of my cloud resources. |
| MinIO | S3-compatible object storage running on TrueNAS, used to store my Terraform state files. |
| Kubernetes | Production-grade container orchestrator I use to run most of my services. |
| Argo Project | Open-source tools for Kubernetes I use to run workflows, manage my cluster, and do GitOps right. |
| GitLab | My self-hosted Git collaboration software of choice, deployed by and hosting this repository. |
| SOPS | Encrypts sensitive information in this repository at rest and decrypts it at runtime. |

🧠 Methodologies

| Methodology | Description |
| --- | --- |
| GitOps | Versioned, immutable, declarative configuration that is automatically applied and continuously reconciled. |

🖥️ Inventory

Listing of relevant hardware that the configuration in this repository is applied to.

| Hardware | Hostname | Purpose | Specifications |
| --- | --- | --- | --- |
| HP T730 | 235-gw | Internet Gateway | 8 GB RAM, 2x 1GbE |
| HP T620 | 260-gw | Internet Gateway | 4 GB RAM, 4x 1GbE |
| HP T730 | 305-1700-gw | Internet Gateway | 8 GB RAM, 2x 1GbE |
| HPE SL250S G8 | soarin | NixOS Host #1 | 2x E5-2670, 128 GB RAM |
| HPE DL380p G8 | sassaflash | NixOS Host #2 | 2x E5-2670, 128 GB RAM |
| HPE DL380p G8 | stormfeather | NixOS Host #3 | 2x E5-2670, 128 GB RAM |
| Supermicro X9 | spitfire | NAS #1 | 2x E5-2620, 128 GB RAM, 132 TB ZFS |
| Supermicro X9 | firestreak | NAS #2 | 2x E5-2620, 64 GB RAM, 24 TB ZFS |
| Brocade ICX-6610 | Router | Switch/Router | 2x 40GbE, 16x 10GbE, 48x 1GbE |
| Hetzner EX-52 | bedrock | Cloud Dedicated Server | i7-8700, 128 GB RAM |
| OVH VPS | stone | Cloud VPS | 1 vCore, 2 GB RAM |

🌐 Network Topology

My network is quite vast and spans many physical sites; the network diagram below should give you a good visual feel for it. All of my internet gateways are re-purposed thin clients running the purpose-built VyOS Linux distribution. My entire network is dual-stacked, running native IPv4 and native IPv6 where possible with fallback to tunnel brokers. I interconnect my physical sites using WireGuard tunnels and run OSPFv2/3 over those tunnels to propagate routing tables. My core Brocade switch connects my kubernetes cluster to the internet via BGP, using the Calico CNI on the kubernetes side to advertise all of the relevant subnets living in the cluster. Everything on the kubernetes side is pure layer-3 networking; there are no overlays or tunnels between nodes. Finally, my network is divided into several VLANs to isolate different types of traffic, such as home traffic, lab traffic, storage traffic, kubernetes traffic and OOB management traffic.

I am currently in the process of obtaining a personal ASN as well as my own PA and possibly PI IPv6 address space, as part of a project to learn more about BGP and, more importantly, to avoid relying on unstable residential ISP-supplied PA prefixes. This will probably involve running BGP sessions on a VPS from a cloud provider that allows BGP peering and tunnelling that home, unless I can convince my residential ISPs to let me establish a BGP session with them directly, though that is more likely a pipe dream.

🗺️ Diagrams

Figure 1. Example diagram; will change

❄️ Nix

To be complete, this repo needs to be as self-contained as possible. What I mean by this is that to manipulate or apply any kind of configuration in this repo, you shouldn’t need anything on your system other than git to clone it and the Nix package manager. The repo contains a flake at its root, flake.nix, which defines apps and development shells that are immediately accessible via Nix and don’t perform or require any alteration to the state of the system the repo is cloned onto. The development shell makes available to the user all of the CLI tools required to interact with the repository. It also decrypts and makes available in the environment all of the relevant secrets required to authenticate with the infrastructure, provided the user can supply private key material to decrypt them.

To enter the development shell, either have direnv configured for your shell and run direnv allow in the root of the project, or else run nix develop. You may need to enter your PIN and touch your YubiKey to decrypt secrets during the shell hook execution.

The Repository itself

📂 Repository structure

This listing intentionally doesn’t cover the entire structure of the repo; it is meant to be a high-level overview of where various components are located.

./
├── dns/            # Terraform files for misc. dns zones and records
├── k8s-235/        # All files related to my kubernetes cluster
│  └── apps/        # Applications deployed on k8s using ArgoCD and Kustomize
├── nixos/          # All Nix modules and configurations for my NixOS hosts including kubernetes
├── secrets/        # Secrets encrypted with SOPS
└── flake.nix       # The Nix flake containing all tools and development shells to work with the contents of this repo

🔐 SOPS

At a glance, this repository appears to be full of secrets such as API keys and other values that shouldn’t be publicly available. That is partially correct. This repository contains a variety of sensitive values, but none of them are committed in plain text. Instead, they are encrypted with SOPS, which protects them from prying eyes at rest. However, with the right decryption keys, these secrets can be decrypted at runtime. Activating this project’s nix shell using nix develop or via direnv will automatically decrypt the secrets and place them in the appropriate environment variables to be used by tools and config in the repo.

The .sops.yaml file at the root of the repo defines creation rules for secrets to be encrypted with sops. Any files matching the defined creation rule paths will be encrypted with the PGP fingerprints specified. The user encrypting new files must have all of the PGP public keys in their keyring in order to encrypt the secret.
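As a rough illustration, a creation rule in .sops.yaml looks like the following sketch; the path regex and PGP fingerprints are placeholders, not the actual values used in this repository.

```yaml
# Illustrative .sops.yaml sketch; the regex and fingerprints
# below are placeholders, not this repo's real values.
creation_rules:
  - path_regex: secrets/.*\.yaml$
    pgp: >-
      0000000000000000000000000000000000000001,
      0000000000000000000000000000000000000002
```

Any file whose path matches path_regex is encrypted to every fingerprint listed, so all of those public keys must be present in the local keyring.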

Encrypting with SOPS

To encrypt a new file or edit an existing one in $EDITOR, run sops <filename>. To encrypt a file in place, run sops -i -e <filename>.

New files must match the creation rules in .sops.yaml or sops will refuse to create them.

Re-keying with SOPS

Occasionally, we might want to add a collaborator to the project, or a new tool that needs to be able to decrypt secrets. In that case, it will be necessary to re-key all existing files to encrypt them with the new public key after adding it to .sops.yaml. The following snippet can be used for this purpose.

Re-key snippet
#!/usr/bin/env bash
# Re-encrypt every sops-managed secret with the keys currently listed in .sops.yaml
find . -path '*/secrets/*.yaml' -print0 | xargs -0 -I{} sops updatekeys -y {}

DNS

Overview

I try to keep all configuration related to the deployment of a particular "application" or "unit" together. This is why, for example, many deployments in the k8s-235/apps folder have their own terraform folders for DNS and other cloud configuration. For underlying infrastructure this is less obvious, so the dns folder is a collection of DNS zones and records that don’t exactly fit with a particular project and are all applied as one collection of terraform.

Zones

The following zones, including their DNSSEC configuration, are managed by this collection:

  • cgbpi.com

  • nerdsin.space

  • shitsta.in

  • tdude.co

  • trs.tn

  • as208914.net

☁️ Making changes

These DNS zones are managed by Cloudflare and changes are performed with Terraform using the official Cloudflare provider.

Procedure to apply DNS changes

TODO: document

  1. Create the state bucket in MinIO
  2. terraform init
  3. terraform plan
  4. terraform apply

235 Kubernetes cluster

Overview

The 235 kubernetes cluster (referred to as k8s-235) is my only personal kubernetes cluster at the moment. It consists of 3 consolidated master, worker and etcd nodes; etcd is the backing data store for kubernetes. The cluster is deployed on 3 physical (bare-metal) servers running NixOS, and these hosts are provisioned by the Nix expressions and modules located in the nixos folder at the root of the repo. The kubernetes cluster is deployed using a combination of the kubernetes module included in upstream nixpkgs and a custom module that I wrote specific to my infrastructure. Most importantly, the custom module handles end-to-end mutual TLS certificates for all cluster components using Hashicorp Vault and vault-agent.

Provisioning

The physical servers providing compute for the kubernetes worker and master nodes are provisioned by Nix. The flake in this repo contains NixOS configurations for the physical servers soarin, sassaflash and stormfeather.

Procedure to install a new physical NixOS server

TODO: document

Procedure to initialize, backup and restore the etcd database

TODO: document

Components

The core components of the kubernetes cluster include the control-plane itself, etcd, the Cilium CNI networking plugin and the democratic-csi CSI storage plugin. Cilium is installed at cluster provisioning time using a manual invocation and democratic-csi is later installed by ArgoCD. Cilium can then be further upgraded by ArgoCD.

🐝 Cilium

I used to use the Calico CNI but recently migrated to Cilium. The semantics are mostly the same when it comes to the BGP peering architecture; I hope to post updated documentation soon.

🐙 ArgoCD

ArgoCD is a central part of my cluster. Once the cluster’s initial control-plane is deployed, the only application that must be installed manually into the cluster is ArgoCD. Once installed, it takes over the lifecycle of all of the applications in this repository to be deployed into the cluster. ArgoCD deploys everything in the k8s/apps folder, including itself. A commit webhook in GitLab causes ArgoCD to sync the repository on every commit. Authentication to ArgoCD is done through GitLab’s SSO.

To deploy ArgoCD, go into the k8s/apps/argocd directory and run kustomize build --enable-alpha-plugins overlays/prod | kubectl apply -f -. ArgoCD will then become available at https://gitops.tdude.co.
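For illustration, an ArgoCD Application that hands one of these deployments over to GitOps might look roughly like the sketch below; the repository URL, paths and names are placeholders, not the actual values used here.

```yaml
# Illustrative ArgoCD Application sketch; repoURL, path and
# namespace are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/user/infrastructure.git
    targetRevision: HEAD
    path: k8s/apps/example-app/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert manual drift in the cluster
```

With automated sync enabled, ArgoCD continuously reconciles the cluster against the repository, which is the GitOps loop described above.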

Kustomize

Kustomize touts itself as a template-free solution for managing application deployments on kubernetes, in stark contrast to the popular Helm package manager. All of the manifests in k8s/apps are assembled with kustomize before being applied: ArgoCD runs kustomize to build the manifests, then applies them to the cluster.
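A minimal kustomization.yaml for one of these apps might look like the following sketch; the resource file names are illustrative placeholders.

```yaml
# Illustrative kustomization.yaml sketch; file names are placeholders.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: example-app
resources:
  - deployment.yaml
  - service.yaml
  - ingress.yaml
```

Overlays (such as overlays/prod) then reference a base like this one and patch only what differs per environment, which is what keeps the setup template-free.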

SOPS

ArgoCD is capable of decrypting secrets in this repo through the kustomize-sops plugin for kustomize. To do this, an initContainer installs the plugin into ArgoCD’s Pod and also imports a GPG private key that ArgoCD uses to decrypt secrets. All sops-encrypted secrets you expect ArgoCD to be able to decrypt must be encrypted with ArgoCD’s public key.

Procedure to generate a GPG key for use with ArgoCD

Procedure goes here…​

🔥 Prometheus

Cluster-wide metrics monitoring in my cluster is performed by prometheus. More specifically, I use ArgoCD to deploy my personal configuration of the kube-prometheus project. Like everything else, it is deployed by ArgoCD and configured to scrape ServiceMonitors and PodMonitors in all namespaces, along with cluster-wide internal metrics. The web interface for prometheus is reachable at https://prometheus.monitoring.tdude.co and the one for Grafana at https://monitoring.tdude.co. Grafana is configured for SSO authentication with GitLab.
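Since the stack scrapes ServiceMonitors in all namespaces, exposing an application's metrics is a matter of adding one such object; a rough sketch follows, where the names, label selector and port name are illustrative placeholders.

```yaml
# Illustrative ServiceMonitor sketch; names, labels and the
# port name are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: example-app
spec:
  selector:
    matchLabels:
      app: example-app  # must match the target Service's labels
  endpoints:
    - port: metrics     # named port on the Service
      interval: 30s
```

Prometheus discovers the matching Service and begins scraping its metrics endpoint without any change to the Prometheus configuration itself.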

Additionally, kube-prometheus also deploys alertmanager and a handful of useful alerting rules. I have configured it to notify me on Discord and on Matrix when alerts are firing using alert-manager-discord and matrix-alertmanager. The web interface for alertmanager is available at https://alertmanager.tdude.co.

🛩️ Traefik

The idiomatic way to expose http services in kubernetes is via an ingress controller, a fancy word for a reverse proxy that does some service discovery. For this task, I use Traefik for no particular reason other than having used it in the past and having liked it.

Single Sign-On

Traefik supports a host of interesting features like its extensive middleware system. I make extensive use of the "Forward Auth" middleware to protect services I don’t want exposed to the public behind an OpenID Connect SSO login via oauth2-proxy. SSO protection can be added to any ingress by setting the traefik.ingress.kubernetes.io/router.middlewares annotation on it to traefik-forward-auth@kubernetescrd.
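On an Ingress, that might look like the following sketch; the host and backend service are placeholders, and the middleware reference follows Traefik's namespace-name@kubernetescrd convention for CRD-defined middlewares.

```yaml
# Illustrative Ingress protected by the forward-auth middleware;
# host and service names are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: traefik-forward-auth@kubernetescrd
spec:
  rules:
    - host: example.tdude.co
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
```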

TLS

Traefik is configured to accept TLS 1.2 and 1.3 connections only and sets Strict-Transport-Security headers unconditionally. While it is capable of obtaining TLS certificates on its own via ACME, I choose to disable that functionality and provision TLS certificates with cert-manager instead, because it is much more flexible. Setting the cert-manager.io/cluster-issuer annotation to "letsencrypt-prod" on any ingress will cause cert-manager to provision a certificate matching the domains on that ingress. Cert-manager is configured to obtain certificates from the Let’s Encrypt staging and production environments via the DNS-01 challenge using Cloudflare’s API.
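A sketch of an Ingress requesting a certificate this way follows; the host and secret name are illustrative placeholders, and the routing rules are omitted for brevity.

```yaml
# Illustrative Ingress requesting a certificate from cert-manager;
# host and secret name are placeholders, routing rules omitted.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - example.tdude.co
      secretName: example-app-tls  # cert-manager writes the cert here
```

Cert-manager solves the DNS-01 challenge through Cloudflare, stores the issued certificate in the named Secret, and Traefik serves it for the matching host.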

💾 Democratic-csi

No kubernetes cluster is complete without persistent storage for stateful applications. My hypervisors aren’t particularly interesting in terms of storage, but I do have a NAS with plenty of storage, so let’s use that! I use the democratic-csi CSI storage driver in my cluster to give my kubernetes nodes programmatic access to network storage on my NAS. This works by making calls to the TrueNAS HTTP API to provision new ZFS datasets or ZVOLs and exporting them over NFS or iSCSI depending on the type of share. The configuration in k8s/apps/democratic-csi defines 3 distinct StorageClasses for different purposes:

  • freenas-nfs-csi

  • freenas-iscsi-csi

  • truenas-nfs-spitfire-fast

freenas-nfs-csi provisions HDD-backed ZFS datasets to be shared over NFS. freenas-iscsi-csi provisions HDD-backed ZFS ZVOLs to be shared over iSCSI; this is usually required for anything that uses an SQLite database with the WAL enabled, since locking does not work over NFS. truenas-nfs-spitfire-fast provisions NVMe SSD-backed ZFS datasets to be shared over NFS.
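Requesting storage from one of these classes is just a standard PersistentVolumeClaim; a sketch follows, where the claim name and size are placeholders.

```yaml
# Illustrative PVC against one of the classes above; the claim
# name and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: freenas-iscsi-csi
  resources:
    requests:
      storage: 10Gi
```

When the claim is created, democratic-csi calls the TrueNAS API to carve out the backing ZVOL and exports it over iSCSI to the node that mounts the volume.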

Configuration

In order for democratic-csi to do its work, we need to supply it with a TrueNAS API key and an SSH private key to log into TrueNAS as root to run some ZFS commands (this might not be required in the future).

Procedure to add an SSH key to TrueNAS root

Procedure goes here…​

Procedure to generate and retrieve a TrueNAS API key

Procedure goes here…​