Most of the project's infrastructure complexity is deliberately concentrated in the master nodes, so that repairing any other component is as simple as possible. The trade-off is that the masters run a large number of moving parts.
Redundancy - Corosync/Pacemaker
In order to provide continuous service in the face of system error, two masters are provided that can fail over for each other in case of disaster. The main engine in charge of ensuring that all processes are running smoothly is Pacemaker, built on top of Corosync as the node communication layer.
Corosync is configured to use a redundant ring protocol, so that both the backend connection and the frontend connection can be used to verify the liveness of each node. This reduces the risk of split-brain scenarios, since a failure of a single nearby switch can be correctly detected rather than mistaken for a dead peer.
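A corosync.conf totem section along these lines would express the two rings; this is an illustrative sketch only (the `rrp_mode`, ports, and which subnet is which ring are assumptions, with `bindnetaddr` values taken from the two machine subnets):

```
totem {
    version: 2
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.0.0
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.1.0
        mcastport: 5407
    }
}
```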
Pacemaker controls all the subsequent tasks that are significant for the masters to perform. It ensures that these tasks are running and, if any of them fails for any reason, transfers control of them to the other master. This is done by storing each task's configuration, along with the prerequisites it requires, in the Cluster Information Base (CIB).
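As a hedged sketch of how a task and its prerequisites end up in the CIB (the resource name, the systemd unit name, and the monitor interval are all assumptions, not taken from the real cluster), registration via `pcs` looks like:

```
# Register a daemon as a cluster resource; Pacemaker then monitors it
# and restarts or migrates it on failure.
pcs resource create tftp-server systemd:tftpd-hpa op monitor interval=30s
```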
Redundant Network - failover IP
The main indicator of being the active master is holding the well-known IP on each subnet that marks you as in charge - 192.168.0.254 and 192.168.1.254. These are floating IPs, configured so that the active master takes them over.
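Floating IPs like these are typically expressed as `IPaddr2` resources in Pacemaker; the following is a sketch under that assumption (the resource names and netmasks are invented):

```
pcs resource create vip-frontend ocf:heartbeat:IPaddr2 \
    ip=192.168.0.254 cidr_netmask=24 op monitor interval=10s
pcs resource create vip-backend ocf:heartbeat:IPaddr2 \
    ip=192.168.1.254 cidr_netmask=24 op monitor interval=10s
```

Colocation and ordering constraints can then tie the other master services to whichever node currently holds these addresses.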
Due to the slight mismatch between modern NetworkManager and the older networking scripts, a helper cron job runs once a minute to synchronise these states.
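The cron entry itself is straightforward; this is a sketch only, as both the file name and the helper script's path are hypothetical:

```
# /etc/cron.d/network-sync (names are illustrative)
* * * * *  root  /usr/local/sbin/sync-network-state
```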
Camera/Compute node configuration - DHCPD
The masters are in charge of configuring the IP addresses of the compute nodes and cameras, and they do this via DHCP, using the Internet Systems Consortium's isc-dhcp-server. This daemon is in turn configured via /etc/dhcpd.conf, which is generated from a template filled in with data from the central hardware database running in Docker.
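The generated file would contain entries along these lines; the MAC addresses, hostnames, and subnet layout below are invented for illustration (real entries come from the hardware database):

```
subnet 192.168.1.0 netmask 255.255.255.0 {
    option routers 192.168.1.254;
    next-server 192.168.1.254;        # TFTP server for PXE boot
    filename "grub/bootx64.efi";
}
host compute01 {
    hardware ethernet 00:11:22:33:44:55;
    fixed-address 192.168.1.101;
}
```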
A cron job compares this configuration once a minute: if the installed dhcpd.conf differs from what the database now generates, the service is restarted automatically to pick up new camera/compute serial numbers.
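The compare-and-restart logic can be sketched as follows. The file contents here are stand-ins (the real job renders the template from the hardware database, and the restart line is shown as a comment rather than executed):

```shell
#!/bin/sh
# Stand-ins for the installed config and the freshly rendered one.
CUR=$(mktemp); printf 'old config\n' > "$CUR"
NEW=$(mktemp); printf '# rendered from the hardware database\n' > "$NEW"

if cmp -s "$NEW" "$CUR"; then
    action="no change"
else
    # The real job would do something like:
    #   mv "$NEW" "$CUR" && systemctl restart isc-dhcp-server
    action="restart dhcpd"
fi
echo "$action"
rm -f "$NEW" "$CUR"
```

Comparing before restarting matters here: restarting dhcpd unconditionally every minute would briefly interrupt lease handling for no benefit.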
An additional aspect of this is that it allows for secondary masters to be defined, and booted via PXE/DHCP into a preparatory state. Further details are below.
Compute node execution - TFTP/NFS
To reduce the amount of external complexity, and to ensure a clean working environment for the compute nodes, these devices are configured to boot over the network via PXE. Once PXE has handed over to GRUB, the kernel and initramfs are transferred via the Trivial File Transfer Protocol (TFTP), the standard mechanism for netbooting. The TFTP daemon is also controlled on the masters via Pacemaker.
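A GRUB netboot entry served over TFTP would look roughly like this; the file paths, NFS export location, and server address are assumptions:

```
menuentry "Compute node" {
    linux  /vmlinuz root=/dev/nfs nfsroot=192.168.1.254:/srv/nfsroot ip=dhcp ro
    initrd /initrd.img
}
```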
The next stage of booting is to gain a root filesystem. This is hosted from the masters via NFS, exported read-only, ensuring that the compute nodes get the same configuration on every boot. Once the NFS root is mounted, however, the network configuration of a compute node can no longer be changed without disrupting it. The bonding of the four gigabit Ethernet ports is therefore performed before the mount, by a script run as part of the initramfs initialisation process.
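The initramfs bonding step amounts to a command sequence like the following sketch, in which the interface names and bond mode are assumptions rather than the machine's actual configuration:

```shell
modprobe bonding
ip link add bond0 type bond mode 802.3ad
for nic in eth0 eth1 eth2 eth3; do
    ip link set "$nic" down             # interfaces must be down to enslave
    ip link set "$nic" master bond0
done
ip link set bond0 up
# DHCP and the NFS root mount then proceed over bond0.
```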
To allow the compute node to perform some local file modifications, such as logging or temporary debugging, an overlayfs is constructed as part of the pivot to the new root. The visible root filesystem is thus read-write, whilst the actual underlying filesystem remains read-only, preventing potential data corruption.
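The overlay construction can be sketched as below; the mount points are assumptions, with the writable upper layer held on a tmpfs so local changes vanish on reboot:

```shell
mount -t tmpfs tmpfs /rw
mkdir -p /rw/upper /rw/work
mount -t overlay overlay \
    -o lowerdir=/ro-nfs,upperdir=/rw/upper,workdir=/rw/work /new-root
# pivot_root / switch_root into /new-root then continues the boot.
```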
The NFS server daemon is controlled from the masters via Pacemaker.
Redundant filesystem - DRBD
Another aspect of the redundant masters is that they share a replicated filesystem. This is enabled via the Distributed Replicated Block Device (DRBD), which runs at the kernel level, intercepting all block writes and replicating them over the backend network to the other master, keeping the two in sync at all times. The device is configured in a dual-primary state so that it remains instantly available should there be a hardware failure; its mounting is controlled by Pacemaker to prevent corruption.
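A DRBD resource definition for this setup would look roughly as follows; the hostnames, disk devices, and backend addresses are invented for illustration:

```
resource shared {
    net {
        protocol C;                  # synchronous replication
        allow-two-primaries yes;     # dual-primary, as described above
    }
    on master-a {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.0.1:7789;
        meta-disk internal;
    }
    on master-b {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.0.2:7789;
        meta-disk internal;
    }
}
```

Protocol C only acknowledges a write once it has reached both nodes, which is what makes the "in sync at all times" guarantee hold.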
There are three elements to this redundant filesystem - the client filesystem, the Docker files, and the logging and configuration data for the compute nodes - each on its own DRBD target.
Docker
Docker is used to host the database and frontend elements - it is here that the Panel PC connects to display the in-cabin visuals for the machine. Docker is configured to use one of the DRBD mounts, so that if there is a failure anywhere in the system, these containers can immediately be started on the other master.
The actual frontend consists of three individual containers, set up via docker-compose from locally-built Dockerfiles. A cron job runs once a minute, attempting to pull any updates to the frontend git repository and, if there are any, updating the running containers using docker-compose.
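That update check can be sketched as the following script; the repository path is an assumption:

```shell
cd /srv/frontend || exit 1
git fetch -q origin
# Only rebuild when the upstream branch has actually moved.
if [ "$(git rev-parse HEAD)" != "$(git rev-parse '@{u}')" ]; then
    git pull -q
    docker-compose up -d --build
fi
```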
Additional Functions - DNS
An internal DNS server is provided as part of the master services, to allow for a more intuitive control scheme. This service is controlled via Pacemaker, and is statically built, defining a new internal Top-Level Domain (TLD) of .broc for all known IPs inside the machine config.
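A statically built zone for such a TLD takes roughly this shape; the record names, serial, and timers below are invented (the real records are generated from the machine config):

```
; zone fragment for the internal .broc TLD
$TTL 300
@          IN SOA ns.broc. hostmaster.broc. ( 1 3600 600 86400 300 )
           IN NS  ns.broc.
ns         IN A   192.168.0.254
compute01  IN A   192.168.1.101
```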
Secondary master installation
Once a new secondary master has been entered into the database and the dhcpd config updated, booting it onto the network causes it to automatically fetch an SSH key from the current master and then SSH back into it as the user master-builder, opening a socket file for itself there. This exposes a shell on the new machine, whose base filesystem is the initramfs held in memory. A second script can then be run, which installs the secondary master via that socket file, using the key to log in as root on the current master and clone the entire filesystem.
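A plausible shape for this socket mechanism is OpenSSH's reverse UNIX-socket forwarding; the commands below are a sketch under that assumption, with the socket path, hostnames, and the use of socat all hypothetical:

```shell
# On the newly booted secondary: forward its own sshd to a socket file
# on the current master, as user master-builder.
ssh -N -R /home/master-builder/secondary.sock:localhost:22 \
    master-builder@192.168.0.254

# Later, on the current master, the install script reaches the secondary
# through that socket:
ssh -o ProxyCommand='socat - UNIX-CONNECT:/home/master-builder/secondary.sock' \
    root@secondary
```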