This repository defines a service that automatically transfers data from the Raspberry Pi (RPI) machines deployed at Ohio University to the research infrastructure at the University of Sheffield. It runs a task on a regular schedule that copies data from the remote machines and deletes old files via a secure shell (SSH) connection.
See also: issue #20.
An overview diagram of the flow of research data between the various machines (boxes) is shown below. The shaded areas represent different physical locations and networks. The arrows represent the direction in which data are moved.
```mermaid
---
title: Data flow
---
flowchart LR
    subgraph "Ohio University"
        raspberry1
        raspberry2
        raspberry3
    end
    raspberry1 --> ohiobeeproject
    raspberry2 --> ohiobeeproject
    raspberry3 --> ohiobeeproject
    subgraph "University of Sheffield"
        ohiobeeproject --> storage[("Storage")]
    end
```
This service runs regularly, iterating through the RPIs one at a time to copy the research data and prevent the storage on the remote devices from filling up. For each remote machine, the process works as follows:
- Connect to a remote machine via secure shell;
- Sync all the data files to a specified directory;
- Upon successful transfer, delete each data file from the remote machine.
The data transfer is implemented using the remote sync (rsync) tool, which compresses files in transit and removes each file from the remote machine once it has been transferred (rsync's `--remove-source-files` option). If a file is modified during transfer, rsync reports an error for that file and leaves it in place, so it will be transferred during a subsequent run.
Files are deleted only after a successful sync, to avoid accidentally deleting data that hasn't been transferred first. This is designed to prevent accidental data loss while preserving the limited storage space on the remote machines.
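A minimal sketch of this kind of rsync invocation (the paths and destination directory here are hypothetical; the actual command is defined in `scripts/copy-to-storage.sh`):

```sh
# Pull data from one RPI, compressing in transit, and remove each
# source file from the Pi once it has been transferred successfully.
rsync --archive --compress --remove-source-files \
    raspberry1:/home/pi/data/ /mnt/research-storage/raspberry1/
```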
The repository contains the following directories and files:
- `./systemd` contains the systemd units that define the service using systemd, the Ubuntu service management system.
  - The timer (`copy-to-storage.timer`) will run on a regular schedule and initiate the service (`copy-to-storage.service`), which runs a shell script that performs the data operations.
- `./scripts/copy-to-storage.sh` is a Bash shell script that iterates over the target machines and runs the data transfer and file deletion operations.
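For illustration, the timer/service pair is structured along these lines (a sketch only: the schedule and service type shown here are assumptions, so refer to the actual files in `./systemd`):

```ini
# copy-to-storage.timer (sketch)
[Timer]
OnCalendar=hourly     # assumed schedule; see ./systemd for the real one
Persistent=true

[Install]
WantedBy=timers.target

# copy-to-storage.service (sketch)
[Service]
Type=oneshot          # assumed; runs the script once per activation
User=ohiobeeprojectsvc
ExecStart=/opt/data-pipeline/copy-to-storage.sh
```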
There are also some useful Unix shell scripts for inspecting the remote machines.
Follow these steps to set up the data pipeline machine. These instructions only need to be followed once, when the pipeline is first set up. To add new Raspberry Pis, please see the relevant section below. Also, see `install.sh`. These steps assume that a recent Linux (Ubuntu) operating system is used.
First, install dependencies:
```sh
sudo apt install rsync
```

Create the necessary service user accounts with write permissions to the research storage area. The service runs under its own system user account, defined by the `User=` option in the `[Service]` section of `scripts/copy-to-storage.service`.
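For example, a dedicated system account could be created along these lines (a sketch; `ohiobeeprojectsvc` is the account name used later in these instructions, and the exact options depend on local policy):

```sh
# Create a system user with a home directory to hold its SSH keys.
sudo useradd --system --create-home --shell /bin/bash ohiobeeprojectsvc
```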
Set up the SSH keys (see the SSH configuration section below).
Clone this repository.
Install the systemd units:
```sh
sudo cp --verbose ./systemd/*.service /etc/systemd/system/
sudo cp --verbose ./systemd/*.timer /etc/systemd/system/
```

Reload the systemd units using `systemctl`:
```sh
sudo systemctl daemon-reload
```

Install the shell script:
```sh
sudo mkdir /opt/data-pipeline
sudo cp ./scripts/copy-to-storage.sh /opt/data-pipeline/copy-to-storage.sh
```

Enable the service. (This will not activate the service.)
```sh
sudo systemctl enable copy-to-storage
```

To activate the service, please read the usage instructions below.
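To check that the units were installed and the timer is recognised, you can list matching timers (a quick sanity check, not a required step):

```sh
systemctl list-timers "copy-to-storage*"
```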
The service uses a specific SSH configuration to enable the rsync command to establish a connection to the Raspberry Pis (RPIs) and transfer data into the University of Sheffield (UoS) campus network. The system connects to the target machines via a cloud machine that acts as a "jump" host: a third, intermediate machine.
The diagram below shows the different machines involved and how the SSH connections are set up. For more information, see issue #16. Each arrow represents an SSH connection; the thick arrows indicate remote port forwarding, which establishes a reverse tunnel in which a local port on one machine is bound to a persistent SSH connection from the other machine.
```mermaid
---
title: SSH remote port forwarding
---
flowchart TD
    subgraph AWS
        awsbox[iot.bugtrack.org.uk]
    end
    subgraph "University of Sheffield"
        ohiobeeproject --> awsbox
    end
    subgraph "Ohio University"
        raspberry1 == "Forwarding" ==> awsbox
        raspberry2 == "Forwarding" ==> awsbox
        raspberry3 == "Forwarding" ==> awsbox
    end
```
This means we can connect directly from the University of Sheffield (UoS) campus network onto the Ohio campus network using the Amazon Web Services (AWS) virtual machine as an intermediate jump host.
```mermaid
---
title: Secure shell connections
---
sequenceDiagram
    participant ohiobeeproject
    participant iot.bugtrack.org.uk
    participant raspberry1
    ohiobeeproject->>iot.bugtrack.org.uk: SSH Proxy Jump
    iot.bugtrack.org.uk->>raspberry1: SSH Connection
    ohiobeeproject-->>raspberry1: SSH Connection
```
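For orientation, a reverse tunnel of this kind could be opened from a Pi with something like the following hypothetical command (the port follows the 5000 + device number convention used later in this README; how the tunnels are actually maintained on the Pis is outside the scope of this repository):

```sh
# Run on raspberry1: expose the Pi's local SSH port (22) as port 5001
# on the jump host, so connections to localhost:5001 there reach the Pi.
ssh -N -R 5001:localhost:22 data-pipeline-svc@iot.bugtrack.org.uk
```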
Each machine must be able to connect to its target automatically, without human intervention. To make the remote hosts accept key-based authentication, we need to configure the authorized_keys file on each target machine (the jump host and the Raspberry Pis). The configuration below should be set up on the UoS virtual machine, and the public keys must be installed on the remote hosts at AWS and Ohio to enable automatic key-based authentication.
Connect to the data pipeline machine:
```sh
ssh <username>@ohiobeeproject.sheffield.ac.uk
```

The following settings assume we're acting as the service account:
```sh
sudo su - ohiobeeprojectsvc
```

For the data transfer service machine (ohiobeeproject) to connect to the jump host, we need an SSH key. This only needs to be done once, when the connections are first configured. On the ohiobeeproject machine, create a key for the jump host and copy the public key to the target machine.
```sh
user="data-pipeline-svc"
# Create an SSH key (this will create private and public keys)
ssh-keygen -f ~/.ssh/bugtrack -N "" -t ecdsa
# Copy to the jump host
scp ~/.ssh/bugtrack.pub $user@iot.bugtrack.org.uk:~/.ssh/authorized_keys
```

To set up connections to new Raspberry Pi devices, please run the following steps on the ohiobeeproject machine to create and install private and public keys in the appropriate places. The diagram below gives an overview of which SSH keys and configuration files should exist at each location.
```mermaid
---
title: SSH file locations
---
flowchart LR
    subgraph ohiobeeproject
        private_key1@{ shape: doc, label: "AWS Private key" }
        public_key1@{ shape: doc, label: "AWS Public key" }
        private_key2@{ shape: doc, label: "RPI nn private key" }
        public_key2@{ shape: doc, label: "RPI nn public key" }
        known_hosts@{ shape: doc }
        ssh_config@{ shape: doc, label: "~/.ssh/config" }
    end
    subgraph bugtrack
        authorized_keys1@{ shape: doc, label: "~/.ssh/authorized_keys" }
    end
    subgraph raspberrynn
        authorized_keys2@{ shape: doc, label: "~/.ssh/authorized_keys" }
    end
    public_key1 -. "Copy" .-> authorized_keys1
    public_key2 -. "Copy" .-> authorized_keys2
```
Specify the identifiers of the target machines, either as a numerical range or as a list of specific numbers:
```sh
raspberry_ids="$(seq 1 50)"
raspberry_ids="31 34 35"
```

On the ohiobeeproject machine, generate SSH private and public keys for each target machine.
```sh
for i in $raspberry_ids
do
    host="raspberry$i"
    ssh-keygen -f ~/.ssh/$host -N "" -t ecdsa
done
```

Configure the jump connection using the SSH configuration file:
```sh
nano ~/.ssh/config
```

A Bash script to generate most of the config file:
```sh
for i in $raspberry_ids
do
    host="raspberry$i"
    port=$((5000 + $i))
    printf "host $host\n    hostname localhost\n    user pi\n    port $port\n    identityfile ~/.ssh/$host\n    proxyjump awsbox\n\n"
done
```

Redirect the loop's output to append the generated entries, for example with `>> ~/.ssh/config`. The SSH configuration file should be saved on the ohiobeeproject virtual machine at /home/ohiobeeprojectsvc/.ssh/config. This file should look something like this, with an entry for the jump host and entries for each target remote host:
```
# AWS EC2 instance
host awsbox
    hostname iot.bugtrack.org.uk
    port 22
    identityfile ~/.ssh/bugtrack
    user data-pipeline-svc

# Raspberry Pi
host raspberry1
    hostname localhost
    port 5001
    user pi
    identityfile ~/.ssh/raspberry1
    proxyjump awsbox
```
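With the configuration in place, the jump-host entry can be tested on its own before setting up the Pis (you will be prompted to accept the host key if it is not yet in known_hosts):

```sh
ssh awsbox
```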
Copy each machine's public key from the ohiobeeproject VM to the corresponding Raspberry Pi (this step will require username-password authentication) to enable passwordless key-based authentication using the authorized_keys file.
```sh
for i in $raspberry_ids
do
    host="raspberry$i"
    scp ~/.ssh/$host.pub $host:~/.ssh/authorized_keys
done
```

We can now set up the known_hosts file on the ohiobeeproject VM, which stores recognised remote machines.
```sh
ssh-keyscan -H iot.bugtrack.org.uk >> ~/.ssh/known_hosts
```

Next, check the key fingerprint for each Ohio host.
You need to enter yes for each prompt to confirm that the host key fingerprint is correct.
This only needs to be done once when the connection is first configured.
```sh
for i in $raspberry_ids
do
    host="raspberry$i"
    echo $host
    ssh $host -t "ip addr show | grep link/ether"
done
```

To test this out manually, try a passwordless connection to a single remote host:
```sh
ssh raspberry31
```

The services defined in this repository are systemd units that are controlled using `systemctl`.
TODO
View the service status:

```sh
sudo systemctl status copy-to-storage.timer
```

To view the systemd logs using `journalctl`:
```sh
sudo journalctl -u copy-to-storage.service --lines=100
```

You can watch it run live by using the follow option:
```sh
sudo journalctl -u copy-to-storage.service --follow
```

The timer (`copy-to-storage.timer`) will run on a regular schedule and initiate the service (`copy-to-storage.service`).
Ensure the service is activated:

```sh
sudo systemctl enable copy-to-storage.timer
```

Start the timer:

```sh
sudo systemctl start copy-to-storage.timer
```

Stop the timer:

```sh
sudo systemctl stop copy-to-storage.timer
```
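To trigger a one-off transfer without waiting for the timer, the service unit can be started directly (assuming a run on demand is acceptable):

```sh
sudo systemctl start copy-to-storage.service
```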