--- title: "01 Background LEEF Data" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{01 Background LEEF Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # library(LEEF) ``` # Initial Remarks and Prerequisite The pipeline server is located in the S3IT Science Cloud and is accessible from within the UZH network. To access it from outside the UZH network, it is necessary to use the **UZH VPN**! This applies to **all** activities on the server, e.g. uploading, downloading, managing and mounting the samba share! ## Manage Pipeline Server Instance To manage the pipeline server instance itself, you have to connect to Dashboard of the S3IT Science cloud at [https://cloud.s3it.uzh.ch/auth/login/?next=/](https://cloud.s3it.uzh.ch/auth/login/?next=/). Normally, no interaction with the dashboard is necessary. See the [Admin Guide](LINK) for details. ## bash Shell Before you can use the bash scripts for management of the pipeline, you need a terminal running a bash shell. These are included in Mac and Linux, but need to be installed on Windows. Probably the easiest approach is to use the Windows Subsystem for Linux. This can relatively easily be installed as described [here](https://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/). Please see [this How-To Geek article](https://www.howtogeek.com/261591/how-to-create-and-run-bash-shell-scripts-on-windows-10/) for details on how to run bash script after the WSL is installed. In this Linux Bash Shell in Windows you can execute the bash scripts provided. Other options for Windows include To be added after input from Uriah. ## ssh Client To be able to remotely log in the pipeline server, an ssh client is needed. In Mac, Linux and the WSL are these builtin. Another widely used ssh shell under Windows is provided by [putty](https://www.putty.org). ## ssh keys For any interaction with the pipeline server you have to authenticate. The pipeline server is setup to only accept passwordless logins (except of mounting the SAMBA share), which authenticates by using a so called ssh certificate, which is unique to your computer. This is safer than password authentication and much easier to use once setup. Before you generate a new one, you should check if you already have one (not impossible) by executing ```{bash, eval = FALSE} cat ~/.ssh/id_rsa.pub ``` from the bash shell. If it tells you `No such file or directory`, you have to generate one as described in the [S3IT training handout](https://docs.s3it.uzh.ch/cloud/training/training_handout/#create-a-key-on-linuxmacwindows10) for Mac, Windows or Linux. After you have an ssh key for your computer, you have to contactsomebody who already has access to the pipeline server to have it added to the instance so that you can interact with the pipeline server either via the local pipeline management scripts (which are using ssh) or logging in using ssh directly (username `ubuntu`) and executing commands on the pipeline server. 
# LEEF Data

## Measurements in the Pipeline

- **bemovi** see [link](location) for manual
- **flowcam** see [link](location) for manual
- **flowcytometer** see [link](location) for manual
- **manual count** see [link](location) for manual
- **O~2~ meter** see [link](location) for manual

# Raw Data Folder Structure for the Pipeline

The data submitted to the pipeline consists of two folders: the folder `0.raw.data` containing the measured data and measurement-specific metadata, and the folder `00.general.parameter` containing the metadata and some data used by all measurements. The folder structure has to be as follows:

```
0.raw.data
├── bemovi.mag.16
│   ├── ........_00097.cxd
│   ├── ...
│   ├── ........_00192.cxd
│   ├── bemovi_extract.mag.16.yml
│   ├── svm_video.description.txt
│   ├── svm_video_classifiers_18c_16x.rds
│   └── svm_video_classifiers_increasing_16x_best_available.rds
├── bemovi.mag.25
│   ├── ........_00001.cxd
│   ├── ...
│   ├── ........_00096.cxd
│   ├── bemovi_extract.mag.25.cropped.yml
│   ├── bemovi_extract.mag.25.yml
│   ├── svm_video.description.txt
│   ├── svm_video_classifiers_18c_25x.rds
│   └── svm_video_classifiers_increasing_25x_best_available.rds
├── flowcam
│   ├── 11
│   ├── 12
│   ├── 13
│   ├── 14
│   ├── 15
│   ├── 16
│   ├── 17
│   ├── 21
│   ├── 22
│   ├── 23
│   ├── 24
│   ├── 25
│   ├── 26
│   ├── 27
│   ├── 34
│   ├── 37
│   ├── flowcam_dilution.csv
│   ├── flowcam.yml
│   ├── svm_flowcam_classifiers_18c.rds
│   └── svm_flowcam_classifiers_increasing_best_available.rds
├── flowcytometer
│   ├── ........
│   ├── .........ciplus
│   ├── gates_coordinates.csv
│   └── metadata_flowcytometer.csv
├── manualcount
│   └── .........xlsx
└── o2meter
    └── .........csv
```

```
00.general.parameter
├── compositions.csv
├── experimental_design.csv
└── sample_metadata.yml
```

See the [document](https://teams.microsoft.com/l/file/38CD9349-F25C-4773-B08A-39329D23F221?tenantId=c7e438db-e462-4c22-a90a-c358b16980b3&fileType=docx&objectUrl=https%3A%2F%2Fuzh.sharepoint.com%2Fsites%2FLEEF%2FShared%20Documents%2FGeneral%2F05_protocols%2FData_upload_manual.docx&baseUrl=https%3A%2F%2Fuzh.sharepoint.com%2Fsites%2FLEEF&serviceName=teams&threadId=19:d14a98eb700e4721b35583265c27e8a3@thread.tacv2&groupId=13ba9e9a-1d5d-42b1-9a95-08fd956973cc) on Teams for the detailed steps necessary to assemble the data and the required metadata.

These two folders need to be uploaded to the pipeline server, and then the pipeline needs to be started.

## Uploading and managing the Pipeline

There are two approaches to uploading the data to the pipeline server and starting the pipeline afterwards: using the [local bash scripts](c_02_running_pipeline_lpm.html) from a local computer, or executing the commands [from the pipeline server](c_03_running_pipeline_manual.html). The recommended approach is to use the [local bash scripts](c_02_running_pipeline_lpm.html), as this minimises the likelihood of errors or accidental data loss. Nevertheless, for some actions it might be necessary to work directly on the pipeline server, usually via an ssh session, and to execute commands [on the pipeline server](c_03_running_pipeline_manual.html).
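Whichever approach is used, it is worth checking locally that the two folders are complete before uploading. A minimal sketch (the path `~/LEEF_upload` is a placeholder for wherever the data has been assembled):

```{bash, eval = FALSE}
cd ~/LEEF_upload

# both top-level folders must be present
ls -d 0.raw.data 00.general.parameter

# the general parameters consist of exactly these three files:
# compositions.csv  experimental_design.csv  sample_metadata.yml
ls 00.general.parameter

# one sub-folder per measurement:
# bemovi.mag.16  bemovi.mag.25  flowcam  flowcytometer  manualcount  o2meter
ls 0.raw.data
```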
# Folder structure after pipeline

After completing the pipeline, the folder `LEEF` on the pipeline server will look as follows:

```
./LEEF
├── 0.raw.data
│   ├── bemovi.mag.16
│   ├── bemovi.mag.25
│   ├── flowcam
│   ├── flowcytometer
│   ├── manualcount
│   └── o2meter
├── 00.general.parameter
│   ├── compositions.csv
│   ├── experimental_design.csv
│   └── sample_metadata.yml
├── 1.pre-processed.data
│   ├── bemovi.mag.16
│   ├── bemovi.mag.25
│   ├── flowcam
│   ├── flowcytometer
│   ├── manualcount
│   └── o2meter
├── 2.extracted.data
│   ├── bemovi.mag.16
│   ├── bemovi.mag.25
│   ├── flowcam
│   ├── flowcytometer
│   ├── manualcount
│   └── o2meter
├── 3.archived.data
│   ├── extracted
│   ├── pre_processed
│   └── raw
├── 9.backend
│   ├── LEEF.RRD.sqlite
│   ├── LEEF.RRD_bemovi_master.sqlite
│   ├── LEEF.RRD_bemovi_master_cropped.sqlite
│   ├── LEEF.RRD_flowcam_algae_metadata.sqlite
│   └── LEEF.RRD_flowcam_algae_traits.sqlite
├── log.2021-03-03--15-06-32.fast.done.txt
├── log.2021-03-03--15-06-32.fast.txt
├── log.2021-03-03--15-14-20.bemovi.mag.16.done.txt
├── log.2021-03-03--15-14-20.bemovi.mag.16.error.txt
├── log.2021-03-03--15-14-20.bemovi.mag.16.txt
├── log.2021-03-03--15-14-20.bemovi.mag.25.done.txt
├── log.2021-03-03--15-14-20.bemovi.mag.25.error.txt
└── log.2021-03-03--15-14-20.bemovi.mag.25.txt
```

## 1.pre-processed.data

This folder contains the pre-processed data. Pre-processed means that the raw data (`0.raw.data`) is converted into open formats wherever this can be done **losslessly**, and compressed (in the case of the bemovi videos). All further processing is done with the pre-processed data.

## 2.extracted.data

This folder contains the data which will be used in further analysis outside the pipeline. It contains the intermediate extracted data as well as the data which will finally be added to the backend (`9.backend`). The final extracted data for the backend is in `csv` format.

## 3.archived.data

Data is archived as raw data, pre-processed data, and extracted data. In each of the respective folders, a folder named after the timestamp specified in `sample_metadata.yml` is created, which contains the actual data. The raw data as well as the pre-processed bemovi data is simply copied over, while the other data is stored as `.tar.gz` archives. For all files, sha256 hashes are calculated to guarantee the correctness of the data.

## 9.backend

The backend consists of the sqlite database **LEEF.RRD.sqlite**, which contains the Research Ready Data (RRD) of all measurements. This database will be used for the further analysis.
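As an illustration, the RRD database can be read directly from R, e.g. via the DBI and RSQLite packages. This is only a minimal sketch assuming a local copy of `LEEF.RRD.sqlite`; the table name used below is a placeholder and has to be replaced by one of the names returned by `dbListTables()`.

```{r, eval = FALSE}
library(DBI)

# connect to a local copy of the backend database
con <- dbConnect(RSQLite::SQLite(), "LEEF.RRD.sqlite")

# list the tables contained in the Research Ready Data
dbListTables(con)

# read one table into a data.frame
# ("o2" is a placeholder -- use one of the names listed above)
o2 <- dbReadTable(con, "o2")

dbDisconnect(con)
```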
## log.xxx.txt

These files contain the logs of the pipeline runs. In addition, each numbered directory also contains a single log file.

# Pipeline server samba share

The `LEEF` folder of the pipeline server is accessible via a samba share. This is rather slow and not the preferred way of uploading / downloading data, but it is useful to mount the share to follow the progress of the pipeline and to investigate errors further. The share shows the `LEEF` folder. For a description see the [Background LEEF Data](location) article. The credentials are available in a private internal document.

## Mount samba share on a Mac

- open Finder
- press ⌘K to open 'Connect to Server ...'
- enter `smb://USERNAME@IP` and, when prompted for a password, enter it and tick the box 'Remember password' (or similar wording); the next time, you will not have to enter the password

## Mount samba share on Windows

To be added: somebody with a Windows computer has to provide this information. Until then, see the generic sketch below.
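As a stop-gap, the share can usually be mapped with the standard Windows tools, e.g. 'Map network drive' in the File Explorer or `net use` in a command prompt. This is a generic sketch that has not been verified on the LEEF setup; `L:`, `IP`, `SHARENAME` and `USERNAME` are placeholders.

```
net use L: \\IP\SHARENAME /user:USERNAME
```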