**Note: this document does not reflect the actual package and database!**
The workflow should be separated from the functionality, i.e. an R package contains the functionality, while scripts (or other packages) contain the workflow, load the functionality package and control the analysis.
The forecast should be triggered automatically after the data has been updated, so that the forecasts are perfectly in sync with the data. The Forecast repo will contain a link to the tag of the data used for the last release. It would be ideal to have the same tags for the Data release and the Forecast release.
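A minimal sketch of how such a shared tag could be created from R, assuming the `git2r` package and a local clone of the data repo (the path and the tag naming scheme are assumptions):

```r
## Tag the data release; the forecast release would reuse the same tag name.
library(git2r)
data_repo   <- repository("path/to/LEEF.Data")        # assumed local clone
release_tag <- format(Sys.Date(), "data-%Y-%m-%d")    # assumed naming scheme
tag(data_repo, name = release_tag, message = "Data release used by the forecast")
```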
Trusted Timestamps will be automatically created using the `ROriginStamp` package when data is archived.
All file sizes are per sample.
Raw data (size per sample) | Processing | Extracted data (size per sample) | Notes |
---|---|---|---|
`.c6` (100…200MB) | `BioBase` package reads `.fca` into R, and `flowCore` | `.csv` with count of particles of gated dataset (1…2MB) | Archive the `.c6` file or the one converted to `.fca`? Large storage space requirements! |
IMAGE (a lot) | | `.csv` with count of each species (< 1MB) | |
VIDEO (500MB) | `Bemovi` R package | `.csv` file with size info for each individual particle identified (< 1MB) | |
`.xls` (< 1MB) | | `.csv` file with count of individuals in each species and dilution (< 1MB) | |
`.csv` (< 1MB) | | `.csv` file with info on O2 and temperature (< 1MB) | |
`.csv` (< 1MB) | | `.csv` (< 1MB) | Need info from Yves. |

These (DOI and TTS) need to be enabled in the repo itself via webhooks.
Alternatively, one could also call them from R via a curl command; the advantage would be more control.
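A sketch of such a call from R (the endpoint URL and the payload are entirely hypothetical; `httr` wraps curl):

```r
## Trigger a service from R instead of relying on a repo webhook, e.g. to
## request a TTS or DOI after a commit. URL and body are placeholders only.
library(httr)
resp <- POST(
  "https://example.org/api/hook",
  body = list(repo = "LEEF.Data", commit = "abc123"),
  encode = "json"
)
status_code(resp)
```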
The format has to be human readable; YAML fits well and one can use the config package (https://cran.r-project.org/web/packages/config/) to load the configs easily. Also, one can specify the configuration for the master as well as the child repos in one config file which can be included in all repos of the set.
Possible fields (with default values) could be the following (if one is using R instead of webhooks for DOI and TTS):
```yaml
default:
  doi: FALSE
  tts: TRUE
  data:
    backend:

master:
  doi: FALSE
  tts: TRUE
  data:
    backend:
      mssql:
        Database: "[database name]"
        UID: "[user id]"
        PWD: "[password]"
        Port: 1433

public:
  doi: TRUE
  tts: TRUE
  data:
    backend:
      csv:
        folder: rawData

heatwave_e_1:
  doi: TRUE
  tts: TRUE
  data:
    backend:
      csv:
        folder: rawData

heatwave:
  doi: FALSE
  tts: TRUE
  data:
    backend:
      mysql:
        Database: "[database name]"
        UID: "[user id]"
        PWD: "[password]"
```
In addition, the repo will contain one file named CONFIG_NAME which contains the name of the config to be used, e.g. `master` if it is the master config.
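A minimal sketch of how the active configuration could then be loaded in R, assuming the `config` package and the CONFIG_NAME file described above:

```r
## Read the name of the active configuration and load the matching section
## from config.yml (values not set there fall back to the `default` section).
library(config)

config_name <- readLines("CONFIG_NAME", n = 1)                 # e.g. "master"
cfg <- config::get(config = config_name, file = "config.yml")

cfg$doi            # FALSE for the master config above
cfg$data$backend   # backend definition for this repo
```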
data | child repo key |
---|---|
… | public_MSC_Peter |
… | public_PhD_Mary |
… | heatwave_private |
… | heatwave |
Do we need multiple child repos? They should be easy to implement and add flexibility.

If there are not too many, a YAML config file could be used, otherwise a table. The YAML file would be easier to edit.

##### Table
child repo key | child repo | from | until |
---|---|---|---|
public_MSC_Peter | LEEF.public | 12.05.2019 | |
public_PhD_Mary | LEEF.public | 12.05.2022 | |
heatwave_e_1 | LEEF.heatwave.public | 01.01.2019 | 01.01.2019 |
heatwave_private | LEEF.heatwave | 01.01.2018 | 01.01.2020 |
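To illustrate how the `from` / `until` columns could be applied when publishing to a child repo, a small sketch (the data frame `dat` and its `date` column are hypothetical):

```r
## Hypothetical helper: keep only rows of `dat` whose `date` falls inside a
## child repo's from/until window (dates given as dd.mm.yyyy, as in the table).
publish_window <- function(dat, from, until) {
  from  <- as.Date(from,  format = "%d.%m.%Y")
  until <- as.Date(until, format = "%d.%m.%Y")
  dat[dat$date >= from & dat$date <= until, , drop = FALSE]
}

## e.g. for the heatwave_private entry above:
# publish_window(dat, from = "01.01.2018", until = "01.01.2020")
```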
- `existingDataOK()`
- `newDataOK()`
- `importData()`
- `publishData()`

GitHub repos are used to archive the data and the forecasts. They also host the R package which contains the functionality, but here I will focus on the Data GitHub repos.
We need some private repos, as some data will be embargoed for some time due to thesis and publication priorities.
Repo containing this document and all information about the other repos and links.
This repo is used for archiving all data. It will contain the checked and cleaned data. Functions from the Publishing R package can be used to calculate summary stats and show these on a website.
The repo is structured as an R package which contains all the data.
In addition, the data is stored in csv format in the `rawData` folder for access from other programs. The easiest way to get the updated data is to update the package in R.
This repo, when receiving a pull request, triggers a Travis CI build to

- check the data contained in the pull request
- clean the data contained in the pull request
- update the data in the repo, if the data is OK and cleaned successfully, via a commit as a new version
- create a Trusted Timestamp for all transactions
- publish the public data to the public repo
- trigger the forecasting after updating
Layout Based on https://gist.github.com/QuantumGhost/0955a45383a0b6c0bc24f9654b3cb561
See the LEEF.Processing package
bftools from https://docs.openmicroscopy.org/bio-formats/5.8.2/users/comlinetools/conversion.html (2018/06/22)
Pre-processors and extractors go into separate packages. These will contain the functions and the code to add them to the queue for processing. This adds flexibility, as additional data sources can be added more easily.
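As a rough sketch of what such a package's `register()` function might do (the queue-adding helpers `add_pre_processor()` and `add_extractor()` are hypothetical names, not the actual API):

```r
## Hypothetical sketch: a measurement package registers its own functions in
## the central processing queues. All function names are illustrative only.
register <- function() {
  add_pre_processor(pre_process_flowcytometer)  # e.g. convert .c6 to an open format
  add_extractor(extract_flowcytometer)          # e.g. extract particle counts
  invisible(TRUE)
}
```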
Which metadata should be stored and in which format? Metadata is required at different levels:
`.rds` files in the `LastAdded` folder

This function will publish the data. It is only necessary for public repos, but the functionality needs to be included here. This needs to be done in a GitHub repo; is GitLab possible?
After the TTS is obtained, the seed is downloaded and saved in the directory as well. This needs to be automated and done later, as the seed is only available about 24 hours after the initial submission.
This is unlikely using the current structure and hopefully new options will materialise after the Bern meeting.
The TTS (Trusted Time Stamp) is requested from `archive_new_data()` and is stored in the same directory as the archive. This is used only to checksum the archive and for obtaining the TTS for the archive hash.
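For illustration, hashing the archive could look like this (a sketch; the file name is hypothetical and `digest` is just one possible implementation):

```r
## Compute the sha256 checksum of the archive; this hash is what the TTS is
## requested for.
library(digest)
archive_file <- "3.archived.data/archive.tar.gz"   # hypothetical archive file
archive_hash <- digest(archive_file, algo = "sha256", file = TRUE)
archive_hash
```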
- Conversion to open formats for archiving and further processing
- Extraction of data and storage in a data.table / tibble for addition to the database
- raw data is for archiving, DOI, …

At the moment, data is converted from proprietary to open formats and archived afterwards.
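As an example, the extraction step for the flow cytometer data could look roughly like this (a sketch assuming an FCS-format file named `sample_001.fcs`; the actual extractor lives in the measurement package):

```r
## Read a flow cytometer file with flowCore and turn the event data into a
## tibble for the database backend.
library(flowCore)
library(tibble)

fcs    <- read.FCS("sample_001.fcs")   # assumed open-format file
events <- as_tibble(exprs(fcs))        # one row per measured particle
nrow(events)                           # number of particles in the sample
```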
Remove the `CONFIG.NAME` file and incorporate it in the `config.yml` file.
Configuration should be done by running an `initialize_db()` function which

1. reads `config.yml` from the working directory
2. creates the directory structure (if not already existing)
3. creates an empty database (if not already existing)
4. adds pre-processors, extractors, …

This will make the usage of different databases much easier and the package much more versatile.
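A sketch of what such an `initialize_db()` could look like, assuming the `yaml` package and the field names of the default `config.yml` shown later in this document (the actual implementation may differ):

```r
## Sketch only: read the config, create the directory layout and register the
## measurement packages. Creating the empty database (step 3) is omitted here.
initialize_db <- function(file = "config.yml") {
  cfg <- yaml::read_yaml(file)

  # 2) create the directory structure (if not already existing)
  for (d in unlist(cfg$directories)) {
    dir.create(d, showWarnings = FALSE, recursive = TRUE)
  }

  # 4) install and register pre-processors, extractors, ...
  for (pkg in cfg$measurement_packages) {
    eval(parse(text = pkg$InstallCommand))
    eval(parse(text = pkg$RegisterCommand))
  }

  invisible(TRUE)
}
```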
This repo contains an R package for the data from the experiments. It only contains the infrastructure, while the source-specific processing is provided by source-specific packages which are called from this package.
The data is stored locally in a directory structure or an SQLite database, but remote storage is planned.
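A minimal sketch of accessing such a local SQLite backend (the file name `LEEFData.sqlite` and the `9.backend` folder are taken from the commented-out backend entry in the default config shown later; they are assumptions here):

```r
## Connect to the local SQLite backend and list the tables it contains.
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "9.backend/LEEFData.sqlite")
dbListTables(con)
dbDisconnect(con)
```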
Functions which need to be called by the user are:

- `initialize_db()`: to read the config file and to set up the needed directory structure
- `import_new_data()`: to import new data
- TODO
The extraction of the data from the flowcytometer depends on Bioconductor packages. They can be installed as follows (for details see https://bioconductor.org/install/#install-bioconductor-packages):
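A minimal sketch of that installation, assuming the Bioconductor packages in question are `flowCore` and `Biobase` (the ones referred to above):

```r
## Install Bioconductor packages via BiocManager (see the link above for details).
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("flowCore", "Biobase"))
```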
It is also possible that the following packages have to be installed by hand:
As the package is only on GitHub, you need `devtools` as a prerequisite to install it easily.
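For example (the GitHub repository path `LEEF-UZH/LEEF` is an assumption based on the drat repo used elsewhere in this document):

```r
## Install devtools first, then the package straight from GitHub.
install.packages("devtools")
devtools::install_github("LEEF-UZH/LEEF")
```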
The package needs some information to be able to handle the import, storage and retrieval of the data. This information is stored in a file called, by default, `config.yml`.

The default `config.yml` included in the package looks as follows:
```yaml
# General info
# --------------------------------------------------------------------
# General info regarding the processing


name: LEEF
description: LEEF Data from a long term experiment.
  Some more detailed info has to follow.
maintainer: Rainer M. Krug <[email protected]>


# --------------------------------------------------------------------
# Directories for the processing
# --------------------------------------------------------------------
## The folder structure in this directory
## has to be one folder for each measurement type.
##


directories:
  raw: "0.raw.data"
  pre_processed: "1.pre-processed.data"
  extracted: "2.extracted.data"
  archive: "3.archived.data"
  backend: "9.backend"
  tools: "tools"


# --------------------------------------------------------------------
# Packages which contain the pre_processors, extractors, additors, ...
# --------------------------------------------------------------------
# These will be installed using the `InstallCommand` and registered
# in the queue using the `RegisterCommand`.
# The `RegisterCommand` can also contain additional customizations needed by the processors.


measurement_packages:
  LEEF.measurement.bemovi:
    name: LEEF.measurement.bemovi
    InstallCommand: drat::addRepo('LEEF-UZH'); install.packages('LEEF.measurement.bemovi')
    RegisterCommand: LEEF.measurement.bemovi::tools_path(tools_path = '.'); LEEF.measurement.bemovi::register()
  LEEF.measurement.flowcam:
    name: LEEF.measurement.flowcam
    InstallCommand: drat::addRepo('LEEF-UZH'); install.packages('LEEF.measurement.flowcam')
    RegisterCommand: LEEF.measurement.flowcam::register()
  LEEF.measurement.flowcytometer:
    name: LEEF.measurement.flowcytometer
    InstallCommand: drat::addRepo('LEEF-UZH'); install.packages('LEEF.measurement.flowcytometer')
    RegisterCommand: LEEF.measurement.flowcytometer::register()
  LEEF.measurement.manualcount:
    name: LEEF.measurement.manualcount
    InstallCommand: drat::addRepo('LEEF-UZH'); install.packages('LEEF.measurement.manualcount')
    RegisterCommand: LEEF.measurement.manualcount::register()
  LEEF.measurement.o2meter:
    name: LEEF.measurement.o2meter
    InstallCommand: drat::addRepo('LEEF-UZH'); install.packages('LEEF.measurement.o2meter')
    RegisterCommand: LEEF.measurement.o2meter::register()


# --------------------------------------------------------------------
# archival packages
# --------------------------------------------------------------------
# These will be installed using the `InstallCommand` and registered
# in the queue using the `RegisterCommand`.
# The `RegisterCommand` can also contain additional customizations needed by the processors.
# Additional values are archival package specific.


archive_packages:
  LEEF.archive.default:
    name: LEEF.archive.default
    InstallCommand: drat::addRepo('LEEF-UZH'); install.packages('LEEF.archive.default')
    RegisterCommand: LEEF.archive.default::register(compression = "none")


# --------------------------------------------------------------------
# backend packages
# --------------------------------------------------------------------
# These will be installed using the `InstallCommand` and registered
# in the queue using the `RegisterCommand`.
# The `RegisterCommand` can also contain additional customizations needed by the processors.
# Additional values are archival package specific.

## NOT IMPLEMENTED YET
## SOME MORE THOUGHT NEEDED HERE!


backend_packages:
  LEEF.backend.csv:
    name: LEEF.backend.csv
    InstallCommand: drat::addRepo('LEEF-UZH'); install.packages('LEEF.backend.csv')
    RegisterCommand: LEEF.backend.csv::register()
  # LEEF.backend.sqlite:
  #   name: LEEF.backend.sqlite
  #   InstallCommand: drat::addRepo('LEEF-UZH'); install.packages('LEEF.backend.sqlite')
  #   RegisterCommand: LEEF.backend.sqlite::register()
  #   dbpath:
  #   dbname: 'LEEFData.sqlite'


# --------------------------------------------------------------------
# Trusted Time Stamps
# --------------------------------------------------------------------
## NOT IMPLEMENTED YET
## SOME MORE THOUGHT NEEDED HERE!


tts:
  create: TRUE
  api_key: PRIVATE
  notification:
    notification_type: 0
    target: [email protected]


# --------------------------------------------------------------------
# DOI
# --------------------------------------------------------------------
## NOT IMPLEMENTED YET


doi: FALSE


# --------------------------------------------------------------------
# queues containing functions
# --------------------------------------------------------------------
## These should be left blank, as they WILL be overwritten.


queues:
  pre_processors:
  extractors:
  archivers:
  additors:
```
The fastest way to start a new data storage infrastructure is to create an empty directory and to copy the config file into this directory. Afterwards, change the working directory to that directory and initialise the folder structure and the package:
```r
# library("LEEF")
devtools::load_all(here::here())

nd <- "Data_directory"
dir.create(nd)
setwd(nd)

file.copy(
  from = system.file("config.yml", package = "LEEF"),
  to = "."
)

initialize_db()
```
After that, the data to be imported can be placed into the `ToBeImported` folder and imported by calling `import_new_data()`.
The `ToBeImported` folder contains subfolders whose names are equal to the names of the tables in the database into which the results will be imported. Details will follow later.
Data can be imported from that folder.
config.yml
The `config.yml` file contains all configuration options of the infrastructure. The example one looks as follows:
```yaml
default:
  maintainer:
    name: Max Mustermann
    email: Max@musterfabrik.example
  description: This is my fantastic repo
    with lots of data.

master:
  doi: FALSE
  tts: TRUE
  data:
    backend:
      mysql:
        Database: "[database name]"
        UID: "[user id]"
        PWD: "[password]"
        Port: 3306

LEEF:
  doi: TRUE
  tts: TRUE
  data:
    backend:
      sqlite:
        folder: inst/extdata
```
The top levels are different repo names (here `master` and `LEEF`) and default values for all the repos (`default`).
It has the following keywords:

- `default` contains self-explanatory values
- `doi` if TRUE, will request a DOI after a successful commit
- `tts` if TRUE, will request a Trusted Time Stamp of the repo after a successful commit
- `data` contains info about the data storage
- `backend` the name of the backend and connection info
- others to come
This config file should be the same for all repos in a tree, but it is not necessary.
CONFIG.NAME
This file contains the name of the configuration for this repo, in this example `master`. So by changing the content of this file to `LEEF`, this repo becomes a `LEEF` repo.
Before new data can be imported, this repo needs to be forked. Once forked, all user interaction takes place via this fork and pull requests to this repo. This fork needs to be cloned to a local computer.
An import of new data follows the following steps:

- the new data is placed in `inst/new_data/`
- the data has to follow the following rules: the files must be `.csv` files
- the import only runs when the `inst/new_data` directory is not empty (checked in the `after_success` section; see the sketch below)
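As a rough illustration (not the actual Travis configuration), the check performed in the `after_success` step could look like this in R, using the `import_new_data()` function described earlier:

```r
## Sketch only: trigger the import when new data is actually present.
new_data_dir <- "inst/new_data"
if (length(list.files(new_data_dir)) > 0) {
  import_new_data()
}
```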
As soon as this repo receives a pull request, it initiates the import via a Travis job. The importing is handled within the package `LEEF.Processing` and can be seen in detail in the package documentation.
Of relevance here are two actions:

- `LEEF.Processing` commits the imported data to this repository or to the external database.

The Data package contains the code for

- `check_existing_...()` to check the existing data
- `check_new_...()` to check the incoming new data
- `merge_...()` to merge the new data into the existing data

In addition, it contains the functions

- `existing_data_ok()`, which does all checks on the existing data
- `newData_ok()`, which does all checks on the new data
- `merge_all()`, which does all the merging

and finally `do_all()`, which does everything in the order

1. `check_existing_data()`
2. `check_new_data()`
3. `merge_all()`

`existing_data_ok()`

`import_new_data()` function

Layout based on https://gist.github.com/QuantumGhost/0955a45383a0b6c0bc24f9654b3cb561
Pre-processors and extractors into separate packages