1. DataSHIELD packages structure

DataSHIELD packages have a very particular structure, that is in first instance defined by being a combination of two package that act simultaniously. The reason for this is that when we analyze data through DataSHIELD we have two actors, 1 the client (who does the analysis) and 2 the server (where the data is stored), given the “taking the analysis to the data” approach of DataSHIELD we can easily understand that one package will be communicating from the analyst laptop to the package on the data server.

We can illustrate this interaction with a simple figure.

DataSHIELD architecture

Looking at this figure we can clearly identify the two distinct actors. Each of them will use a different analysis package. For that reason when we develop a DataSHIELD package, we are actually developing a collection of two packages that will work in tandem.

Naming conventions

As with many different software and architectures, there are a set of (somtimes un-) written rules that developers stick to. In this case, no official writting is available, so this document can serve as a starting point.

First and foremost, use camelCase naming convention.

Naming the client package

The naming for the client package will follow this rule

dsXxxClient

Where Xxx is the actual name. E.g. a metabolomics exposome package could be named dsMetabolomicsClient.

Naming of the server package

The server package will use the same Xxx and be named as

dsXxx

Following on the metabolomics example we would have dsMetabolomics.

Function naming on the client

The naming of functions on the client package will follow

ds.Xxx

E.g. in metabolomics ds.ComputeMetaboScore.

Function naming on the server

Most times a client function will call a specific one on the server, therefore we name them the same.

XxxDS

E.g. in metabolomics ComputeMetaboScoreDS.

Please note this is not a strict rule, it is just a good practice. If all developers stick to ~ similar ~ rules, the feel for the actual users will be smoother, which will improve engagement on the project.

DSI: Communicating from client to server

Now that we understand why two packages are required and how to name them, let’s see which tool we have to perform communications between them. The tool that we will use is called DSI, short for “DataSHIELD Interface”. This tool will allow us to send queries from the packages on the client to the server ones, as well as recieve the server response with the results that we want to use or show the the researcher.

Note

The DSI package is also responsible for creating connections to the server (datashield.login), handling workspaces (datashield.workspaces) and many other stuff, this is not connected to package development.

Communication types

With DSI there are two different types of communications we can perform, one telling the server to assign the result of a function to a variable, the other telling the server to return the result of a function. It is important to notice that not any function can be used; we, as package developers, will develop a set of functions on the server package and each of them will only be available to be used one way or another, therefore, this information will be specified on the server package (see DATASHIELD file).

Given that these functions communicate from the client package to the server package, they will only be used on the client package. It is important to note that to function they will require the information to which servers to talk to, more on this in a second (The datasources object).

Assign functions

Assign functions, as the name implies, will assign something to the R session that is running on the server. For example if we have loaded a data table, we can write a function that takes the first two columns and assigns it to a new variable.

sequenceDiagram
    Client->>+Server: Run `helloWorld()` and assign the results to a variable named "hello"
    Server-->>-Client: Done deal boss (but in silence)

Please note that after running an assign function, the client does not recieve anything, hence (but in silence) on the diagram.

Code-wise this will look like this

DSI::datashield.assign.expr(
  conns = conns,
  symbol = "hello",
  expr = call("helloWorld()")
)

We are passing to the DSI::datashield.assign.expr function 1 the conns object (more on this here The datasources object), 2 the variable where the result will be assigned on the server and 3 the function whose result will be assigned to the new variable.

Aggregate functions

Aggregate functions will run a function on the server and the result will be sent to the client. Beware, a poorly written function on the server might be able to return raw data to the client, completely defeating the point of using DataSHIELD. The process in this case looks like this

sequenceDiagram
    Client->>+Server: Run `helloWorld()` and return me the results of it
    Server-->>-Client: Done deal boss (sends data to the client)

Code-wise this will look like this

DSI::datashield.aggregate(
  conns = conns,
  expr = call("helloWorld()")
)

Here in this case the function is returning something, so we can assign it myValue <- DSI::datashield.aggregate(...) and use the value on the client, maybe we want to plot it or just send it to the user directly.

DATASHIELD file

As we have just seen, from the client we can tell the server to run functions and assign the results to an object on the server, or to send the results to the client. However, we as developer can specify which functions are allowed to return values to the client, and which will assign it’s results to the server R session.

To do that we will add a file named DATASHIELD on our server. More precisely it will be located as follows

.
└── dsServer/
    ├── R/
    │   └── ds.function.R
    ├── inst/
    │   └── DATASHIELD
    ├── DESCRIPTION
    └── NAMESPACE

This file is a simple plain text file with the following structure

AssignMethods:
  myAssingFunc1DS,
  myAssingFunc2DS
AggregateMethods:
  myAggregateFunc1DS,
  myAggregateFunc2DS

Alternatively, we can also include this information into the DESCRIPTION file of our server R package following the same formatting. Refer to dsBase DESCRIPTION file for an example.

Note

Keeping this information on a separate file guarantees a cleaner DESCRIPTION file, specially relevant for extense packages with many functions. Feel free to use any of the two methods at your own judgment.

The datasources object

We have been talking about communicating between the client and the server and telling the server to run things. However, we need some object that contains the information about which server are we talking to. This oject is created by the login functions on DataSHIELD. Here a brief example

builder <- DSI::newDSLoginBuilder()
builder$append(server = "myServer",
               url = "https://myDSserver.org",
               user = "ds_user", password = "password")
logindata <- builder$build()
connections <- DSI::datashield.login(logins = logindata)

This connection object is what our DSI functions will use to know to which servers to talk. For that reason, on the client functions that we write we will put an argument to pass it. Typically this argument is named datasources and defaults to NULL. It can default to NULL because DSI offers a helper function to search for this object on the client R environment and use it; nevertheless it is important that we can let the user specify which connection object they want to use, maybe someone creates two different connection on the same session and wants to use a certain one with different functions.

Here is how a typical client function starts and how to use this helper function from DSI

ds.Function <- function(..., datasources = NULL){
  
  if (is.null(datasources)) {
    datasources <- DSI::datashield.connections_find()
  }

  # Rest of the function

}

You can see that the function DSI::datashield.connections_find will find the connection object on the environment and use it during the rest of the function call.

Exercise 1

  1. Create two R packages: dsClient and dsServer

Make your life easier by using usethis::create_package("dsClient")

  1. On the dsClient package create a function named ds.assignString that takes as input argument a string, and sends it to the assign function assignStringDS from the dsServer package.

Add #' @export on top of your new R function to export it. Then, run devtools::document() so it gets added to the NAMESPACE file of your package automatically, this will allow you to library(dsClient) and be able to call ds.assignString.

  1. On the dsServer package create a function named assignStringDS that takes as input argument a string, and returns that string.

  2. Declare the assignStringDS as a DataSHIELD assign function.