1. DataSHIELD packages structure
DataSHIELD packages have a very particular structure, that is in first instance defined by being a combination of two package that act simultaniously. The reason for this is that when we analyze data through DataSHIELD we have two actors, 1 the client (who does the analysis) and 2 the server (where the data is stored), given the “taking the analysis to the data” approach of DataSHIELD we can easily understand that one package will be communicating from the analyst laptop to the package on the data server.
We can illustrate this interaction with a simple figure.
Looking at this figure we can clearly identify the two distinct actors. Each of them will use a different analysis package. For that reason when we develop a DataSHIELD package, we are actually developing a collection of two packages that will work in tandem.
Naming conventions
As with many different software and architectures, there are a set of (somtimes un-) written rules that developers stick to. In this case, no official writting is available, so this document can serve as a starting point.
First and foremost, use camelCase naming convention.
Naming the client package
The naming for the client package will follow this rule
dsXxxClient
Where Xxx
is the actual name. E.g. a metabolomics exposome package could be named dsMetabolomicsClient
.
Naming of the server package
The server package will use the same Xxx
and be named as
dsXxx
Following on the metabolomics example we would have dsMetabolomics
.
Function naming on the client
The naming of functions on the client package will follow
ds.Xxx
E.g. in metabolomics ds.ComputeMetaboScore
.
Function naming on the server
Most times a client function will call a specific one on the server, therefore we name them the same.
XxxDS
E.g. in metabolomics ComputeMetaboScoreDS
.
Please note this is not a strict rule, it is just a good practice. If all developers stick to ~ similar ~ rules, the feel for the actual users will be smoother, which will improve engagement on the project.
DSI: Communicating from client to server
Now that we understand why two packages are required and how to name them, let’s see which tool we have to perform communications between them. The tool that we will use is called DSI, short for “DataSHIELD Interface”. This tool will allow us to send queries from the packages on the client to the server ones, as well as recieve the server response with the results that we want to use or show the the researcher.
The DSI package is also responsible for creating connections to the server (datashield.login
), handling workspaces (datashield.workspaces
) and many other stuff, this is not connected to package development.
Communication types
With DSI there are two different types of communications we can perform, one telling the server to assign the result of a function to a variable, the other telling the server to return the result of a function. It is important to notice that not any function can be used; we, as package developers, will develop a set of functions on the server package and each of them will only be available to be used one way or another, therefore, this information will be specified on the server package (see DATASHIELD file).
Given that these functions communicate from the client package to the server package, they will only be used on the client package. It is important to note that to function they will require the information to which servers to talk to, more on this in a second (The datasources object).
Assign functions
Assign functions, as the name implies, will assign something to the R session that is running on the server. For example if we have loaded a data table, we can write a function that takes the first two columns and assigns it to a new variable.
Please note that after running an assign function, the client does not recieve anything, hence (but in silence) on the diagram.
Code-wise this will look like this
::datashield.assign.expr(
DSIconns = conns,
symbol = "hello",
expr = call("helloWorld()")
)
We are passing to the DSI::datashield.assign.expr
function 1 the conns
object (more on this here The datasources object), 2 the variable where the result will be assigned on the server and 3 the function whose result will be assigned to the new variable.
Aggregate functions
Aggregate functions will run a function on the server and the result will be sent to the client. Beware, a poorly written function on the server might be able to return raw data to the client, completely defeating the point of using DataSHIELD. The process in this case looks like this
Code-wise this will look like this
::datashield.aggregate(
DSIconns = conns,
expr = call("helloWorld()")
)
Here in this case the function is returning something, so we can assign it myValue <- DSI::datashield.aggregate(...)
and use the value on the client, maybe we want to plot it or just send it to the user directly.
DATASHIELD file
As we have just seen, from the client we can tell the server to run functions and assign the results to an object on the server, or to send the results to the client. However, we as developer can specify which functions are allowed to return values to the client, and which will assign it’s results to the server R session.
To do that we will add a file named DATASHIELD
on our server. More precisely it will be located as follows
.
└── dsServer/
├── R/
│ └── ds.function.R
├── inst/
│ └── DATASHIELD
├── DESCRIPTION
└── NAMESPACE
This file is a simple plain text file with the following structure
AssignMethods:
myAssingFunc1DS,
myAssingFunc2DS
AggregateMethods:
myAggregateFunc1DS,
myAggregateFunc2DS
Alternatively, we can also include this information into the DESCRIPTION
file of our server R package following the same formatting. Refer to dsBase
DESCRIPTION
file for an example.
Keeping this information on a separate file guarantees a cleaner DESCRIPTION
file, specially relevant for extense packages with many functions. Feel free to use any of the two methods at your own judgment.
The datasources
object
We have been talking about communicating between the client and the server and telling the server to run things. However, we need some object that contains the information about which server are we talking to. This oject is created by the login functions on DataSHIELD. Here a brief example
<- DSI::newDSLoginBuilder()
builder $append(server = "myServer",
builderurl = "https://myDSserver.org",
user = "ds_user", password = "password")
<- builder$build()
logindata <- DSI::datashield.login(logins = logindata) connections
This connection
object is what our DSI
functions will use to know to which servers to talk. For that reason, on the client functions that we write we will put an argument to pass it. Typically this argument is named datasources
and defaults to NULL
. It can default to NULL
because DSI
offers a helper function to search for this object on the client R environment and use it; nevertheless it is important that we can let the user specify which connection object they want to use, maybe someone creates two different connection on the same session and wants to use a certain one with different functions.
Here is how a typical client function starts and how to use this helper function from DSI
<- function(..., datasources = NULL){
ds.Function
if (is.null(datasources)) {
<- DSI::datashield.connections_find()
datasources
}
# Rest of the function
}
You can see that the function DSI::datashield.connections_find
will find the connection object on the environment and use it during the rest of the function call.
Exercise 1
- Create two R packages:
dsClient
anddsServer
Make your life easier by using usethis::create_package("dsClient")
- On the
dsClient
package create a function namedds.assignString
that takes as input argument a string, and sends it to the assign functionassignStringDS
from thedsServer
package.
Add #' @export
on top of your new R function to export it. Then, run devtools::document()
so it gets added to the NAMESPACE
file of your package automatically, this will allow you to library(dsClient)
and be able to call ds.assignString
.
On the
dsServer
package create a function namedassignStringDS
that takes as input argument a string, and returns that string.Declare the
assignStringDS
as a DataSHIELD assign function.