---
title: "exprDesign"
author: "Lluís Revilla Sancho"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_caption: true
    code_folding: show
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{exprDesign}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r knitsetup, message=FALSE, warning=FALSE, include=FALSE}
knitr::opts_knit$set(root.dir = ".")
knitr::opts_chunk$set(collapse = TRUE, warning = TRUE)
set.seed(445)
library("experDesign")
```

# Introduction

This package was developed to help prepare some samples to be send to a 
facility. It assumes that you have collected the samples and the information 
but you still need to do the experiment in several batches due to technical or 
practical limitations. The question that tries to answer is: 

> Which samples go with each batch?

Of all the possible combinations of samples, it looks for the combination which 
minimizes the differences between each subgroup according to the following rules:

 - If the variable is categorical it tires to randomize the variable across the subgroups.
 - If the variable is numeric it tries to distribute evenly following the original distribution of values.
 - If there are `NA` (not available values) it looks to distribute them randomly.

Even with this measures you might end up with some batch effect due to:
 - Confounding variables not provided for their randomization on the batches
  Sometimes due to being unknown, impossible to measure.
 - Lack of [replicates](https://en.wikipedia.org/wiki/Replication_(statistics))(samples with the same conditions)
  If you can't provide new replicates, aim to provide more technical replicates.
  Technical replicates mean reanalyzing the same sample twice or more, the more samples with technical replicates the more accurate your measures will be and easier to avoid or detect batch effects.
 - Processing
  If there is a change on the methodology, or you pause and resume later the sample collection there might be changes on the outcome due to external factors. 

# Previous work

Before building this package I would like to give credit to those that made 
also efforts in this direction:


The CRAN task View of [Experimental Design](https://CRAN.R-project.org/view=ExperimentalDesign) includes many packages relevant for designing an experiment before collecting data, but none of them provides how to manage them once the samples are already collected.

Two packages allow to distribute the samples on batches:

- The [OSAT](https://bioconductor.org/packages/release/bioc/html/OSAT.html) package handles categorical 
variables but not numeric data. It doesn't work with our data.

 - The [minDiff](https://github.com/m-Py/minDiff) package reported in [Stats.SE](https://stats.stackexchange.com/a/326015/105234), handles both 
numeric and categorical data. But it can only optimize for two nominal criteria.
It doesn't work for our data.

 - The [Omixer](https://bioconductor.org/packages/Omixer/) package handles both 
numeric and categorical data (converting categorical variables to numeric). But both the same way either Pearson's Chi-squared Test if there are few samples or Kendall's correlation. It does allow to protect some spots from being used.

If you are still designing the experiment and do not have collected any data [DeclareDesign](https://cran.r-project.org/package=DeclareDesign) might be relevant for you.

Question in [Bioinformatics.SE](https://bioinformatics.stackexchange.com/q/4765/48) I made before developing the package.


# Design of Experiment {#DoE}

Imagine you have some samples already collected and you want to distributed them in batches:
```{r experDesign_setup}
library("experDesign")
metadata <- expand.grid(height = seq(60, 80, 5), 
                        weight = seq(100, 300, 50),
                        sex = c("Male","Female"))
head(metadata, 15)
```

First we can check the data to see if there are any concerns regarding the categories:

```{r check_data}
check_data(metadata)
```

There are none confounding variables in this artificial dataset (see the examples, and you'll find some).
However, if you block incorrectly and end up with a group in a single batch we will end up with batch effect.

In order to avoid this `design()` helps you assign each sample to a batch (in this case each batch has 24 samples at most). 
First we can explore the number of samples and the number of batches:

```{r size}
size_data <- nrow(metadata)
size_batch <- 24
(batches <- optimum_batches(size_data, size_batch))
# So now the best number of samples for each batch is less than the available
(size <- optimum_subset(size_data, batches))
# The distribution of samples per batch
sizes_batches(size_data, size, batches)
```

Note that instead of using a whole batch and then leave a single sample on the third distributes all the samples in the three batches that will be needed.

# Randomization

We can directly look for the distribution of the samples given our max number of samples per batch:

```{r design}
desi <- design(metadata, size_batch)
# It is a list but we can convert it to a vector with:
batch_names(desi)
```

Naively one would either fill some batches fully or distribute them not evenly 
(the  first 17 packages together, the next 17 and so on). This solution ensures 
that the data is randomized. For more random distribution you can increase the number of iterations performed to calculate this distribution.

# Randomization and replicates

If you need space for replicates to control for batch effect you can use:

```{r replicates}
repli <- replicates(metadata, size_batch, 5)
lengths(repli)
repli
```

Which seeks as controls the most diverse values and adds them to the samples 
distribution. Note that if the sample is already present on that batch is not added again, that's why the number of samples per batch is different from the design without replicates.

# Layout

We can analyze how these samples would be distributed in a layout of 6x4:

```{r spatial}
spati <- spatial(repli, metadata, rows = LETTERS[1:6], columns = 1:4)
head(spati)
```

# Report for easy on field usage

We can add the batches to the initial data with `inspect()`:

```{r report}
report <- inspect(repli, metadata)
report2 <- inspect(spati, report, index_name = "position")
head(report2)
```

And now we can see the batch and position of each sample


# Compare indices

If you have two indices you can also compare them:

```{r compare_index}
desi2 <- create_subset(nrow(metadata), size_batch)
compare_index(metadata, desi, desi2)
```

As many of the variables for multiple subsets are negative it shows that `desi` is better.

# Unbalanced setting

In the previous case the data was mostly balanced (check it out in the `orig` object) 
but let's create an unbalanced dataset to check it.

```{r unbalanced}
n <- 99
samples <- 100
unbalanced <- data.frame(Classroom = rep(c("A", "B"), each = samples/2),
                         Sex = c(rep("M", n), rep("F", samples-n)),
                         Age = rnorm(samples, mean = 25, sd = 3))
table(unbalanced[, 1:2])
```

In this dataset there is only one female, resulting in a classroom full of males.
Age is independent of the sex or classroom.

```{r unbalanced_design}
i <- design(unbalanced, 15)
evaluation <- evaluate_index(i, unbalanced)
# Mean entropy en each subset
rowMeans(evaluation["entropy", , ])
# Original entropy on the dataset
evaluate_orig(unbalanced)["entropy", ]
# Dispersion of the entropy
apply(evaluation["entropy", , ], 1, sd)
```

We can see that in this simple case where a single variable has all the other cases we approximately reached the same entropy levels. 

# Quality check

If you need a subset with the samples that are more diverse you can use the 
following function:

```{r QC}
data(survey, package = "MASS") 
head(survey)
samples <- extreme_cases(survey, size = 10)
survey[samples, ]
```

You can also test a given index with `check_index()`:

```{r check_index}
check_index(unbalanced, i)
```

Each row has information about how accurate is a given variable to the samples available (on this case `unbalanced`). 
Some variables are distributed more randomly than others on this index.

If we are not satisfied we could `design()` a new index increasing the iterations to obtain a potentially better distribution. 
If you want a good stratified randomization you should increase the iterations used 10 fold. 

# Internals


We could check all the combinations to select those that allow us to do this comparison.
But as this would be too long with `experDesign` we can try to find the combination with the best design by comparing each combination with the original according to multiple statistics.

```{r internal, fig.show='hold'}
# To reduce the variables used:
omit <- c("Wr.Hnd", "NW.Hnd", "Fold", "Pulse", "Clap", "Exer", "Height", "M.I")
(keep <- colnames(survey)[!colnames(survey) %in% omit])
head(survey[, keep])

# Set a seed for reproducibility
# Looking for groups at most of 70 samples.
index <-  create_subset(nrow(survey), size_subset = 70)
index
```

We can measure how does this index does:

```{r check_index1}
score_index1 <- check_index(survey[, keep], index)
score_index1
```
These values come from calculating the difference between the original data and the samples for each subset for the median, mean, mad, NA, entropy and independence (`chisq.test()` p.value).

On themselves these values have no meaning, but internally they are used to compare with other possible index:

```{r design2, warning=FALSE}
index2 <- create_subset(nrow(survey), size_subset = 70)
```

Then it compares to the previous index

```{r check_index2}
score_index2 <- check_index(survey[, keep], index2)
sum(rowMeans(abs(score_index2-score_index1)))
```

If this score is lower than the previous one the new index is kept.

This is done similarly for the spatial search, which adds two new categorical variables with the position of the samples before calculating any statistics.

# SessionInfo

```{r sessioninfo}
sessionInfo()
```