Department of Zoology and Entomology
The Centre for Biological Control (https://www.ru.ac.za/centreforbiologicalcontrol/)
Rhodes University, Grahamstown, Eastern Cape, South Africa
e-mail: vsteenderen@gmail.com
The idea behind this package was taken from my M.Sc. project
“A genetic analysis of the species and intraspecific lineages of Dactylopius Costa (Hemiptera:Dactylopiidae)”
See the publication in the Biological Control journal, and the thesis on Research Gate. The GUI version of this package is available on the R Shiny online server, or it is accessible via GitHub by typing:
shiny::runGitHub("BinMat", "clarkevansteenderen")
into the console in R.
Processing and visualising trends in the binary data obtained from fragment analysis methods in molecular biology can be a time-consuming and cumbersome process, and typically entails complex workflows. The BinMat package automates the analysis pipeline on one platform, and was written to process binary matrices derived from dominant marker genetic analyses. The program consolidates replicate sample pairs in a dataset into consensus reads, produces summary statistics (peaks and error rates), and allows the user to visualise their data as non-metric multidimensional scaling (nMDS) plots and hierarchical clustering trees.
Data type 1
Binary data containing replicate pairs that need be consolidated should be in the following example format, uploaded as CSV file (use the read.csv() function):
Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 | |
---|---|---|---|---|---|
Sample A replicate 1 | 0 | 0 | 1 | 1 | 1 |
Sample A replicate 2 | 0 | 0 | 1 | 1 | 1 |
Sample B replicate 1 | 1 | 1 | 0 | 0 | 0 |
Sample B replicate 2 | 0 | 1 | 0 | 0 | 1 |
Note that replicate pairs must be directly underneath each other, and that each sample needs to have a unique name.
The following conditions are applied to binary data replicates:
A 0 and a 1 produces a “?”
A 0 and a 0 produces a “0”
A 1 and a 1 produces a “1”
The consolidated output for the above example would thus be:
Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 | |
---|---|---|---|---|---|
Sample A replicate 1 & 2 | 0 | 0 | 1 | 1 | 1 |
Sample B replicate 1 & 2 | ? | 1 | 0 | 0 | ? |
Data type 2
Binary data that has already been consolidated must have grouping information in the second column, as shown in the example below. This should also be in CSV format. This matrix can be used to create an nMDS plot coloured by group.
Group | Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 | |
---|---|---|---|---|---|---|
Sample A | Africa | ? | 0 | 1 | 1 | 1 |
Sample B | Asia | 0 | 0 | 1 | 1 | ? |
Sample C | Europe | 1 | ? | 0 | 0 | 0 |
Sample D | USA | ? | 1 | 0 | ? | 1 |
Function | Details |
---|---|
check.data() | Checks for unwanted values (other than 1, 0, and ?). |
consolidate() | Reads in a binary matrix comprising replicate pairs and consolidates each pair into a consensus read. For each replicate pair at each locus, 1 & 1 -> 1 (shared presence), 0 & 0 -> 0 (shared absence), 0 & 1 -> ? (ambiguity). |
errors() | Calculates the Jaccard and Euclidean error rates for the dataset. Jaccard’s error does not take shared absences of bands as being biologically meaningful. JE = (f10 + f01)/(f10 + f01 + f11) and EE = (f10 + f01)/(f10 + f01 + f11 + f00). At each locus, f01 and f10 indicates a case where a 0 was present in one replicate, and a 1 in the other. f11 indicates the shared presence of a band in both replicates, and f00 indicates a shared absence. |
group.names() | Returns group names in the uploaded consolidated binary data. This will help in knowing which colours are assigned to which group name. |
nmds() | Creates an nMDS plot from a consolidated binary matrix with grouping information. Colours of plotted points need to be specified. |
peak.remove() | Removes samples with a peak number less than a specified value. |
peaks.consolidated() | Returns total, maximum, and minimum number of peaks in a consolidated binary matrix. |
peaks.original() | Returns total, maximum, and minimum number of peaks in the binary matrix comprising replicates. |
scree() | Creates a scree plot for the nMDS. This indicates the optimum number of dimensions to use to minimise the stress value. The stress value is indicated by a red dotted line at 0.15. Values equal to or below this are considered acceptable. |
shepard() | Creates a Shepard plot for the nMDS. This indicates the ‘goodness of fit’ of the original distance matrix vs the ordination representation. A high R-squared value is favourable. |
upgma() | Creates a UPGMA hierarchical clustering tree, with a specified number of bootstrap repetitions. |
Euclidean error = (f01 + f10) / (f01 + f10 + f00 + f11)
The Euclidean error rate includes the shared absence of a band (f00).
Jaccard error = (f01 + f10) / (f01 + f10 + f11)
The Jaccard error rate does not take shared absences of bands into account. The error rate will thus be inflated compared to the Euclidean.
In the formulae above, ‘f’ refers to the frequency of each combination (i.e. ‘f01 + f10’ means the sum of all the occurences of a zero and a one, and a one and a zero). An error value is calculated for each replicate pair, and an average obtained representing the whole dataset.
The error rates for the example samples below would be:
Sample X rep 1: 0 1 1 0
Sample X rep 2: 1 0 1 0
Euclidean error = (1+1) / (1+1+1+1) = 2/4 = 0.5
Jaccard error = (1+1) / (1+1+1) = 2/3 = 0.67
Sample Y rep 1: 1 1 1 0
Sample Y rep 2: 1 1 0 0
Euclidean error = (1) / (1+1+2) = 1/4 = 0.25
Jaccard error = (1) / (1+2) = 1/3 = 0.33
Average Euclidean error = (0.5+0.25) / 2 = 0.38
Average Jaccard error = (0.67+0.33) / 2 = 0.5
There are four sample data sets embedded in the package:
The first two are small made-up datasets, and illustrate how the BinMat functions are implemented:
library(BinMat)
data1 = BinMatInput_reps
data2 = BinMatInput_ordination
# data1 contains all the replicate pairs that need to be consolidated into a consensus output
# data2 contains a consolidated binary matrix with grouping information in the second column
# Check the data for unwanted values
check.data(data1)
## None found.
## Metric Value
## 1 Average no. peaks: 8.7500
## 2 sd: 1.9086
## 3 Max. no. peaks: 12.0000
## 4 Min. no. peaks: 6.0000
## 5 No. loci: 14.0000
# Consolidate the replicate pairs in the matrix
cons = consolidate(data1)
# View the original matrix
data1
## Sample X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
## 1 A1 0 0 1 1 1 0 1 0 0 1 1 1 0 1
## 2 A2 0 1 1 1 1 0 1 0 1 1 1 1 0 1
## 3 B1 0 0 1 1 1 1 1 0 0 1 1 1 1 1
## 4 B2 1 1 1 1 1 0 1 1 1 1 1 1 0 1
## 5 C1 0 1 1 1 1 0 0 0 1 1 1 1 0 0
## 6 C2 1 0 1 1 0 0 0 1 0 1 1 0 0 0
## 7 D1 1 0 0 1 1 0 1 0 0 1 0 1 0 1
## 8 D2 1 1 1 1 1 0 1 0 1 1 0 1 0 0
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
## A1+A2 0 ? 1 1 1 0 1 0 ? 1 1 1 0 1
## B1+B2 ? ? 1 1 1 ? 1 ? ? 1 1 1 ? 1
## C1+C2 ? ? 1 1 ? 0 0 ? ? 1 1 ? 0 0
## D1+D2 1 ? ? 1 1 0 1 0 ? 1 0 1 0 ?
## Metric Value
## 1 Average Euclidean Error: 0.3214
## 2 Euclidean error St. dev: 0.1368
## 3 Average Jaccard: 0.4071
## 4 Jaccard error St.dev: 0.1639
## Metric Value
## 1 Average no. peaks: 6.5000
## 2 sd: 1.9149
## 3 Max. no. peaks: 8.0000
## 4 Min. no. peaks: 4.0000
## 5 No. loci: 14.0000
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.57)... Done.
## Bootstrap (r = 0.64)... Done.
## Bootstrap (r = 0.79)... Done.
## Bootstrap (r = 0.86)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.07)... Done.
## Bootstrap (r = 1.14)... Done.
## Bootstrap (r = 1.29)... Done.
## Bootstrap (r = 1.36)... Done.
# Find samples with peaks less than a specified threshold value, and return the new, filtered data set
filtered_data1 = peak.remove(cons, thresh = 6)
## Number of samples removed:
## 2
## This was/these were:
## C1+C2
## D1+D2
## Sample X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
## A1+A2 0 ? 1 1 1 0 1 0 ? 1 1 1 0 1
## B1+B2 ? ? 1 1 1 ? 1 ? ? 1 1 1 ? 1
# data2 contains an already-consolidated matrix with grouping information. This is used to create a scree, shepard, and an nMDS plot.
# # Find samples with peaks less than a specified threshold value, and return the new, filtered data set
filtered_data2 = peak.remove(data2, thresh = 7)
## Number of samples removed:
## 4
## This was/these were:
## 1
## 6
## 9
## 10
## Sample Group X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
## 2 B Africa 0 1 1 1 1 0 1 0 1 1 1 1 ? 1
## 3 C Africa 0 ? 1 1 1 1 1 0 0 1 1 1 1 1
## 4 D Africa 1 1 1 1 1 0 1 1 1 ? 1 1 0 1
## 5 E Europe ? 1 1 1 1 0 ? 0 1 1 1 1 0 0
## 7 G Europe 1 0 0 1 1 0 1 0 0 1 0 1 0 1
## 8 H Europe 1 1 1 1 1 ? 1 0 1 1 0 1 0 0
## 11 K Australia 0 1 1 1 1 0 0 0 1 1 1 1 0 1
## 12 L Australia 0 0 1 1 ? 0 0 1 1 1 1 ? 0 1
## [1] "Africa" "Australia" "Europe"
# Create an object containing colours for each group
# Colours: Africa = red, Australia = blue, Europe = dark green
clrs = c("red", "blue", "darkgreen")
# Create a scree plot to check how the number of dimensions for an nMDS plot will affect the resulting stress values
scree(data2)
## initial value 7.158898
## iter 5 value 4.062583
## iter 10 value 3.951154
## iter 15 value 3.911452
## final value 3.873089
## converged
## initial value 17.256889
## iter 5 value 13.318548
## iter 10 value 10.281815
## iter 15 value 9.885853
## iter 20 value 9.230169
## iter 25 value 8.712386
## iter 30 value 8.413392
## final value 8.314316
## converged
## initial value 23.145184
## iter 5 value 15.519401
## iter 10 value 15.224397
## iter 10 value 15.216857
## iter 10 value 15.210646
## final value 15.210646
## converged
## initial value 40.284721
## iter 5 value 33.922426
## final value 32.437297
## converged
## NULL
# Create a shepard plot showing the goodness of fit for the original data vs the ordination data
shepard(data2)
## initial value 23.145184
## iter 5 value 15.519401
## iter 10 value 15.224397
## iter 10 value 15.216857
## iter 10 value 15.210646
## final value 15.210646
## converged
## NULL
## initial value 23.145184
## iter 5 value 15.519401
## iter 10 value 15.224397
## iter 10 value 15.216857
## iter 10 value 15.210646
## final value 15.210646
## converged
The bunias_orientalis and nymphaea datasets are real-world AFLP and ISSR files, respectively, and have already been consolidated by BinMat. These produce nMDS plots, as shown below:
## [1] "exotic" "invasive" "native"
## initial value 13.797435
## iter 5 value 11.860061
## iter 10 value 11.548672
## iter 10 value 11.537241
## final value 11.507572
## converged