BinMat: Processes Binary Data Obtained from Fragment Analysis

Clarke van Steenderen

2022-03-05



Department of Zoology and Entomology

The Centre for Biological Control (https://www.ru.ac.za/centreforbiologicalcontrol/)

Rhodes University, Grahamstown, Eastern Cape, South Africa

e-mail: vsteenderen@gmail.com



The idea behind this package was taken from my M.Sc. project
A genetic analysis of the species and intraspecific lineages of Dactylopius Costa (Hemiptera:Dactylopiidae)
See the publication in the Biological Control journal, and the thesis on Research Gate. The GUI version of this package is available on the R Shiny online server, or it is accessible via GitHub by typing:

shiny::runGitHub("BinMat", "clarkevansteenderen")

into the console in R.



OVERVIEW

Processing and visualising trends in the binary data obtained from fragment analysis methods in molecular biology can be a time-consuming and cumbersome process, and typically entails complex workflows. The BinMat package automates the analysis pipeline on one platform, and was written to process binary matrices derived from dominant marker genetic analyses. The program consolidates replicate sample pairs in a dataset into consensus reads, produces summary statistics (peaks and error rates), and allows the user to visualise their data as non-metric multidimensional scaling (nMDS) plots and hierarchical clustering trees.


UPLOADING DATA

Data type 1

Binary data containing replicate pairs that need be consolidated should be in the following example format, uploaded as CSV file (use the read.csv() function):

Locus 1 Locus 2 Locus 3 Locus 4 Locus 5
Sample A replicate 1 0 0 1 1 1
Sample A replicate 2 0 0 1 1 1
Sample B replicate 1 1 1 0 0 0
Sample B replicate 2 0 1 0 0 1

Note that replicate pairs must be directly underneath each other, and that each sample needs to have a unique name.

The following conditions are applied to binary data replicates:

  1. A 0 and a 1 produces a “?”

  2. A 0 and a 0 produces a “0”

  3. A 1 and a 1 produces a “1”

The consolidated output for the above example would thus be:

Locus 1 Locus 2 Locus 3 Locus 4 Locus 5
Sample A replicate 1 & 2 0 0 1 1 1
Sample B replicate 1 & 2 ? 1 0 0 ?

Data type 2

Binary data that has already been consolidated must have grouping information in the second column, as shown in the example below. This should also be in CSV format. This matrix can be used to create an nMDS plot coloured by group.

Group Locus 1 Locus 2 Locus 3 Locus 4 Locus 5
Sample A Africa ? 0 1 1 1
Sample B Asia 0 0 1 1 ?
Sample C Europe 1 ? 0 0 0
Sample D USA ? 1 0 ? 1

FUNCTIONS

Function Details
check.data() Checks for unwanted values (other than 1, 0, and ?).
consolidate() Reads in a binary matrix comprising replicate pairs and consolidates each pair into a consensus read. For each replicate pair at each locus, 1 & 1 -> 1 (shared presence), 0 & 0 -> 0 (shared absence), 0 & 1 -> ? (ambiguity).
errors() Calculates the Jaccard and Euclidean error rates for the dataset. Jaccard’s error does not take shared absences of bands as being biologically meaningful. JE = (f10 + f01)/(f10 + f01 + f11) and EE = (f10 + f01)/(f10 + f01 + f11 + f00). At each locus, f01 and f10 indicates a case where a 0 was present in one replicate, and a 1 in the other. f11 indicates the shared presence of a band in both replicates, and f00 indicates a shared absence.
group.names() Returns group names in the uploaded consolidated binary data. This will help in knowing which colours are assigned to which group name.
nmds() Creates an nMDS plot from a consolidated binary matrix with grouping information. Colours of plotted points need to be specified.
peak.remove() Removes samples with a peak number less than a specified value.
peaks.consolidated() Returns total, maximum, and minimum number of peaks in a consolidated binary matrix.
peaks.original() Returns total, maximum, and minimum number of peaks in the binary matrix comprising replicates.
scree() Creates a scree plot for the nMDS. This indicates the optimum number of dimensions to use to minimise the stress value. The stress value is indicated by a red dotted line at 0.15. Values equal to or below this are considered acceptable.
shepard() Creates a Shepard plot for the nMDS. This indicates the ‘goodness of fit’ of the original distance matrix vs the ordination representation. A high R-squared value is favourable.
upgma() Creates a UPGMA hierarchical clustering tree, with a specified number of bootstrap repetitions.

METHODS

Euclidean error = (f01 + f10) / (f01 + f10 + f00 + f11)

The Euclidean error rate includes the shared absence of a band (f00).

Jaccard error = (f01 + f10) / (f01 + f10 + f11)

The Jaccard error rate does not take shared absences of bands into account. The error rate will thus be inflated compared to the Euclidean.

In the formulae above, ‘f’ refers to the frequency of each combination (i.e. ‘f01 + f10’ means the sum of all the occurences of a zero and a one, and a one and a zero). An error value is calculated for each replicate pair, and an average obtained representing the whole dataset.

The error rates for the example samples below would be:


Sample X rep 1: 0 1 1 0

Sample X rep 2: 1 0 1 0


Euclidean error = (1+1) / (1+1+1+1) = 2/4 = 0.5

Jaccard error = (1+1) / (1+1+1) = 2/3 = 0.67


Sample Y rep 1: 1 1 1 0

Sample Y rep 2: 1 1 0 0


Euclidean error = (1) / (1+1+2) = 1/4 = 0.25

Jaccard error = (1) / (1+2) = 1/3 = 0.33


Average Euclidean error = (0.5+0.25) / 2 = 0.38

Average Jaccard error = (0.67+0.33) / 2 = 0.5


WORKED EXAMPLE

There are four sample data sets embedded in the package:

  1. BinMatInput_reps
  2. BinMatInput_ordination
  3. bunias_orientalis
  4. nymphaea

The first two are small made-up datasets, and illustrate how the BinMat functions are implemented:

library(BinMat)

data1 = BinMatInput_reps
data2 = BinMatInput_ordination

# data1 contains all the replicate pairs that need to be consolidated into a consensus output
# data2 contains a consolidated binary matrix with grouping information in the second column

# Check the data for unwanted values
check.data(data1)
## None found.
# Get information about peak numbers for all replicates
peaks.original(data1)
##                Metric   Value
## 1 Average no. peaks:   8.7500
## 2                sd:   1.9086
## 3    Max. no. peaks:  12.0000
## 4    Min. no. peaks:   6.0000
## 5          No. loci:  14.0000
# Consolidate the replicate pairs in the matrix
cons = consolidate(data1)
# View the original matrix 
data1
##   Sample X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
## 1     A1  0  0  1  1  1  0  1  0  0   1   1   1   0   1
## 2     A2  0  1  1  1  1  0  1  0  1   1   1   1   0   1
## 3     B1  0  0  1  1  1  1  1  0  0   1   1   1   1   1
## 4     B2  1  1  1  1  1  0  1  1  1   1   1   1   0   1
## 5     C1  0  1  1  1  1  0  0  0  1   1   1   1   0   0
## 6     C2  1  0  1  1  0  0  0  1  0   1   1   0   0   0
## 7     D1  1  0  0  1  1  0  1  0  0   1   0   1   0   1
## 8     D2  1  1  1  1  1  0  1  0  1   1   0   1   0   0
# View the consolidated output
cons
##       X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
## A1+A2  0  ?  1  1  1  0  1  0  ?   1   1   1   0   1
## B1+B2  ?  ?  1  1  1  ?  1  ?  ?   1   1   1   ?   1
## C1+C2  ?  ?  1  1  ?  0  0  ?  ?   1   1   ?   0   0
## D1+D2  1  ?  ?  1  1  0  1  0  ?   1   0   1   0   ?
# Get the Jaccard and Euclidean error rates
errors(cons)
##                     Metric  Value
## 1 Average Euclidean Error: 0.3214
## 2 Euclidean error St. dev: 0.1368
## 3         Average Jaccard: 0.4071
## 4    Jaccard error St.dev: 0.1639
# Get information about the peak numbers in the consolidated matrix
peaks.consolidated(cons)
##                Metric   Value
## 1 Average no. peaks:   6.5000
## 2                sd:   1.9149
## 3    Max. no. peaks:   8.0000
## 4    Min. no. peaks:   4.0000
## 5          No. loci:  14.0000
# Create a hierarchical clustering tree using the UPGMA method
clustTree = upgma(cons, size = 0.6)
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.57)... Done.
## Bootstrap (r = 0.64)... Done.
## Bootstrap (r = 0.79)... Done.
## Bootstrap (r = 0.86)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.07)... Done.
## Bootstrap (r = 1.14)... Done.
## Bootstrap (r = 1.29)... Done.
## Bootstrap (r = 1.36)... Done.

# Find samples with peaks less than a specified threshold value, and return the new, filtered data set
filtered_data1 = peak.remove(cons, thresh = 6)
## Number of samples removed: 
## 2
## This was/these were: 
## C1+C2
## D1+D2
filtered_data1
##       Sample X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
## A1+A2      0  ?  1  1  1  0  1  0  ?   1   1   1   0   1
## B1+B2      ?  ?  1  1  1  ?  1  ?  ?   1   1   1   ?   1
# data2 contains an already-consolidated matrix with grouping information. This is used to create a scree, shepard, and an nMDS plot.

# # Find samples with peaks less than a specified threshold value, and return the new, filtered data set
filtered_data2 = peak.remove(data2, thresh = 7)
## Number of samples removed: 
## 4
## This was/these were: 
## 1
## 6
## 9
## 10
filtered_data2
##    Sample     Group X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
## 2       B    Africa  0  1  1  1  1  0  1  0  1   1   1   1   ?   1
## 3       C    Africa  0  ?  1  1  1  1  1  0  0   1   1   1   1   1
## 4       D    Africa  1  1  1  1  1  0  1  1  1   ?   1   1   0   1
## 5       E    Europe  ?  1  1  1  1  0  ?  0  1   1   1   1   0   0
## 7       G    Europe  1  0  0  1  1  0  1  0  0   1   0   1   0   1
## 8       H    Europe  1  1  1  1  1  ?  1  0  1   1   0   1   0   0
## 11      K Australia  0  1  1  1  1  0  0  0  1   1   1   1   0   1
## 12      L Australia  0  0  1  1  ?  0  0  1  1   1   1   ?   0   1
# Get the names of the groups specified in the second column
group.names(data2)
## [1] "Africa"    "Australia" "Europe"
# Create an object containing colours for each group
# Colours: Africa = red, Australia = blue, Europe = dark green
clrs = c("red", "blue", "darkgreen")
# Create a scree plot to check how the number of dimensions for an nMDS plot will affect the resulting stress values
scree(data2)
## initial  value 7.158898 
## iter   5 value 4.062583
## iter  10 value 3.951154
## iter  15 value 3.911452
## final  value 3.873089 
## converged
## initial  value 17.256889 
## iter   5 value 13.318548
## iter  10 value 10.281815
## iter  15 value 9.885853
## iter  20 value 9.230169
## iter  25 value 8.712386
## iter  30 value 8.413392
## final  value 8.314316 
## converged
## initial  value 23.145184 
## iter   5 value 15.519401
## iter  10 value 15.224397
## iter  10 value 15.216857
## iter  10 value 15.210646
## final  value 15.210646 
## converged
## initial  value 40.284721 
## iter   5 value 33.922426
## final  value 32.437297 
## converged

## NULL
# Create a shepard plot showing the goodness of fit for the original data vs the ordination data
shepard(data2)
## initial  value 23.145184 
## iter   5 value 15.519401
## iter  10 value 15.224397
## iter  10 value 15.216857
## iter  10 value 15.210646
## final  value 15.210646 
## converged
## NULL
# Create an nMDS plot for the data. Default dimension is 2
nmds(data2, colours = clrs, labs = TRUE)

## initial  value 23.145184 
## iter   5 value 15.519401
## iter  10 value 15.224397
## iter  10 value 15.216857
## iter  10 value 15.210646
## final  value 15.210646 
## converged

REAL-WORLD WORKED EXAMPLE

The bunias_orientalis and nymphaea datasets are real-world AFLP and ISSR files, respectively, and have already been consolidated by BinMat. These produce nMDS plots, as shown below:

Bunias orientalis

bunias = bunias_orientalis
group.names(bunias)
## [1] "exotic"   "invasive" "native"
nmds(bunias, labs = FALSE, include_ellipse = TRUE, legend_pos = "right")
## initial  value 13.797435 
## iter   5 value 11.860061
## iter  10 value 11.548672
## iter  10 value 11.537241
## final  value 11.507572 
## converged

Nymphaea