Tutorial on ML Ancestral State Reconstruction and Vizualization with Missing Data in R

R tutorial 4: Ancestral State Reconstruction (ML) and Visualization with Missing Data

Despite using matrices with incomplete codings is common in phylogenetics, only Mesquite natively allows to perform and display ancestral state reconstruction for incompletely coded discrete characters. Sadly, I yet have to find an R package that supports such feature (check out Tutorial 2 for a basic ASR in R). In the mean time, here is how you can quickly work around this problem.

Legend

Code block

Output block

Console command block

Tools needed

We would need the following packages:

package APE
package geiger
package ggplot2 (optional)
package ggtree (optional)

And the following files:

phylogenetic tree in newick format
discrete data in nexus format

Read in the phylogeny and the data matrix

Phylogeny in this example is a bipartitions file output from RAxML. The data matrix is a nexus file, exported from Mesquite in simplified format. We can load the files like that if they are in the working directory:

library(ape)
library(geiger)

## Registered S3 method overwritten by 'geiger':
##   method            from
##   unique.multiPhylo ape

# load tree and data
tree1 <- ladderize(read.tree("treefile.tre"))
data_nex <- read.nexus.data("matrix.nex")

Matrix was read into a list of vectors each corresponding to row of the original matrix (taxon). To be able to easily access a column of the matrix, I reformat it to data frame. The matrix contains multiple characters. Typically, a single character is reconstructed and vizualized at a time. As in the tutorial 2, I will extract a single character, with incomplete coding (number 42), from my matrix.

data_df <- as.data.frame(do.call(rbind, data_nex)) #reformat to data frame
char42 <- data_df[ ,42] #subdivide single character (42nd)
char42 #check out the codings

## t144 t143 t141 t140 t142 t138 t139 t134 t135 t137 t136 t133 t132 t131  t22  t23 
##    -    -    0    0    0    0    0    0    0    0    0    -    0    0    1    1 
##  t24  t27  t26  t25   t6   t5   t4   t3   t2   t1   t8  t21  t17  t18  t20  t19 
##    1    1    1    1    -    -    -    -    -    -    0    0    0    0    0    0 
##  t14  t15  t16  t10   t9  t11  t12  t13   t7 t130  t28  t30  t29  t31  t33  t32 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
##  t43  t45  t44  t46  t47  t34  t36  t35  t37  t38  t42  t41  t40  t39 t129 t127 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
## t128 t111 t117 t116 t115 t114 t113 t112 t118 t119 t109 t110 t126 t125 t124 t121 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
## t123 t120 t122 t105 t106 t104 t107 t108  t75  t74  t76  t77  t78 t103 t102  t82 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
##  t83  t81  t80  t79  t90  t87  t88  t89  t84  t85  t86 t101  t94  t95  t97  t96 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
## t100  t99  t98  t91  t93  t92  t48  t49  t54  t53  t50  t52  t51  t59  t58  t57 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
##  t55  t56  t60  t61  t62  t64  t63  t69  t70  t72  t71  t73  t65  t66  t67  t68 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
## Levels: - 0 1

ASR

First, we need to detect taxa with missing states. The following solution works to distinguish numerical coding from any other type (such as “-” or “?”):

numeric_coding <- as.numeric(as.character(char42[match(tree1$tip.label,names(char42))]))

## Warning: NAs introduced by coercion

names(numeric_coding) <- names(char42[match(tree1$tip.label,names(char42))])

Basically anything not transformable to numeric will become NA. This solution may not work if you have more that 10 states (A, B, C etc.), then a different solution is needed, perhaps states with “-” or “?” be replaced by NA. Now, if there are any NA, we work with the following code:

# prune the taxa with NA coding from the tree
pruned_tree <- drop.tip(tree1,tree1$tip.label[is.na(numeric_coding)])
# prune the taxa with NA from the vector with codings
pruned_coding <- numeric_coding[!is.na(numeric_coding)]
# perform your fav ASR method, here I do ace() for simplicity. Outputs of other ASR functions may need to be adjusted before working with this code
asr <- ace(pruned_coding,pruned_tree,"discrete",marginal=F)
# now we create a matrix of ASR proportional likelihoods for the original tree.
# The matrix is filled by NA (hence rep(NA) there in the code), and we will then populate good nodes with non NA values.
props_original <- matrix(rep(NA,tree1$Nnode*length(levels(as.factor(numeric_coding)))),ncol = length(levels(as.factor(numeric_coding))))
# now in this loop we go over the ace() output (asr$lik.anc) and put it in corresponding rows of the props_original.
# Node correspondence between the original and the prunned tree is checked by getMRCA function.
for (n in 1:length(asr$lik.anc[,1])){
  props_original[getMRCA(tree1, tips(pruned_tree, n+length(pruned_tree$tip.label))) - length(tree1$tip.label),] <- asr$lik.anc[n,]
}

From now on instead of asr$lik.anc we need to use props_original for our pie data. That’s it! Here are the vizualization examples as in the tutorial 2. Few changes because of incomplete coding are highlighted in the comments.

Visualizing the ASR using pie charts

First I will demonstrate how to plot this data using regular APE toolkit. I plot a phylogeny itself, then set the colors of tip labels and annotate tip labels to show tip character states. Finally the pie charts are plotted, using the same colors.

par(mar=c(0,0,0,0)) #reduce white space around the plot
plot(tree1, cex=0.5) #draw the tree, with tiny tip labels
#set up colors for tip taxa
colors1 <- rep("gray", length(tree1$tip.label)) #IMPORTANT: gray here is a base color and will be used for non-coded taxa. Coded will be replaced with a different color in the next couple lines
colors1[char42=="0"] <- "red"
colors1[char42=="1"] <- "blue"
#plot tip circles
tiplabels(rep("",144), #plot circles for tip states
          match(names(char42), tree1$tip.label),
          bg=colors1,
          frame = "circle",
          cex=0.2)
par(lwd=0.5) #change global line width settings to narrow
#plot pie charts #IMPORTANT: remember to use props_original instead of asr$lik.anc here
nodelabels(pie = props_original, cex=0.3, piecol = c("red","blue"))

We can generate a similar figure using ggtree. However, data input needs to be formatted a bit different. So first I reformat data, then I create a ggplot object with the tree, data frame attached to it, tip labels (those are plotted separately in ggtree), points annotating tip character states, and color set up. Finally, pies are created using ggtree nodepie function, and those are then inserted onto the previously created ggplot object. (Disclaimer: package ggtree and functions for insets have been changing much lately, so the code is not guaranteed to work, check with the current package manual)

library(ggplot2)
library(ggtree)

## Registered S3 method overwritten by 'treeio':
##   method     from
##   root.phylo ape

## ggtree v2.0.1  For help: https://yulab-smu.github.io/treedata-book/
## 
## If you use ggtree in published research, please cite the most appropriate paper(s):
## 
## [36m-[39m Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi: 10.1093/molbev/msy194
## [36m-[39m Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628

## 
## Attaching package: 'ggtree'

## The following object is masked from 'package:ape':
## 
##     rotate

#reformat data for tip character state annotation  #IMPORTANT: because of the dashes in the coding slightly different transformation of data_df$V42 is needed as shown
df1 <- as.data.frame(cbind(taxa=rownames(data_df), state=as.numeric(as.character(data_df$V42))), stringsAsFactors = F)

## Warning in cbind(taxa = rownames(data_df), state =
## as.numeric(as.character(data_df$V42))): NAs introduced by coercion

#create plot with tree, tip labels, and tip states
#IMPORTANT: since the codings in the tip dataframe are NA, scale_fill_manual is parametrized with the color for NA. This reflects it in the legend automatically
p <- ggtree(tree1, ladderize=F) %<+%  df1 + geom_tiplab(size=2) + geom_tippoint(aes(fill=state),shape=21) +
  scale_fill_manual(values = c("red","blue"), na.value="gray")
#reformat data for pie charts #IMPORTANT: remember to use props_original instead of asr$lik.anc here
pie_data <- as.data.frame(cbind(props_original, node=144+(1:143)))
#create pie charts
pies <- nodepie(pie_data, 1:2, c("red","blue"))
#plot pie charts on the main plot
inset(p, pies,width = 0.08, height = 0.08)

## Loading required package: ggimage

## Warning: Removed 2 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).