my GitHub Web Site
Despite using matrices with incomplete codings is common in phylogenetics, only Mesquite natively allows to perform and display ancestral state reconstruction for incompletely coded discrete characters. Sadly, I yet have to find an R package that supports such feature (check out Tutorial 2 for a basic ASR in R). In the mean time, here is how you can quickly work around this problem.
|
|
|
We would need the following packages:
And the following files:
Phylogeny in this example is a bipartitions file output from RAxML. The data matrix is a nexus file, exported from Mesquite in simplified format. We can load the files like that if they are in the working directory:
library(ape)
library(geiger)
## Registered S3 method overwritten by 'geiger':
## method from
## unique.multiPhylo ape
# load tree and data
tree1 <- ladderize(read.tree("treefile.tre"))
data_nex <- read.nexus.data("matrix.nex")
Matrix was read into a list of vectors each corresponding to row of the original matrix (taxon). To be able to easily access a column of the matrix, I reformat it to data frame. The matrix contains multiple characters. Typically, a single character is reconstructed and vizualized at a time. As in the tutorial 2, I will extract a single character, with incomplete coding (number 42), from my matrix.
data_df <- as.data.frame(do.call(rbind, data_nex)) #reformat to data frame
char42 <- data_df[ ,42] #subdivide single character (42nd)
char42 #check out the codings
## t144 t143 t141 t140 t142 t138 t139 t134 t135 t137 t136 t133 t132 t131 t22 t23
## - - 0 0 0 0 0 0 0 0 0 - 0 0 1 1
## t24 t27 t26 t25 t6 t5 t4 t3 t2 t1 t8 t21 t17 t18 t20 t19
## 1 1 1 1 - - - - - - 0 0 0 0 0 0
## t14 t15 t16 t10 t9 t11 t12 t13 t7 t130 t28 t30 t29 t31 t33 t32
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t43 t45 t44 t46 t47 t34 t36 t35 t37 t38 t42 t41 t40 t39 t129 t127
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t128 t111 t117 t116 t115 t114 t113 t112 t118 t119 t109 t110 t126 t125 t124 t121
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t123 t120 t122 t105 t106 t104 t107 t108 t75 t74 t76 t77 t78 t103 t102 t82
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t83 t81 t80 t79 t90 t87 t88 t89 t84 t85 t86 t101 t94 t95 t97 t96
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t100 t99 t98 t91 t93 t92 t48 t49 t54 t53 t50 t52 t51 t59 t58 t57
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t55 t56 t60 t61 t62 t64 t63 t69 t70 t72 t71 t73 t65 t66 t67 t68
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: - 0 1
First, we need to detect taxa with missing states. The following solution works to distinguish numerical coding from any other type (such as “-” or “?”):
numeric_coding <- as.numeric(as.character(char42[match(tree1$tip.label,names(char42))]))
## Warning: NAs introduced by coercion
names(numeric_coding) <- names(char42[match(tree1$tip.label,names(char42))])
Basically anything not transformable to numeric will become NA
. This solution may not work if you have more that 10 states (A, B, C etc.), then a different solution is needed, perhaps states with “-” or “?” be replaced by NA
. Now, if there are any NA
, we work with the following code:
# prune the taxa with NA coding from the tree
pruned_tree <- drop.tip(tree1,tree1$tip.label[is.na(numeric_coding)])
# prune the taxa with NA from the vector with codings
pruned_coding <- numeric_coding[!is.na(numeric_coding)]
# perform your fav ASR method, here I do ace() for simplicity. Outputs of other ASR functions may need to be adjusted before working with this code
asr <- ace(pruned_coding,pruned_tree,"discrete",marginal=F)
# now we create a matrix of ASR proportional likelihoods for the original tree.
# The matrix is filled by NA (hence rep(NA) there in the code), and we will then populate good nodes with non NA values.
props_original <- matrix(rep(NA,tree1$Nnode*length(levels(as.factor(numeric_coding)))),ncol = length(levels(as.factor(numeric_coding))))
# now in this loop we go over the ace() output (asr$lik.anc) and put it in corresponding rows of the props_original.
# Node correspondence between the original and the prunned tree is checked by getMRCA function.
for (n in 1:length(asr$lik.anc[,1])){
props_original[getMRCA(tree1, tips(pruned_tree, n+length(pruned_tree$tip.label))) - length(tree1$tip.label),] <- asr$lik.anc[n,]
}
From now on instead of asr$lik.anc
we need to use props_original
for our pie data. That’s it! Here are the vizualization examples as in the tutorial 2. Few changes because of incomplete coding are highlighted in the comments.
First I will demonstrate how to plot this data using regular APE toolkit. I plot a phylogeny itself, then set the colors of tip labels and annotate tip labels to show tip character states. Finally the pie charts are plotted, using the same colors.
par(mar=c(0,0,0,0)) #reduce white space around the plot
plot(tree1, cex=0.5) #draw the tree, with tiny tip labels
#set up colors for tip taxa
colors1 <- rep("gray", length(tree1$tip.label)) #IMPORTANT: gray here is a base color and will be used for non-coded taxa. Coded will be replaced with a different color in the next couple lines
colors1[char42=="0"] <- "red"
colors1[char42=="1"] <- "blue"
#plot tip circles
tiplabels(rep("",144), #plot circles for tip states
match(names(char42), tree1$tip.label),
bg=colors1,
frame = "circle",
cex=0.2)
par(lwd=0.5) #change global line width settings to narrow
#plot pie charts #IMPORTANT: remember to use props_original instead of asr$lik.anc here
nodelabels(pie = props_original, cex=0.3, piecol = c("red","blue"))
We can generate a similar figure using ggtree. However, data input needs to be formatted a bit different. So first I reformat data, then I create a ggplot object with the tree, data frame attached to it, tip labels (those are plotted separately in ggtree), points annotating tip character states, and color set up. Finally, pies are created using ggtree nodepie
function, and those are then inserted onto the previously created ggplot object. (Disclaimer: package ggtree
and functions for insets have been changing much lately, so the code is not guaranteed to work, check with the current package manual)
library(ggplot2)
library(ggtree)
## Registered S3 method overwritten by 'treeio':
## method from
## root.phylo ape
## ggtree v2.0.1 For help: https://yulab-smu.github.io/treedata-book/
##
## If you use ggtree in published research, please cite the most appropriate paper(s):
##
## [36m-[39m Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi: 10.1093/molbev/msy194
## [36m-[39m Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:ape':
##
## rotate
#reformat data for tip character state annotation #IMPORTANT: because of the dashes in the coding slightly different transformation of data_df$V42 is needed as shown
df1 <- as.data.frame(cbind(taxa=rownames(data_df), state=as.numeric(as.character(data_df$V42))), stringsAsFactors = F)
## Warning in cbind(taxa = rownames(data_df), state =
## as.numeric(as.character(data_df$V42))): NAs introduced by coercion
#create plot with tree, tip labels, and tip states
#IMPORTANT: since the codings in the tip dataframe are NA, scale_fill_manual is parametrized with the color for NA. This reflects it in the legend automatically
p <- ggtree(tree1, ladderize=F) %<+% df1 + geom_tiplab(size=2) + geom_tippoint(aes(fill=state),shape=21) +
scale_fill_manual(values = c("red","blue"), na.value="gray")
#reformat data for pie charts #IMPORTANT: remember to use props_original instead of asr$lik.anc here
pie_data <- as.data.frame(cbind(props_original, node=144+(1:143)))
#create pie charts
pies <- nodepie(pie_data, 1:2, c("red","blue"))
#plot pie charts on the main plot
inset(p, pies,width = 0.08, height = 0.08)
## Loading required package: ggimage
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).