Tutorial on State Change Location Reconstruction Based on ML ASR with Missing Data in R

R tutorial 5: State Change Location Reconstruction Based on ML ASR with Missing Data

This tutorial builds up on the tutorial 4, so if you have not seen that, please follow this link. There are several solutions to obtain the proportional probabilities of each state at each node. However, suppose, we are more interested to note the likely state changes, and want to reconstruct on which branches the state change took place, using a simplistic rule - if the most probable state of the ancestor is different from that of the descendant, then this is the branch where the state change occured. The focus of this tutorial is on making such reconstruction high throughput, processing multiple characters together, and ggtree package will be used.

Legend

Code block

Output block

Console command block

Tools needed

We would need the following packages:

package APE
package geiger
package phangorn
package ggplot2
package ggtree

And the following files:

phylogenetic tree in newick format
discrete data in nexus format

Read in the phylogeny and the data matrix

As in the previous tutorial, the phylogeny in this example is a bipartitions file output from RAxML. The data matrix is a nexus file, exported from Mesquite in simplified format. We can load the files like that if they are in the working directory:

library(ape)
library(geiger)

## Registered S3 method overwritten by 'geiger':
##   method            from
##   unique.multiPhylo ape

library(phangorn)

# load tree and data
tree1 <- ladderize(read.tree("treefile.tre"))
data_nex <- read.nexus.data("matrix.nex")

As in the previous tutorial, the matrix is reformatted, however we do not extract a particular character to work with, but rather keep the whole data frame together.

data_df <- as.data.frame(do.call(rbind, data_nex)) #reformat to data frame

data_df #check out the codings

##      V1 V2 V3
## t144  0  0  2
## t143  1  0  1
## t141  1  0  0
## t140  1  0  0
## t142  1  0  1
## t138  1  0  2
## t139  1  0  2
## t134  1  0  2
## t135  1  0  2
## t137  1  0  2
## t136  1  0  2
## t133  1  ?  0
## t132  1  0  0
## t131  1  0  0
## t22   1  0  0
## t23   1  1  0
## t24   1  1  0
## t27   1  0  0
## t26   1  0  0
## t25   1  0  0
## t6    1  0  0
## t5    1  0  0
## t4    1  0  0
## t3    1  1  0
## t2    1  0  0
## t1    1  1  0
## t8    1  0  0
## t21   1  0  0
## t17   1  0  0
## t18   1  0  0
## t20   1  0  0
## t19   1  0  0
## t14   1  0  0
## t15   1  0  0
## t16   1  0  0
## t10   1  0  0
## t9    1  0  0
## t11   1  0  0
## t12   1  0  0
## t13   1  0  0
## t7    1  0  0
## t130  0  0  1
## t28   1  1  0
## t30   1  1  0
## t29   1  0  0
## t31   1  0  0
## t33   1  0  0
## t32   1  1  0
## t43   1  ?  1
## t45   1  0  0
## t44   1  1  0
## t46   1  0  0
## t47   1  0  0
## t34   1  1  1
## t36   1  1  0
## t35   1  0  0
## t37   1  1  0
## t38   1  0  0
## t42   1  1  0
## t41   1  0  0
## t40   1  1  0
## t39   1  1  0
## t129  1  ?  1
## t127  1  ?  0
## t128  1  ?  1
## t111  1  ?  1
## t117  1  0  1
## t116  1  0  1
## t115  1  0  1
## t114  1  0  1
## t113  1  0  1
## t112  1  1  1
## t118  1  0  0
## t119  1  0  1
## t109  1  0  1
## t110  1  0  1
## t126  1  1  0
## t125  1  1  0
## t124  1  1  0
## t121  1  1  0
## t123  1  1  0
## t120  1  1  0
## t122  1  1  0
## t105  1  0  1
## t106  1  2  1
## t104  1  0  1
## t107  1  1  0
## t108  1  0  0
## t75   1  0  0
## t74   1  0  0
## t76   1  0  1
## t77   1  0  0
## t78   1  1  0
## t103  1  0  1
## t102  1  0  1
## t82   1  0  1
## t83   1  0  1
## t81   1  1  1
## t80   1  1  1
## t79   1  1  1
## t90   1  0  0
## t87   1  0  0
## t88   1  0  0
## t89   1  0  0
## t84   1  0  0
## t85   1  0  0
## t86   1  0  0
## t101  1  1  0
## t94   1  0  0
## t95   1  0  0
## t97   1  0  0
## t96   1  0  0
## t100  1  0  0
## t99   1  0  0
## t98   1  0  0
## t91   1  0  0
## t93   1  1  0
## t92   1  1  0
## t48   1  0  0
## t49   1  1  0
## t54   1  0  0
## t53   1  0  0
## t50   1  0  0
## t52   1  0  0
## t51   1  0  0
## t59   1  0  0
## t58   1  1  0
## t57   1  0  0
## t55   1  1  0
## t56   1  1  0
## t60   1  0  0
## t61   1  1  0
## t62   1  0  0
## t64   1  0  0
## t63   1  0  0
## t69   1  0  0
## t70   1  0  0
## t72   1  0  0
## t71   1  0  0
## t73   1  0  0
## t65   1  0  0
## t66   1  0  0
## t67   1  0  0
## t68   1  0  0

As you can see, character 1 has all taxa coded for states 0 and 1, character 2 has states 0, 1, and 2, plus some taxa have missing data in a form of question marks, and character 3 has three states without missing data.

ASR

First, I will convert the missing data states to NA as in the previous tutorial, note that this is an implicit conversion:

numeric_coding_mx <- apply(data_df, c(1,2), as.numeric)

## Warning in apply(data_df, c(1, 2), as.numeric): NAs introduced by coercion

## Warning in apply(data_df, c(1, 2), as.numeric): NAs introduced by coercion

## Warning in apply(data_df, c(1, 2), as.numeric): NAs introduced by coercion

## Warning in apply(data_df, c(1, 2), as.numeric): NAs introduced by coercion

## Warning in apply(data_df, c(1, 2), as.numeric): NAs introduced by coercion

## Warning in apply(data_df, c(1, 2), as.numeric): NAs introduced by coercion

Basically anything not transformable to numeric will become NA. This solution may not work if you have more that 10 states (A, B, C etc.), then a different solution is needed, perhaps states with “-” or “?” be replaced by NA explicitly. Now, we process the matrix in the loop, character by character, first doing the ASR assuming there are some missing data in all characters, and then converting this data to locations of steps. See the comments in the code for more details

#create a list to store proportional probabilities data
ancestral_states <- list()
#create a list to store character steps
character_state_steps <- list()
#run loop by the number of characters in the matrix
for (char in 1:length(numeric_coding_mx[1,]) ){
  #ASR part
  #extract a particular character from the matrix AND resort the rows to match the tree
  #NOTE: arranging taxa in matrix to be in the same order as in the tree is important!
  #failure to do so make cause lots of confusion later on
  numeric_coding <- numeric_coding_mx[,char][match(tree1$tip.label,rownames(numeric_coding_mx))]
  # prune the taxa with NA coding from the tree
  pruned_tree <- drop.tip(tree1,tree1$tip.label[is.na(numeric_coding)])
  # prune the taxa with NA from the vector with codings
  pruned_coding <- numeric_coding[!is.na(numeric_coding)]
  # perform your fav ASR method, here I do ace() for simplicity
  # Outputs of other ASR functions may need to be adjusted before working with this code
  asr <- ace(pruned_coding,pruned_tree,"discrete",marginal=F)
  # now we create a matrix of ASR data for the original tree
  # The matrix is filled by NA (hence rep(NA) there in the code), and we will then populate good nodes with non NA values.
  props_original <- matrix(rep(NA,tree1$Nnode*length(levels(as.factor(numeric_coding)))),ncol = length(levels(as.factor(numeric_coding))))
  # now in this loop we go over the ace() output (asr$lik.anc) and put it in corresponding rows of the props_original.
  # Node correspondence between the original and the prunned tree is checked by getMRCA function.
  for (n in 1:length(asr$lik.anc[,1])){
    props_original[getMRCA(tree1, tips(pruned_tree, n+length(pruned_tree$tip.label))) - length(tree1$tip.label),] <- asr$lik.anc[n,]
  }
  #finally we add this data as an item to our ancestral states list
  ancestral_states[[char]] <- props_original
  
  #character steps reconstruction part
  #create an empty vector to store steps in the pruned tree
  steps_pre <- rep(NA,pruned_tree$Nnode+length(pruned_tree$tip.label))
  #this is the pruned tree root - first node after the last tip
  temproot <- length(pruned_tree$tip.label)+1
  #get most likely state per node - this is the important part
  #if you want a different step threshold, this part would need to be adjusted
  best_node_states <- apply(asr$lik.anc,1,which.max)-1
  #combine the tip data with the most likely node data to get a state for each node (incl tips)
  best_total_states <- c(pruned_coding, best_node_states) 
  #now go through the nodes to detect steps
  for (x in 1:length(best_total_states)){
    if(x == temproot){
      #this is the root node - normally no steps occur here, so this could be replaced with
      #next()
      #however, I want to vizualize states at root too, so here I will use most likely states at root
      #as reconstructed by ace()
      steps_pre[x] <- best_total_states[x]
      #it's also possible that both kids of the root will have different most likely state than the root
      #in that case you can imagine both branches from the root having some steps,
      #even though it's technically a single branch
      #for this problem, one would need to write additional piece of code to correct this
      #by supplying the outgroup clade index and moving states steps around
    } else{
      #get the parent state using phangorn Ancestors function
      elem0 <- best_total_states[Ancestors(pruned_tree, x, type = "parent")]
      #this is the kid state
      elem1 <- best_total_states[x]
      #if the kid is different from the parent we record the kid state as the step state
      if (elem1 != elem0){
        steps_pre[x] <- elem1
      }
    }
  }
  #as this is a generic solution and we worked with a pruned tree
  #we now need to copy the state steps onto appropriate branches of the original tree
  #this will hold the original tree char steps data
  step_braches <- rep(NA,length(tree1$tip.label)+tree1$Nnode)
  #do a loop over pruned tree steps
  for (b1 in 1:length(best_total_states)){
    #this is an internal node - use getMRCA to obtain the correspondence
    if (b1>length(pruned_tree$tip.label)){
      step_braches[getMRCA(tree1, tips(pruned_tree, b1))] <- steps_pre[b1]
    } else { #this is a tip - use grep to obtain the correspondence
      step_braches[grep(tips(pruned_tree, b1),tree1$tip.label)] <- steps_pre[b1]
    }
  }
  #finally we add this data as an item to our character steps list
  character_state_steps[[char]] <- step_braches
}

Visualizing the steps using ggtree

Below is the code for the character steps vizualization. See the comments in the code for details. (Disclaimer: package ggtree and functions for insets have been changing much lately, so the code is not guaranteed to work, check with the current package manual)

library(ggplot2)
library(ggtree)

## Registered S3 method overwritten by 'treeio':
##   method     from
##   root.phylo ape

## ggtree v2.0.1  For help: https://yulab-smu.github.io/treedata-book/
## 
## If you use ggtree in published research, please cite the most appropriate paper(s):
## 
## [36m-[39m Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi: 10.1093/molbev/msy194
## [36m-[39m Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628

## 
## Attaching package: 'ggtree'

## The following object is masked from 'package:ape':
## 
##     rotate

#in addition to steps we will need to obtain the number of steps per branch,
#so for branches with multiple steps vizualized we can space them nicely
character_state_steps_counts <- rapply(character_state_steps, function(x) ifelse(!(is.na(x)), 1 ,0), how = "replace")
state_change_per_branch <- Reduce(`+`,character_state_steps_counts)
#this is the state spacing horizontally when multiple on same branch
statespacing <- 0.0075
#this is bar thickness
#we will be using geom_tile for this although many other options exist
thickness <- 0.005

#plot the tree and the tips
p <- ggtree(tree1, ladderize=F) + geom_tiplab(size=2)

#now iterate over characters to add steps on the figure
for (char in 1:3){
  #since the way I'm storing the steps data is a bit awkward and not something ggplot would want
  #i create a data frame for each character and then use it to plot the bars
  #otherwise there are issues with iteration in the for loop...
  chardata <- data.frame(node = 1:287, branch = p$data$branch, charnum = rep(char, 287), y = p$data$y, charOffset = Reduce(`+`,character_state_steps_counts[1:char])*statespacing-(state_change_per_branch/2)*statespacing-(statespacing/2), charChange = character_state_steps[[char]])
  #if we plot chardata as is we will get all NA states to show up as well
  #instead we remove NA
  chardata <- chardata[!is.na(chardata$charChange),]
  #now actual plotting
  #we add three layes to plot at each iteration
  #tile plot with states
  #character label on top
  #state label on the bottom
  p <- p + geom_tile(data = chardata, aes(x=branch+charOffset, fill=as.character(charChange)),width = thickness, height=2, size=3) + geom_label(data = chardata, aes( x=branch+charOffset, y=y+2, label=charnum), size=2, label.padding = unit(0.1, "lines")) + geom_label(data = chardata, aes( x=branch+charOffset, y=y-2, label=charChange), size=2,label.padding = unit(0.1, "lines")) + theme(legend.position = "none")
}
p

State changes look a bit unreadable on some of the shorter branches. This can be made a bit better by sacrificing the branch length data and showing a cladogram instead. It entails changing branch.length parameter of ggtree, as well as adjusting width of bars and spacing - tree length istead of being ~1 would proportional to the number of splits.

Now let’s check that this step reconstruction makes sence for one of the characters (character 3) by plotting pie charts and tip data along

#we are checking character 3
chartocheck <- 3
#add the tip states as points to the previous plot
p <- p + geom_tippoint(aes(fill=c(as.character(numeric_coding_mx[,chartocheck][match(tree1$tip.label,rownames(numeric_coding_mx))]), rep(NA,tree1$Nnode))),shape=21)
#reformat data for pie charts #IMPORTANT: remember to use props_original instead of asr$lik.anc here
pie_data <- as.data.frame(cbind(ancestral_states[[chartocheck]], node=length(tree1$tip.label)+(1:tree1$Nnode)))
#create pie charts
pies <- nodepie(pie_data, 1:length(ancestral_states[[chartocheck]][1,]), c("red","green","blue"), alpha=0.25)
#add pie charts on the main plot
inset(p, pies,width = 0.08, height = 0.08)

## Loading required package: ggimage

Looking at transitions of character 3, they seem to make sense given the pie charts and the tip color points. It would be nice to be able to vizualize proportional probabilities of states and tips states of multiple character on the same figure. This will be the focus of the next tutorial