Wednesday, April 24, 2024

Pruning a mega-phylogeny to include a list of taxa in which the names don't match exactly

The other day a colleague asked me for advice on getting a reasonable time-calibrated phylogeny for a set of taxa to use for a comparative physiology exercise he was developing for a class. I suggested pulling a tree from the online resource VertLife.org and then subsampling it to include only the taxa for which he had data. If he would supply me with a list of the taxon names for his species, I told him, I’d be happy to help.

This is what I did.

## load packages
library(phytools)

We can start by reading the full DNA maximum clade credibility tree directly from the VertLife website as follows.

download.file(
  url="https://raw.githubusercontent.com/n8upham/MamPhy_v1/master/_DATA/MamPhy_fullPosterior_BDvr_DNAonly_4098sp_topoFree_NDexp_MCC_v2_target.tre",destfile="full_mammal_tree.nex")
mammal_tree<-read.nexus(file="full_mammal_tree.nex")
mammal_tree
## 
## Phylogenetic tree with 4100 tips and 4099 internal nodes.
## 
## Tip labels:
##   _Anolis_carolinensis, Ornithorhynchus_anatinus_ORNITHORHYNCHIDAE_MONOTREMATA, Zaglossus_bruijnii_TACHYGLOSSIDAE_MONOTREMATA, Tachyglossus_aculeatus_TACHYGLOSSIDAE_MONOTREMATA, Rhynchocyon_petersi_MACROSCELIDIDAE_MACROSCELIDEA, Rhynchocyon_cirnei_MACROSCELIDIDAE_MACROSCELIDEA, ...
## 
## Rooted; includes branch lengths.

Now, if we graph this tree, we’ll see that it’s ultrametric. It should be as it contains only extant species & is time-calibrated! Let’s drop the one non-mammal outgroup (Anolis carolinensis) and plot our tree to see this.

mammal_tree<-drop.tip(mammal_tree,"_Anolis_carolinensis")
plotTree(mammal_tree,ftype="off",lwd=1,type="arc",
  arc_height=0.05,part=0.995)

plot of chunk unnamed-chunk-5

Nonetheless, as is common in R for trees with this many taxa, our "phylo" object will not pass is.ultrametric. This is entirely a rounding artifact resulting from having written the tree to a text file! (But is destined to cause us problems later, I promise.)

is.ultrametric(mammal_tree)
## [1] FALSE

This is the one & only circumstance in which phytools::force.ultrametric should be used. force.ultrametric is just going to coerce our very-nearly-but-not-precisely ultrametric tree to be exactly ultrametric to the numerical precision of our machine. Let’s do it.

mammal_tree<-force.ultrametric(mammal_tree)
## ***************************************************************
## *                          Note:                              *
## *    force.ultrametric does not include a formal method to    *
## *    ultrametricize a tree & should only be used to coerce    *
## *   a phylogeny that fails is.ultrametric due to rounding --  *
## *    not as a substitute for formal rate-smoothing methods.   *
## ***************************************************************
is.ultrametric(mammal_tree)
## [1] TRUE

Excellent.

Next, let’s get our species names from file as follows.

spp_names<-read.csv(file="spp_names.csv",header=FALSE)
head(spp_names)

Let’s compare these names to a sample of names from our mammal tree.

head(mammal_tree$tip.label)
## [1] "Ornithorhynchus_anatinus_ORNITHORHYNCHIDAE_MONOTREMATA"
## [2] "Zaglossus_bruijnii_TACHYGLOSSIDAE_MONOTREMATA"         
## [3] "Tachyglossus_aculeatus_TACHYGLOSSIDAE_MONOTREMATA"     
## [4] "Rhynchocyon_petersi_MACROSCELIDIDAE_MACROSCELIDEA"     
## [5] "Rhynchocyon_cirnei_MACROSCELIDIDAE_MACROSCELIDEA"      
## [6] "Rhynchocyon_udzungwensis_MACROSCELIDIDAE_MACROSCELIDEA"

We can see that our species list names are encoded in “genus-space-specific epithet” format while the taxon labels in our tree are “genus-underscore-specific epithet-underscore-family-underscore-order”.

First, let’s pull out our species list names and replace the spaces with underscores as follows.

spp<-gsub(" ","_",spp_names$V1)
head(spp,20)
##  [1] "Acomys_kempi"              "Acomys_subspinosus"        "Aconaemys_fuscus"         
##  [4] "Alticola_albicaudus"       "Ammospermophilus_leucurus" "Anomalurus_beecrofti"     
##  [7] "Anomalurus_derbianus"      "Anomalurus_pelii"          "Anomalurus_pusillus"      
## [10] "Callosciurus_prevostii"    "Cavia_porcellus"           "Chinchilla_chinchilla"    
## [13] "Chinchilla_lanigera"       "Chionomys_nivalis"         "Coendou_mexicanus"        
## [16] "Coendou_pruinosus"         "Coendou_vestitus"          "Cricetus_cricetus"        
## [19] "Ctenomys_australis"        "Ctenomys_emilianus"

Now, let’s identify the tip numbers of each tip with an exactly matching species designation. This will be a majority, but not all, of the species in our list.

tip<-sapply(spp,grep,x=mammal_tree$tip.label)
head(tip)
## $Acomys_kempi
## integer(0)
## 
## $Acomys_subspinosus
## integer(0)
## 
## $Aconaemys_fuscus
## [1] 2681
## 
## $Alticola_albicaudus
## [1] 2799
## 
## $Ammospermophilus_leucurus
## [1] 2364
## 
## $Anomalurus_beecrofti
## [1] 3818

We can see, for example, that the species Alticola albicaudus is tip number 2,681 in our tree, but Acomys kempi and A. subspinosus are missing.

mm<-names(which(sapply(tip,length)==0))
mm
## [1] "Acomys_kempi"              "Acomys_subspinosus"        "Anomalurus_derbianus"     
## [4] "Anomalurus_pelii"          "Anomalurus_pusillus"       "Ctenomys_emilianus"       
## [7] "Heliosciurus_rufobrachium" "Steatomys_pratensis"       "Zenkerella_insignis"

This is a list of all the species in our list that don’t have an exact match in the full tree.

What to do next is a bit arbitrary – but what I decided to do is first try to identify all the labels in our tree belonging to a matching genus or our missing taxa. Then, if the genus contained more than one taxon, I attached our missing label to the common ancestor of each genus. (If the genus only had one representative in the full tree I attached our missing tip to the midway point of the terminal edge leading to that single species.) Lastly, and this only applied to one case, if a species belonged to a genus that was totally absent from the tree entirely, I simply dropped it. (OK – I could’ve figured out which family it belonged to and attached the species there instead, but my colleague didn’t supply that info so I’m going to be lazy & just leave it out.)

missing_genera<-sapply(strsplit(mm,"_"),function(x) x[1])
missing_genera
## [1] "Acomys"       "Acomys"       "Anomalurus"   "Anomalurus"   "Anomalurus"  
## [6] "Ctenomys"     "Heliosciurus" "Steatomys"    "Zenkerella"
missing<-vector()
for(i in 1:length(missing_genera)){
  tips<-grep(missing_genera[i],mammal_tree$tip.label)
  if(length(tips)>1){
    ca<-findMRCA(mammal_tree,
      tips=mammal_tree$tip.label[tips])
    subtree<-extract.clade(mammal_tree,ca)
    mammal_tree<-bind.tip(mammal_tree,tip.label=mm[i],
      where=ca)
  } else if(length(tips)==1){
    mammal_tree<-bind.tip(mammal_tree,tip.label=mm[i],
      where=tips,position=0.5*mammal_tree$edge.length[
        which(mammal_tree$edge[,2]==tips)])
    mammal_tree$tip.label[tips]
  } else {
    cat(paste("Sorry. No members of",missing_genera[i],
      "found in the tree.\n\n"))
    missing<-c(missing,mm[i])
  }
}
## Sorry. No members of Zenkerella found in the tree.
spp<-if(length(missing)>0) spp[-match(missing,spp)] else 
  spp

Here is our “enriched” tree.

mammal_tree
## 
## Phylogenetic tree with 4107 tips and 4099 internal nodes.
## 
## Tip labels:
##   Ornithorhynchus_anatinus_ORNITHORHYNCHIDAE_MONOTREMATA, Zaglossus_bruijnii_TACHYGLOSSIDAE_MONOTREMATA, Tachyglossus_aculeatus_TACHYGLOSSIDAE_MONOTREMATA, Solenodon_paradoxus_SOLENODONTIDAE_EULIPOTYPHLA, Uropsilus_investigator_TALPIDAE_EULIPOTYPHLA, Uropsilus_soricipes_TALPIDAE_EULIPOTYPHLA, ...
## 
## Rooted; includes branch lengths.

OK. Now let’s prune our full tree to include only the species in our list. To do that we also need to remember that our list and original tree file used different label conventions (as previously described), so we can use grep to permit partial matching.

pruned.mammal_tree<-keep.tip(mammal_tree,
  mammal_tree$tip.label[sapply(spp,grep,
  x=mammal_tree$tip.label)])
pruned.mammal_tree
## 
## Phylogenetic tree with 62 tips and 59 internal nodes.
## 
## Tip labels:
##   Heliosciurus_rufobrachium, Tamias_striatus_SCIURIDAE_RODENTIA, Marmota_monax_SCIURIDAE_RODENTIA, Urocitellus_richardsonii_SCIURIDAE_RODENTIA, Urocitellus_parryii_SCIURIDAE_RODENTIA, Ammospermophilus_leucurus_SCIURIDAE_RODENTIA, ...
## 
## Rooted; includes branch lengths.

Finally, let’s update our tip labels as follows.

tip_labels<-sapply(strsplit(pruned.mammal_tree$tip.label,"_"),
  function(x) paste(x[1:2],collapse="_"))
pruned.mammal_tree$tip.label<-tip_labels

How did we do?

plotTree(pruned.mammal_tree,ftype="i",fsize=0.7,
  direction="upwards",offset=1)

plot of chunk unnamed-chunk-18

Not bad.

Finally, here’s something for Twitter: a plot showing our “extracted” tree on top of the original phylogeny (but reduced to include only the descendants of the MRCA of the taxa in our list).

sub.mammal_tree<-extract.clade(mammal_tree,
  findMRCA(mammal_tree,
    tips=mammal_tree$tip.label[sapply(spp,grep,
      x=mammal_tree$tip.label)]))
sub.mammal_tree<-multi2di(sub.mammal_tree)
tips<-sapply(spp,grep,x=sub.mammal_tree$tip.label)
parents<-lapply(tips,phangorn::Ancestors,
  x=sub.mammal_tree,type="all")
nodes<-setdiff(c(tips,unique(unlist(parents))),
  Ntip(sub.mammal_tree)+1)
painted.mammal_tree<-paintBranches(sub.mammal_tree,
  nodes,state="1",anc.state="0")
cols<-setNames(c(make.transparent("grey",0.6),
  "transparent"),0:1)
plot(painted.mammal_tree,color=cols,ftype="off",
  direction="upwards",split.vertical=TRUE,lwd=1,
  lend=1)
cols<-setNames(c("transparent","black"),0:1)
plot(painted.mammal_tree,color=cols,ftype="off",
  direction="upwards",split.vertical=TRUE,lwd=3,
  lend=1,add=TRUE)

plot of chunk unnamed-chunk-19