Monday, September 3, 2012

Bug fix for nodeHeights and method to get the tip heights for very large trees

A while ago, and for a different project, I programmed a function (nodeHeights) that returns a matrix containing the height above the root for all the nodes indexed in tree$edge. This may not have been my original purpose, but I now call nodeHeights from within a number of different phytools functions. In addition, although it was designed primarily as an internal utility function, nodeHeights is also exported to the NAMESPACE and can thus be called by any R user running phytools.

The problem is that I realized today that nodeHeights relies implicitly on a specific ordering of the edges in the tree - specifically, the "cladewise" order. (For more on the different orderings of the edges of a "phylo" object, see help(reorder.phylo).) Luckily, "cladewise" is the most common edge order in R (creating by functions that read and simulate trees). In addition, a different ordering will generally cause a fatal error (so this would be obvious).

I have now fixed this problem in nodeHeights. I did this by adding basically two lines of code to the function.

First, I added a line to call reorder.phylo. With this function we reorder the tree into "cladewise" order, so the original code of the function will work:

if(attr(tree,"order")!="cladewise"||is.null(attr(tree,"order")))
   t<-reorder(tree)
else t<-tree


Next, after we have computed node heights, we want to reorder our matrix into the edge order of our original input tree. We can do this by matching the second column of $edge of both trees (since every edge only has one daughter on a non-reticulate tree).

if(attr(tree,"order")!="cladewise"||is.null(attr(tree,"order")))
   o<-apply(matrix(tree$edge[,2]),1,function(x,y)
      which(x==y),y=t$edge[,2])
else o<-1:nrow(t$edge)
return(X[o,])


The fixed version of this function is here, and I have also built a new version of phytools which is not on CRAN but can be downloaded and installed from source (here).

There was also a recent R-sig-phylo thread in which a user inquired as to how one could obtain the height above the root for all the tip taxa in a tree. Both Simon Blomberg & I responded that this could be done easily with h<-diag(vcv.phylo(tree)). The only problem with that is that for very large trees (e.g., >10000 tips) there may not be enough system memory (or enough memory allocated to R) to first build the N × N matrix produced by vcv.phylo so that we can compute its diagonal. Below, is a solution which uses nodeHeights which is actually slower than vcv.phylo, but will work on trees for which a call to vcv.phylo would cause a memory allocation failure.

tipHeights<-function(tree){
    H<-nodeHeights(tree)
    n<-length(tree$tip)
    x<-H[tree$edge[,2]%in%1:n,2]
    names(x)<-tree$tip[tree$edge[tree$edge[,2]%in%1:n,2]]
    return(x)
}


We can try them out:

> t5000<-rtree(n=5000)
> a<-tipHeights(t5000)
> b<-diag(vcv(t5000))[names(a)] # same order
> all(a==b)
[1] TRUE
> t10000<-rtree(n=10000)
> a<-tipHeights(t10000)
> b<-diag(vcv(t10000))
Error in diag(vcv(t10000)) :
error in evaluating the argument 'x' in selecting a method for function 'diag': Error: cannot allocate vector of size 762.9 Mb


(Obviously, in this case I could increase my memory allocation to R, but this will only help me to a point. For 32-bit R in Windows, this point is 4GB.)

That's it!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.