Monday, February 14, 2011

Reading a phylogeny from Newick (cont.)

In the last couple of posts, I have discussed a couple of basics regarding the input and storage in memory of phylogenetic trees in R. In particular, in the first post, I described the structure of the {ape} "phylo" object; and in the second, I described a standard algorithm (not of my own invention, but originally conveyed to me by Luke Harmon) for parsing a string in Newick format into a tree consisting of nodes and edges. The ultimate goal is not to read Newick trees into R (which read.tree() does quite well already), but to use this function as the basis for reading in other information from file - such as (for instance) the stochastic character map of a binary or multistate character.

Now our task is to translate the algorithm described in Saturday's post into computer code which will create the structure described in Friday's post. I have done this, and today I posted this function to my R-phylogenetics development page here; however, I remind the reader that this function is really provided for development/experimentation purposes only as it does nothing that is not already implemented in the {ape} function read.tree().

Ok, the first thing we need to do is figure out how many internal nodes and tips we have in the tree. We'd like to do this a priori, because that way we have two important pieces of information before starting: 1) how many rows to put in our matrix $edge; and (more importantly) 2) what number to assign to the root. The way I have done this is by taking one pass through the Newick string and counting the number of commas and the number of right brackets. It is my belief [readers, please correct me if this is wrong!] that the number of commas + 1 will always correspond to the number of tips in the tree; and that the number of right brackets will always correspond to the number of internal nodes.

Once we have figured out how many tips and internal nodes there is, it is straightforward to count the number of edges in the tree. This will just be the sum of the number of internal nodes + the number of tips - 1. The logic for this is as follows: every tip and internal node have an edge leading to them, except for the root node in the tree (hence the -1).

Now with these preliminaries completed, we can proceed to our algorithm for parsing the Newick string. Here, I start with a "while(tree[i]!=";")" loop that will terminate when the end character of our Newick style tree (";") is reached. Inside this loop, we can just program our character-by-character parser to build the tree based on the algorithm of my previous post and animation. I programmed this function more or less as illustrated, with a couple of caveats: 1) I never actually visit terminal nodes (I just create them as daughters, but then never update the status of "current node" to the daughter); and 2) in progressing backwards through the tree in response to character ")", the structure of the "phylo" object prevents easy backward movement through the tree. For this, I used the {base} function match() to match my current node's parent to the row number of prior nodes in the matrix edge.

Once all the elements of the tree are created, I just join them together in a named list using the function list(); and then I assign this list the class "phylo". Finally, I return the list to the R environment - and we're done!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.