Sampling and DNA Amplification
We sampled 38 species of cetaceans from nine families; twenty-three species belonged to Delphinidae, a family that includes the highest relative brain sizes as measured by the encephalization quotient (EQ) . We also included members of Mysticeti (baleen whales) and Physeteridae (giant sperm whale) that represent taxa with the largest absolute brain sizes that have ever evolved [4, 9, 10]. All cetacean taxa included in this study are listed in Additional File 1: Table S1.
We designed primers for exon 8 of MCPH1 from an alignment of sequences from the 2.0 build of the Bos taurus (domestic cow) genome and genomic data for Tursiops truncatus available via Genbank. Exon 8 consists of the highly variable inter-BRCT domain sequence (IBS) and makes up approximately half of the coding region of MCPH1. The IBS region shows evidence of positive selection on the lineage leading to humans [16, 19]. A small portion of intron 7 and most of exon 8 (total amplicon = ~1237 basepairs [bp]) were PCR amplified using MCPH1INT7F1 (5' GCT TTA TCA CGT TAT GGG CGG AC 3') or MCPH1INT7F2 (5' GCT TTA TCA CGT TAT GGG CGG ACT G 3') in the forward direction and MCPH1EX8R1 (5' GAG AGA CCA GTA AAG GAG GTT CAC 3'), MCPH1EX8R2 (5' AGG AGG TTC ACA TAC TTT CAC TAC 3'), or MCPH1RSeq2 (5' CGG GAG AAA AGT AAT CAT CG 3') in the reverse direction. PCR products were sequenced using the above primers, as well as MCPH1F1 (5' AAA ACG AGA AGT GTC CGT CCG C 3'), MCPH1F2 (5' CCT GTC TGC TAC GCC ATC TGT AAC 3'), MCPH1FSeq3 (5' TTT CCA GGA GAG AGA GGA CC 3'), MCPH1R1 (5' TTT CCA CAT CCC AGT CGC CTA C 3'), and MCPH1RSeq1 (5' TCT CCT TGA GAT TAT CGG G 3'). PCR was performed using 1 μl template DNA, 100 pmol of each primer, 1X AccuPrime PCR Buffer I (Invitrogen), and 1 unit AccuPrime Taq DNA Polymerase High Fidelity (Invitrogen) in a 50 μl reaction. PCR conditions consisted of 45 cycles of 1 min denaturation at 94°C, 1 min annealing at 58°C, and 1 min elongation at 68°C. All new sequences were deposited in Genbank (accession numbers HQ873570-HQ873608).
Data Set Compilation and Alignment
Three data sets were assembled. To investigate selection pressure over the whole gene, the "whole-gene" data set was compiled using complete MCPH1 coding sequences downloaded via Ensembl (Homo, Pan, Pongo, Macaca, Callithrix, Tupaia, Rattus, Mus, Canis, Equus, Bos, and the delphinid cetacean, Tursiops). Species were selected due to phylogenetic position, completeness, and quality of available sequence. We also compiled a data set consisting of sequences derived from exon 8 and a small segment of intron 7 (the "exon 8" data set), including those sequenced here and sequences downloaded for a wider range of species, largely primates, from the Ensembl and Genbank databases. Genbank accession numbers for all downloaded sequences are listed in Additional File 1. The third data set ("reduced exon 8") consisted of a reduced set of taxa that excluded sequences that were less than 90% complete (e.g., Balaenoptera acutorostrata, Kogia sima) and sequences from species belonging to Monodontidae due to the presence of a stop codon near the end of the sequence (see Additional File 1). Sequences for all data sets were aligned using CLUSTAL W  with a gap-opening penalty of 10 and a gap-extension penalty of 1. Exonic indels were multiples of 3 bp and were adjusted by eye to reflect the open reading frame.
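The reading-frame criterion for exonic indels can be illustrated with a short Python check. This is only a sketch: the alignments here were adjusted by eye, and the function name and toy sequences are hypothetical. It tests gap length alone and does not verify that gaps fall on codon boundaries.

```python
import re

def gap_runs_preserve_frame(aligned_seq: str) -> bool:
    """Return True if every run of '-' has a length divisible by 3."""
    return all(len(run) % 3 == 0 for run in re.findall(r"-+", aligned_seq))

# Hypothetical aligned coding sequences (not data from this study).
in_frame = "ATGGCC---AAAGGG------TGA"
out_of_frame = "ATGGCC--AAAGGGTGA"

print(gap_runs_preserve_frame(in_frame))
print(gap_runs_preserve_frame(out_of_frame))
```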
Please see Additional File 1 for a full account of phylogenetic methods.
Tests for Positive Selection and Selective Constraints
Positive selection acting on the complete coding sequence of MCPH1 was examined by estimating ω (dN/dS, the ratio of the rate of nonsynonymous substitution to the rate of synonymous substitution) using the site models in the codeml program of PAML 4.0 . An unrooted species tree of Boreoeutheria [39–41] was used as an input tree with Rodentia, Primates + Scandentia, and Laurasiatheria positioned as a basal polytomy. Model M1a (nearly-neutral: ω0 < 1, ω1 = 1) was compared to M2a (positive selection: ω0 < 1, ω1 = 1, ω2 > 1) and M8a (nearly neutral; beta distribution: 0 < ω0 < 1 and ω1 = 1) was compared to model M8 (positive selection: beta distribution: 0 < ω0 < 1 and ω1 > 1) by performing likelihood ratio tests (LRTs) and assessing their significance using a χ² distribution (two degrees of freedom [df] for M1a vs. M2a; one df for M8 vs. M8a). For the comparison of M8 vs. M8a, we halved the p-value as suggested by . A Bayes Empirical Bayes (BEB) analysis was implemented to calculate posterior probabilities of positively selected sites using the M2a and M8 models as described in . An individual site was considered to have undergone positive selection (ω > 1) if the posterior probability was ≥0.95. Variation in ω among branches was examined using the free-ratio model, in which each branch of the tree was given a separate ω value. The fit of the free-ratio model was compared to model M0, in which all branches in the tree were assigned the same ω value, using the LRT with 20 degrees of freedom for the whole MCPH1 data set.
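The likelihood ratio tests described above can be sketched in Python as follows. This is an illustrative calculation, not the code used in the study: the function names are hypothetical, the log-likelihood values in the example are invented, and the chi-square tail probability is implemented only for df = 1 and even df, which covers the 1, 2, and 20 df tests mentioned here.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate (df = 1 or even df only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df > 0 and df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        half = x / 2.0
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch supports only df = 1 or even df")

def lrt_pvalue(lnl_null: float, lnl_alt: float, df: int, halve: bool = False):
    """Return (2*delta lnL, p-value); halve=True halves p as done for M8 vs. M8a."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p = chi2_sf(stat, df)
    return stat, (p / 2.0 if halve else p)

# Hypothetical log-likelihoods, for illustration only.
print(lrt_pvalue(-5000.0, -4995.0, df=2))              # M1a vs. M2a style test
print(lrt_pvalue(-5000.0, -4998.0, df=1, halve=True))  # M8a vs. M8 style test
```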
To further investigate selection on exon 8 of MCPH1, for which 38 cetacean species were sequenced (Additional File 1: Table S1), all intronic sequence was deleted, as well as sequence downstream of a stop codon found near the 3' end of exon 8 in monodontids (see Additional File 1). All "exon 8" analyses were implemented using a species tree and also a gene tree derived from phylogenetic analysis of the MCPH1 data. The species tree employed was a composite gathered from several sources [39–42]; the gene tree for PAML analyses was the optimal topology recovered by maximum likelihood (ML) analysis and was consistent with the 50% majority rule consensus of Bayesian trees. In both the species tree and the gene tree, Loxodonta, Euarchontoglires, and Laurasiatheria were treated as a basal trichotomy. The ratio ω was then estimated in the codeml program of PAML 4.0 , as described above for the complete coding sequence of MCPH1. Separate analyses were conducted using the site models for all mammals, cetaceans only, odontocetes only, delphinids only, mysticetes only, primates only, and all mammals excepting cetaceans and primates. By analyzing various subclades of mammals using the site models, the goal was to determine whether evidence for positive selection is a general feature of MCPH1 in mammals, or is instead restricted to only certain mammalian lineages. In addition, analyses using the species tree and gene tree were also conducted for the "reduced exon 8" data set (see above).
For the taxon-rich exon 8 alignment, ω was also estimated for individual branches and groups of branches . Several branch model analyses were conducted including a free-ratio model (all branches separate), two-ratio models in which one branch was given a separate ω (repeated for the branch leading to the last common ancestor [LCA] to each of the following groups: Cetacea, Mysticeti, Odontoceti, Delphinoidea, Delphinidae, as well as each of the terminal branches leading to Physeter and Orcinus), and two-ratio models in which one stem-based clade was given a separate ω (repeated for Cetacea, Mysticeti, Odontoceti, Delphinoidea, and Delphinidae). All models listed above were tested against a one-ratio model (M0) using LRTs and 1 df. Branch models with p ≤ 0.004 (0.05/13) after Bonferroni correction for multiple tests were interpreted as having a significantly different ω on the "foreground" branches of interest in comparison to the "background" ω on all remaining branches of the tree. Comparisons between background branches and foreground branch(es) at the base of lineages or within whole clades were conducted to test whether the pattern of selection on these branches was significantly distinct from the rest of the tree. Branches were selected for comparison due to their proposed relation to an evolutionary change in relative and/or absolute brain size. For example, in the case of Odontoceti and Delphinoidea, Marino et al.  proposed that these clades mark shifts associated with increases in relative brain size. Delphinidae was selected due to the high relative brain size of multiple species within the clade . The branches leading to Mysticeti, Physeter, and Orcinus were also tested because these lineages terminate at species with large absolute brain sizes [8, 9] (Figure 1, Additional File 1: Table S2). For all PAML branch models, both a species tree and the MCPH1 gene tree were used as input trees.
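The Bonferroni threshold quoted above can be reproduced with a few lines of Python. The tally of 13 tests is an assumption based on our reading of the text (one free-ratio model, seven single-branch models, and five clade models); the variable and function names are hypothetical.

```python
ALPHA = 0.05

# Assumed tally of branch-model tests, read from the description above.
FREE_RATIO_MODELS = 1
SINGLE_BRANCH_FOREGROUNDS = [
    "Cetacea", "Mysticeti", "Odontoceti", "Delphinoidea",
    "Delphinidae", "Physeter", "Orcinus",
]
CLADE_FOREGROUNDS = [
    "Cetacea", "Mysticeti", "Odontoceti", "Delphinoidea", "Delphinidae",
]

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

n_tests = FREE_RATIO_MODELS + len(SINGLE_BRANCH_FOREGROUNDS) + len(CLADE_FOREGROUNDS)
threshold = bonferroni_threshold(ALPHA, n_tests)

def is_significant(p_value: float) -> bool:
    """True if a branch-model LRT p-value passes the corrected threshold."""
    return p_value <= threshold

print(n_tests, threshold)
```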
Variation of MCPH1 within Cetacean Species
We recorded heterozygous sites for each cetacean MCPH1 sequence that was generated in this study. Sites in sequencing chromatograms that showed peaks of nearly equal height for two different bases at the same position were considered true heterozygous sites (i.e., due to divergent alleles that were PCR-amplified from that specimen). The codon position (first, second, or third) and the nature (synonymous or nonsynonymous, transition or transversion, etc.) of the change at each heterozygous site were noted, and this variation within species was compared to the pattern of nucleotide substitution in MCPH1 between cetacean species. For Tursiops truncatus and Delphinus capensis, we compared sequences derived from two individuals of each species and checked for intraspecific variation in MCPH1.
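The scoring of heterozygous sites can be illustrated with a minimal Python sketch. The function names are hypothetical, site indices are assumed to be 0-based and counted from the first base of the reading frame, and synonymous versus nonsynonymous classification (which requires a codon table) is not shown.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(base_a: str, base_b: str) -> str:
    """Classify a two-base difference as a transition or a transversion."""
    a, b = base_a.upper(), base_b.upper()
    if a == b or not {a, b} <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def codon_position(site_index: int, frame_start: int = 0) -> int:
    """Return 1, 2, or 3 for a 0-based site index within the reading frame."""
    return (site_index - frame_start) % 3 + 1

# Hypothetical heterozygous site: A/G at 0-based coding index 5.
print(substitution_type("A", "G"), codon_position(5))
```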
Analysis of Associations Between ω and Phenotype
To test prior hypotheses of association between ω and phenotype, we compiled data on absolute brain and body mass for 27 cetaceans for which we had molecular data; these traits vary widely among extant cetaceans (Figure 1). Absolute brain and body mass data came from multiple, previously published sources [8, 9, 43, 44]. For Platanista minor, we used measurements derived from the very close relative, Platanista gangetica [42, 43]. Because some body masses above came from individual animals and may not represent the full size range of a particular species, we also gathered data on maximum body size from species accounts in . In addition, a measure of relative brain mass, the encephalization quotient (EQ), was calculated for each species using a standard allometric equation of mammalian brain mass vs. body mass from : EQ = brain mass / [0.12 × (body mass)^0.67]. Absolute brain size, body size, maximum body size, and EQ were then log-transformed for statistical analysis (see below). Morphological data were deposited online in Supplemental Table S2.
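The EQ calculation and log transformation can be sketched in Python as shown below. The function name is hypothetical, masses are assumed to be in grams (the usual convention for this allometric equation, although units are not stated above), the choice of log base 10 is an assumption, and the example masses are invented rather than taken from Table S2.

```python
import math

def encephalization_quotient(brain_mass_g: float, body_mass_g: float) -> float:
    """EQ = brain mass / (0.12 * body mass ** 0.67), masses assumed in grams."""
    if brain_mass_g <= 0 or body_mass_g <= 0:
        raise ValueError("masses must be positive")
    return brain_mass_g / (0.12 * body_mass_g ** 0.67)

# Hypothetical values for illustration only.
brain_g, body_g = 1500.0, 200_000.0
eq = encephalization_quotient(brain_g, body_g)

# Log-transform each variable before regression (base 10 is an assumption).
log_values = {
    "brain": math.log10(brain_g),
    "body": math.log10(body_g),
    "EQ": math.log10(eq),
}
print(eq, log_values)
```

By construction, a species whose brain mass equals the allometric expectation for its body mass has EQ = 1.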
We used the method of  to assess associations between ω and various phenotypic variables: absolute brain mass, absolute body mass, maximum body mass derived from , and EQ. For each of the 27 cetacean species that were scored for phenotypic data (see above), we calculated the average "root-to-tip" ω along branches extending from the last common ancestor of Cetacea to each extant cetacean species in our dataset. This approach has the advantage of producing a measure of selection that takes the entire evolutionary history of a lineage from a common ancestor into account and which is a property of the species tips in a way that is more comparable with extant phenotypes. This procedure also negates the issue of temporal effects on ω . Root-to-tip ω values were estimated in PAML 4.0  using a 2-rate branch model and the species tree of . Phylogenetically controlled regression analyses of log-transformed root-to-tip ω versus each log-transformed morphological variable were performed using BayesTraits [46, 47] and the time-calibrated tree of  to explicitly test for gene-phenotype associations. The significance of the regression analyses was determined using a one-tailed t-test in the positive direction, in order to test the hypothesis that there is a positive relationship between estimated selection pressure on MCPH1 (ω) and the phenotypic variables as in . Each regression was performed across all cetaceans and just within the Odontoceti.
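One way to picture the root-to-tip approach is as a labelling problem: for each species, the branches on the path from the cetacean last common ancestor to that tip form the foreground of a two-ratio model. The Python sketch below extracts such a path from a tree stored as a child-to-parent map. This reading of the procedure, the function name, and the simplified toy topology are all assumptions made for illustration; the actual estimates were obtained in PAML.

```python
def root_to_tip_branches(parent: dict, tip: str, root: str) -> list:
    """Return the (ancestor, descendant) branches on the path from root to tip."""
    branches = []
    node = tip
    while node != root:
        if node not in parent:
            raise KeyError(f"{node!r} is not connected to root {root!r}")
        branches.append((parent[node], node))
        node = parent[node]
    branches.reverse()
    return branches

# Toy, highly simplified topology (child -> parent); not the tree used here.
toy_parent = {
    "Mysticeti": "Cetacea",
    "Odontoceti": "Cetacea",
    "Physeter": "Odontoceti",
    "Delphinidae": "Odontoceti",
    "Tursiops": "Delphinidae",
    "Orcinus": "Delphinidae",
}

for tip in ("Tursiops", "Physeter"):
    print(tip, root_to_tip_branches(toy_parent, tip, "Cetacea"))
```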
In addition to the EQ, we also explored two other methods of assessing relative brain size. In the first method, we calculated residuals from a regression between brain and body mass and used these in a subsequent regression analysis with root-to-tip ω. In the second method, we performed a multiple regression of brain and body mass with root-to-tip ω. Both of these approaches produced similar results to the regression using the EQ, and are not shown here.
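The residual-based measure of relative brain size can be sketched as follows. This example uses ordinary least squares on log-transformed masses and is not phylogenetically controlled, unlike the analyses described above; the function names and the example data are hypothetical.

```python
import math

def ols_fit(x: list, y: list) -> tuple:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def brain_residuals(log_body: list, log_brain: list) -> list:
    """Residual log brain mass after regressing on log body mass."""
    slope, intercept = ols_fit(log_body, log_brain)
    return [yb - (slope * xb + intercept) for xb, yb in zip(log_body, log_brain)]

# Hypothetical body and brain masses in grams, for illustration only.
body_g = [50_000.0, 200_000.0, 1_000_000.0, 30_000_000.0]
brain_g = [800.0, 1500.0, 2500.0, 6000.0]

log_body = [math.log10(m) for m in body_g]
log_brain = [math.log10(m) for m in brain_g]
print(brain_residuals(log_body, log_brain))
```

The resulting residuals would then serve as the relative brain size variable in a subsequent regression against root-to-tip ω.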