Strains of Neurospora intermedia used in the study
Two strains of Neurospora intermedia, FGSC 8782 and FGSC 8882, were used in this study. FGSC 8782 is of mating-type a and was originally collected in Homestead, Florida, U.S.A., while FGSC 8882 is of mating-type A, and was collected in Puerto Cortes, Honduras. Both strains belong to the phylogenetic subgroup NiA of N. intermedia. The two strains are highly inter-fertile and originate from a population that shows reinforced reproductive isolation with sympatric strains of N. crassa. In addition, we used representatives of twelve heterothallic (self-sterile) species and subspecies (Additional file 2: Table S2). All strains were obtained from the Fungal Genetics Stock Center (FGSC, University of Missouri, MO, USA).
Preparation of vegetative and sexual tissue samples of Neurospora intermedia
Cultures of N. intermedia for microarray analysis and EST sequencing were grown in 90-mm Petri dishes on solid synthetic crossing medium (SCM)  with 2% sucrose, covered by a layer of sterilized cellophane membrane, and in test tubes with Vogel’s minimal medium  with 1.5% sucrose. Unless otherwise specified, cultures were grown at 25°C.
The sexual sample was obtained by collecting tissue from reciprocal crosses between the two isolates. The strain to be the perithecial (maternal) parent of each cross was incubated on SCM until the Petri dish contained numerous protoperithecia (the unfertilized female reproductive structures; 11 days for FGSC 8782 and 13 days for FGSC 8882). For the production of conidia (the male fertilizing unit), each of the strains was grown on Vogel’s media for 3 days at 37°C. The conidia were collected with a spatula and suspended in water. The conidial concentration of the suspension was estimated by a hemacytometer and adjusted to 500,000 conidia per mL. During fertilization, 0.5 mL of the suspension (about 250,000 conidia) was spread with a sterile glass spreader onto the Petri dish containing the maternal tissue of the opposite mating-type. Fertilization was performed by transferring conidia from the mat a strain (8782) to the Petri dish containing maternal tissue of the mat A strain (8882), and vice versa, to produce two reciprocal crosses. After fertilization, the crosses were incubated in darkness until harvest. The sexual tissue was sampled at five different time points: 3, 12, 24, 36, and 48 hours after the conidial suspension was added. This time series had previously been verified by microscopy to represent tissue at the stage of development until karyogamy (i.e., presence of croziers ) under our growth conditions. At harvest, the tissue was scraped off the cellophane with a scalpel and collected in 1.5 ml Eppendorf tubes. The developing perithecia of the harvested tissue were not separated from the surrounding hyphae, and thus, the sexual tissue contained a background level of vegetative tissue. Each tube was immediately snap frozen after harvest in liquid nitrogen and stored at −70°C. Samples (RNA or tissue, see below) from the 5 time points from the two strains were pooled in equal amount and constitute our sexual sample in subsequent microarray and EST-analyses.
The vegetative tissue of each strain was grown for three days on SCM in constant darkness. At the time of harvest, the mycelia were inspected under a dissecting microscope to verify absence of protoperithecia and conidia in the culture. Thus, the same media and culture conditions were used for sexual and vegetative tissues, and the developmental stage of the tissue was the only difference between them. The same method for harvesting was used as with the sexual tissue. A pooled sample of RNA or mycelia of the two strains was used as the vegetative sample in subsequent microarray and EST-analyses, respectively (see below).
RNA and cDNA processing
We followed Clark et al.  for RNA and cDNA processing for the microarray study. Briefly, we extracted total RNA from 50–100 mg portions of the tissue using the TRI REAGENT kit (Molecular Research Center, Inc. Cincinnati, OH). The tissue was homogenized by grinding in a 7 mL Dounce glass tissue grinder and by using Qiagen Qiashredder columns (Qiagen, Chatsworth, CA). After extraction, equal amounts (μg) of the total RNA from the sexual tissue of all five time-points from both strains were pooled together to constitute the sexual sample. For the vegetative sample, total RNA from mycelia of each strain was pooled in equal amounts. MRC oligo(dT)-Cellulose columns (Molecular Research Center, Inc. Cincinnati, OH) were used for poly(A) + mRNA purification. Samples for each hybridization were independently subjected to reverse transcription using Superscript II reverse transcriptase (Invitrogen), 0.5 μg oligo(dT) primers (Invitrogen), and 2 μg poly-A mRNA.
The cDNA was coupled with Cy3 or Cy5 labeled probes (Amersham Biosciences, Uppsala, Sweden), then purified using the QIAquick PCR purification kit (Qiagen). Four competitive hybridizations, including two dye-swaps, were performed between cDNA of the sexual and vegetative samples. Hybridizations were made in the dark at 55°C for 16 hours. The microarrays used in this study [28, 63] were based on the genome sequence of Neurospora crassa, a close relative to N. intermedia. The whole-genome-spotted oligonucleotide microarray contained 9,826 open reading frames, each represented by a 70mer oligonucleotide probe that was robotically printed on CMT-GAPS-aminopolysilane-coated glass slides (Corning, Corning, NY) at the Yale University Center for Genomics and Proteomics, following the procedure by Kasuga et al. . The divergence between N. crassa and N. intermedia is very low, with the proportion of variable sites in coding regions of housekeeping genes being 1.2% . Therefore, we considered this 70-mer array, designed from the N. crassa genome sequence, to be appropriate for the analyses of expression of N. intermedia isolates, especially when analyzing differences in relative expression between strains of N. intermedia and not between N. intermedia and N. crassa[65, 66].
Microarray data acquisition and analysis
An Axon GenePix 4000B microarray scanner was used to acquire a two-channel color image of the array fluorescence. The microarray spots were located by using the GenePix Pro 6.0 software package (Axon Instruments, Foster City, CA) together with the array list file . Before data collection each microarray was manually screened and adjusted, and abnormal spots were excluded. We normalized the ratio results as in Townsend  using the online tool available at The Filamentous Fungal gene Expression Database [29, 69]. A Bayesian Analyses of Gene Expression Levels (BAGEL) was performed using the software UBAGEL 3.6 [30, 31] on the normalized data. For each gene, the relative expression level and a credibility interval of 95% was calculated.
We used the results from the BAGEL analysis to categorize genes among three exclusive alternatives: sexual, vegetative and constitutive. The sexual gene category contains all genes that exhibited a statistically significantly (P < 0.05) higher expression in the sexual tissue relative to the vegetative. The vegetative category contains all genes that exhibited a statistically significant higher expression in the vegetative tissue compared to the sexual, and the constitutive category consisted of all genes that did not exhibit differential expression between the sexual and vegetative samples. Q-values were calculated from our P-values using the software QVALUE  with the default settings. P < 0.05 corresponded to Q < 0.62 for the sexual category genes, and corresponded to Q < 0.51 for the vegetative category genes. Because the aim was to divide the genes into their most likely gene categories, the false discovery rate is not critical (inclusion of inappropriately classified genes would decrease effect size, and therefore be conservative). Nevertheless, we performed an additional analysis on a more strictly defined sexual gene category only including genes with Q < 0.10. The Q-value cutoff of Q < 0.10 was selected to minimize the false positives while still allowing enough genes in the category to convey statistical power in the analyses. We refer to this subdivision of the sexual category as the “sexual Q < 0.1”.
cDNA-library construction and EST sequencing of Neurospora intermedia
RNA extraction, subtractive cDNA library construction and EST sequencing were outsourced to Agencourt Bioscience Corporation (Beverly, MA). Total RNA was extracted from both the vegetative and the sexual tissue (pooled in equal amounts from both strains at different time points, see above), by using the Agencourt RNAdvance Tissue Kit. A subtracted cDNA library, enriched for sequences of genes expressed in the sexual sample, was prepared by the Suppression Subtractive Hybridization method  in which the cDNA from the vegetative sample was used to subtract the genes shared by the two samples. In addition, a separate library was constructed from the vegetative sample. Sequence data was composed of 5,376 reads, sequenced by using ABI (Applied Biosystems) sequencing technology: 3,840 of which were from the subtracted sexual cDNA library (1,920 clones, sequenced in two directions) and 1,536 of which were sequenced from the vegetative library (768 clones, sequenced in two directions). Thus, 1,920 clones from the subtracted sexual cDNA library and 768 clones from the vegetative library were sequenced in both forward and reverse directions. The EST-data have been deposited in EMBL/NCBI’s EST database, and are accessible through the accession number HE957083-HE961812.
EST data assembly
The ESTs were processed and assembled de novo with the phred/crossmatch/phrap toolchain [72, 73]. Basecalling, quality filtering and trimming was carried out with phred v0.071220.b using the default quality cutoff settings, after which crossmatch v1.090518 was used to screen out remaining vectors using the UniVec database . Phrap was used to assemble the sequence with overlapping regions into contiguous sequences referred to here as Phrap contigs. After filtering and assembly, the data consisted of 4,984 reads.
Three-way alignments of publicly available Neurospora sequences
The 3-way alignments were generated from N. crassa, N. discreta and N. tetrasperma (version 1 [77, 78]) by using the SYNERGY algorithm , which identifies orthologs based on the sequence similarity from pair-wise BLAST of each proteome (expectation cutoff < 1E-5) and synteny of loci across the 3 genomes. In total, a 3-way alignment was completed for 5,635 individual genes that were found to be unambiguously orthologous among the 3 species (Stajich et al., unpublished), corresponding to roughly half of the genes in the N. crassa genome.
Orthology search for the Neurospora intermedia sequences
A BLAST database  was built from the 3-way alignments and the N. crassa genome (downloaded on April 13, 2011) and served as reference for establishing orthology for the new and pre-processed N. intermedia sequence reads. Gene IDs were assigned to the N. intermedia reads by running megaBLAST  queries against the 3-way database. The queries were configured to use a length cutoff of 100 aligned bases to register a match. Of the 4,984 reads retained after assembly, megaBLAST against the 3-way database identified 2,573 reads as usable, corresponding to only a single gene target. By running BLAST on the Phrap-generated contigs built from the total read pool, 104 additional reads were identified, and finally, 188 reads were unambiguously identified using the less stringent BLASTN algorithm and an alignment cutoff at 60 bases. The 1,939 remaining sequence reads were not used in the subsequent analyses, but were run in BLAST against the N. crassa genome to provide additional gene statistics using the same algorithms, prioritizing megaBLAST hits over BLASTN hits. In total, 4,585 of the N. intermedia reads yielded a BLAST-hit to any of 1,392 N. crassa genes.
Adjustments of the 4-way alignments
A custom pipeline was written in Bash/Perl to assemble the identified N. intermedia data and align it to the three-way alignments on a gene-by-gene basis. For each of the genes in the 3-way set, we used BLAST to find the best Phrap contig and unassembled singlets to establish direction and alignable boundaries. These cues were used to identify local subalignments against which each Phrap product could be aligned using the G-INS-i algorithm and seed method in MAFFT [82, 83]. Frame-shifting inserts were filtered out and incomplete codons or premature stop codons were masked as missing data. The aligned N. intermedia data was merged into a single consensus sequence. In cases where aligned fragments were overlapping but had not been assembled into a contig by Phrap, remaining sequence length and contig size were used as tokens of sequence quality to resolve which fragment the consensus should be based on. Finally, the new four-way alignment was filtered so that only gap-less codon positions with data from all four species were included in downstream analyses. The program Gblocks  was used to remove poorly aligned regions before further analyses.
Identification of rapidly evolving genes
To identify rapidly evolving genes in Neurospora, we invoked two global ratio models using the maximum likelihood program codeml, included in the PAML package version 4.3 [32, 33]: one in which the global dN/dS was estimated as a free parameter, and one in which it was fixed to the mean dN/dS for all genes. The mean dN/dS was estimated by running the codeml analysis on a concatenated alignment of all 4-way orthologous gene alignments. These two models are nested, and differ in one estimated parameter, so a log-likelihood ratio test (LRT) with one degree of freedom was used to to evaluate the fit to the data for the two models, and thereby identify genes with a dN/dS significantly deviating from the mean. In the phylogeny of the four taxa included in the analyses, N. discreta represents the most ancestral branch, but the phylogenetic relationship is not fully resolved for the other three taxa . Thus, we used a completely unresolved star tree as an unrooted input phylogeny for the analyses.
To test for branch-specific rapidly evolving genes, i.e., genes evolving rapidly in the branches delineating N. intermedia, N. crassa, N. discreta or N. tetrasperma, we used the same approach but with the two-ratio model specifying the branch of interest as foreground and comparing the log-likelihood value with the value for the model where the foreground branch dN/dS was fixed at the mean for that branch. To adjust for multiple testing, Q-values were calculated from our P-values using the software QVALUE  and Q < 0.05 was considered significant.
Estimating global and branch-specific dN/dS for gene categories
To estimate the mean dN, dS, and dN/dS for each gene category distinguished by the microarray experiment (constitutive, sexual and vegetative) we implemented a bootstrap approach. For each bootstrap replicate, 10 randomly chosen genes were concatenated. The concatenation procedure was performed to overcome the uncertainty of the estimates typical for short sequences. On each concatenated alignment, we ran both the global ratio model and the four versions of the two-ratio models: one for each branch specified as the foreground branch. For each gene category, 1000 bootstrap replicates were performed, and the mean for the obtained estimates was calculated for each category. To test for differences in the distribution of the estimates for dN, dS and dN/dS between the different categories, a Wilcoxon rank sum test with continuity correction, as implemented in R, was performed, and P-values were adjusted for multiple testing using the method described by Benjamini and Hochberg . In addition, these analyses were performed on the small subset consisting of 26 sexual genes falling into the category “sexual Q < 0.1”.
PCR amplification and sequencing of candidate genes for positive selection
Strains for PCR and sequencing (Additional file 2: Table S2) were grown in test tubes with Vogel’s minimal medium with 1.5% sucrose  and DNA was extracted as in Karlsson et al.. Primers for selected genes were designed based on homologous regions of the 4-way alignment, and are given in Additional file 4: Table S3. PCRs were performed using the Expand High Fidelity PCR System (Roche Applied Science, Indianapolis, USA). PCR products were purified with ExoSAP-IT reagent (Amersham Biosciences, Uppsala, Sweden). Sequencing was performed using the BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, USA). The products were cleaned using BigDye XTerminator Purification Kit (Applied Biosystems), and then sequenced on an ABI3730XL (Applied Biosystems). The raw sequences were edited using the software package Sequencher 4.1.4 (Gene Codes Corporation, Ann Arbor, USA). The sequence alignments for each gene were adjusted manually using BioEdit version 7.0.0 .
Analyses of positive selection of candidate genes
To test if the higher than mean dN/dS estimates for individual sex-associated genes might be caused by positive selection on individual sites within genes, we chose five genes for additional analysis among a larger set of species, ranging from seven to twelve taxa for the different genes (Additional file 2: Table S2). The genes were selected from the list of rapidly evolving sex-associated genes based on their high dN/dS and the existence of suitable primer sites in the 4-way alignment. To evaluate the selective constraints acting on individual codon sites, we ran four models in the PAML package version 4.3 [32, 33]: M1a, M2a, M7 and M8. The models constitute two nested pairs (M1a + M2a and M7 + M8) with one model of each pair only containing site classes allowing dN/dS to vary between 0 and 1 (neutral models; M1a and M7), while the second model of each pair (selection models; M2a and M8) contains an additional site class in which dN/dS ≥ 1, thus allowing positive selection. We based the analysis on the phylogeny of the heterothallic Neurospora from Dettman et al. . Before the analysis, we excluded intronic sequences from the alignments and regions for which sequence data for less than half of the taxa were available. The Bayes empirical Bayes (BEB) calculation of posterior probabilities for site classes implemented in model M2a and M8 was used to identify codons likely to have evolved under positive selection .
Functional annotation of rapidly evolving sex-associated genes
Sex-associated genes identified to be rapidly evolving in the global model or in the N. intermedia-branch were individually studied. Translated amino acid sequences were analyzed with BLAST at NCBI and for conserved domains using the SMART protein analysis tool . SignalP 3.0  was used to search for signal peptide cleavage sites, and the big-PI Fungal Predictor program  was used to search for GPI-anchor sequences.
Statistical analyses of gene category characters
We investigated the phylogenetic distribution and chromosomal location of the genes of each gene category. The phylogenetic distribution for individual genes was taken from Kasuga et al. , which divided annotated protein coding genes from N. crassa into phylogenetic specificity classes depending on the phylogenetic relatedness in the tree of life of organisms with homologues of the gene. The chromosomal location was based on the annotation of the N. crassa genome ( downloaded on March 15, 2011). Over- and under-representation, for both phylogenetic distribution and chromosomal location, across the gene categories, were assessed using Fisher’s exact test. P-values were adjusted for multiple testing using the method described by Benjamini and Hochberg  as implemented in R. We used the significance level of P < 0.05.
Differences in codon usage between the gene categories was analyzed by performing multivariate (correspondence) analysis using the program CodonW version 1.4.4 (http://codonw.sourceforge.net/). The analysis was performed on the N. crassa gene sequences from the four taxon alignments used in the dN/dS analysis. Codon usage for gene categories was visualized by plotting the position of each gene on the resulting correspondence analysis axis 1 and 2.
Correlation between gene expression, gene length and evolutionary rate
Absolute expression for each gene was estimated by calculating the mean of the background-subtracted foreground intensity of the well-measured spots on the microarray . For genes within the sexual and the vegetative gene categories the absolute expression means were calculated on the expression in the specific tissue types only. The mean for each gene in the constitutive category and the estimate for all genes were estimated from all measurements of expression for each gene. To test for differences in the distribution of the per-gene expression between the different categories, a Wilcoxon rank sum test with continuity correction, as implemented in R, was performed. Linear regression was used to calculate the per-gene association between absolute expression and estimated global dN/dS, and between CDS length of N. crassa and estimated global dN/dS for each gene category and for the complete dataset. The linear regression analyses were performed in R.