The increasing number of available genomes has made it possible to compare genes across species and determine in which branch of the phylogenetic tree they are likely to have originated. This has led to the identification of many genes that are species- or lineage-specific. As they have no homologues in other species, they must have originated de novo, that is, from previously non-genic parts of the genome. However, some researchers have claimed that errors in the detection of homologues by sequence similarity search methods, such as BLAST, may largely explain these observations. One way to assess how many genes are missed in these searches is to perform sequence evolution simulations along a phylogenetic tree and then use BLAST to recover the homologues (Albà and Castresana, 2007). If BLAST fails to detect them, we have a sensitivity problem, and a percentage of the genes will be misclassified into younger age classes than they belong to.
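The logic of such a simulation can be illustrated with a minimal sketch. It is not the procedure used in the studies cited here: it evolves random protein sequences along a single branch with uniform substitutions, and it uses a simple identity cutoff as a crude stand-in for a BLAST search. The function names, the cutoff and the substitution model are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def evolve(seq, subs_per_site, rng):
    """Apply round(subs_per_site * length) random substitutions to seq.

    Assumption: every site and every replacement residue is equally likely,
    so there are no conserved motifs (a worst case for detection).
    """
    seq = list(seq)
    n_events = round(subs_per_site * len(seq))
    for _ in range(n_events):
        pos = rng.randrange(len(seq))
        # Always replace with a different residue; a site can be hit repeatedly.
        seq[pos] = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return "".join(seq)

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def missed_fraction(n_genes, length, subs_per_site, min_identity, seed=0):
    """Fraction of simulated genes whose evolved copy falls below the cutoff.

    Assumption: min_identity is a hypothetical detection threshold standing
    in for a real similarity search such as BLAST.
    """
    rng = random.Random(seed)
    missed = 0
    for _ in range(n_genes):
        ancestor = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        descendant = evolve(ancestor, subs_per_site, rng)
        if identity(ancestor, descendant) < min_identity:
            missed += 1
    return missed / n_genes

if __name__ == "__main__":
    for distance in (0.06, 0.5, 1.0, 2.0, 5.0):
        print(distance, missed_fraction(200, 200, distance, min_identity=0.3))
```

In this toy setting the missed fraction stays at zero for short branches and rises only when sequences are close to saturation. Real proteins contain conserved regions, so a uniform model of this kind would be expected to overestimate the failure rate relative to a realistic simulation.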
The simulations performed to date have all indicated that the percentage of error for proteins is relatively small (4.7% to 13.85%) even at long distances (from mammals to fungi or plants). As expected, the problem is worse for distant comparisons than for closer ones. For example, human and macaque, which diverged some 24 million years ago, display only 6 substitutions per 100 nucleotides. Lack of BLAST sensitivity is not going to be a problem for these species even when comparing neutrally evolving sequences. For more distant comparisons, it depends on whether the sequence is under selection or not. Proteins tend to contain motifs that are highly conserved, and for this reason BLAST works reasonably well even at long distances. The results of the simulations support the idea that many genes are likely to have originated recently. For example, only 14 S. cerevisiae proteins would fail to find homologues in S. paradoxus or S. mikatae due to BLAST errors (Moyers and Zhang, 2016). Although the authors of that paper interpret this as problematic, the strong contrast with the observed data (445 genes restricted to these species in Carvunis et al., 2012) supports the notion that new genes are continuously emerging.
Mar Albà
Update: A reply to Moyers & Zhang has been published in bioRxiv: "No evidence for phylostratigraphic bias impacting inferences on patterns of gene emergence and evolution".