1 | Introduction:
Small open reading frames (sORFs, also termed smORFs) below 100 codons were excluded by the FANTOM genome annotation consortium to filter out the high rate of false-positive sORFs detected below this size in eukaryotic long noncoding RNAs. A similar size cutoff of 50 codons was applied for prokaryotic gene annotation. These cutoffs were expedient, since the number of known genes pales in comparison to the number of background ORF-like sequences within a genome, most of which are not expressed, but they resulted in systematic under-detection of functional sORFs. Many expressed sORFs have now been discovered: recent studies have converged on hundreds of previously unannotated sORFs in bacteria and thousands in humans, and multiple CRISPR screens have suggested that hundreds of human sORFs are required for cell survival and proliferation. The emerging relevance of sORFs to infectious disease, the microbiome, and human disease opens the possibility of new therapeutic strategies, and consortium efforts to enter translated sORFs into genome annotations are accordingly underway.
Early discoveries of functional sORF-encoded polypeptides, such as humanin in humans, tarsal-less/polished rice in Drosophila, and SgrT in bacteria, were made individually. As a result, the global nature of sORF translation was not recognized until the seminal demonstration of ubiquitous translating ribosome occupancy outside canonical reading frames by Ingolia et al. and the subsequent confirmation, by mass spectrometry, that a large number of unannotated sORF translation products are present in cells. The products of sORF translation have been termed small proteins, microproteins, micropeptides, sORF-encoded polypeptides (SEPs) and, evocatively, ghost proteins; we use the term microprotein throughout this review. Longer non-annotated proteins, in some cases referred to as alternative proteins, particularly when they overlap canonical proteins, have also been identified, but they are not specifically discussed herein. For the purpose of this review, our definition of a eukaryotic microprotein extends to previously unannotated proteins below 130 amino acids, as many previously undetected ORFs of this length have been reported in human cells. Prokaryotic microproteins are typically categorized as less than or equal to 50 amino acids in length; however, our definition in this work extends to 70 amino acids, since many unannotated microproteins of this size have been detected in multiple bacterial species.
Multiple classes and regions of RNA, both coding and noncoding, have been shown to harbor sORFs in prokaryotes and eukaryotes (Figure 1). Functional sORFs have been discovered in small and long noncoding RNAs (ncRNAs and lncRNAs), antisense lncRNAs, microRNA precursors, and circular RNAs [7,43] in bacteria, plants and other eukaryotes. Interestingly, an increasing number of genes have been shown to exert functions both at the level of the RNA and of the encoded microprotein, such as sgrST, azuCR, Spot42/SpfP, and some microRNAs (miRNAs). Additional classes of sORFs have been identified in multicistronic mRNAs alongside canonical protein coding sequences (CDS) in both prokaryotes and, surprisingly, eukaryotes. sORFs in 5′ untranslated regions (UTRs) relative to an annotated CDS are referred to as upstream ORFs (uORFs). Importantly, while eukaryotic uORFs have long been regarded as cis-translational regulators that generally decrease translation efficiency of the downstream CDS, in some instances uORFs encode microproteins with independent cellular functions in trans, such as MIEF1-MP, which regulates mitochondrial protein translation, and ASDURF, which is a previously unidentified component of the prefoldin-like module of the PAQosome. Some sORFs that initiate in the 5′ leader extend into and overlap the CDS in an alternative reading frame, and can be termed overlapping uORFs (o.uORFs); an example is human alt-RPL36, which overlaps ribosomal protein L36 and regulates the phospholipid transporter TMEM24. It is important to note that, because an o.uORF is translated in a different reading frame, the amino acid sequence of its polypeptide is completely different from that of the downstream, overlapping annotated protein. At the other end of the mRNA, the 3′ UTR has also been found to encode microproteins from downstream ORFs (dORFs), which may also regulate CDS translation. An emerging class of frameshifted sORFs occurs internally within a protein CDS.
These nested sORFs lie completely within the main ORF, with translation initiating downstream of the main ORF start codon and terminating upstream of the main ORF stop codon. Nested sORFs can occur in the +2 or +3 (same-strand, frameshifted) reading frames (Figure 2), such as E. coli GndA and human alt-FUS. Surprisingly, these findings indicate that mammalian mRNAs may be multicistronic or dual coding. While prokaryotic organisms are known to express polycistronic mRNA transcripts termed operons, and compact viral genomes have long been known to contain overlapping open reading frames, eukaryotic transcripts have long been thought to be monocistronic as a result of the scanning model of translation initiation. Importantly, microproteins and longer alternative proteins encoded in each of these classes of sORFs have been shown to be functional. In summary, coding and noncoding regions of both prokaryotic and eukaryotic genomes encode functional sORFs in loci that are denser and more complex than previously presumed.
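The positional classes defined above (uORF, o.uORF, dORF and nested sORF) can be restated as coordinate comparisons between a sORF and the annotated CDS on the same mRNA. The following Python sketch is purely illustrative. The `Interval` type, the function names, the "unclassified" fallback, and the 0-based, half-open coordinate convention (stop codon included) are assumptions introduced here, not part of any annotation pipeline discussed in this review. The CDS is taken to define the +1 frame, so that a start codon shifted by one or two nucleotides falls in the +2 or +3 frame, respectively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open [start, end) ORF span in 0-based mRNA coordinates.

    Hypothetical helper type; the stop codon is assumed to be included.
    """
    start: int
    end: int

    def __post_init__(self):
        # An ORF spans a whole number of codons.
        if self.end <= self.start or (self.end - self.start) % 3 != 0:
            raise ValueError("an ORF must have positive length divisible by 3")


def frame_label(sorf: Interval, cds: Interval) -> str:
    """Reading frame of the sORF relative to the CDS (CDS defined as +1)."""
    # Python's % returns a non-negative result for a positive modulus,
    # so sORFs that start upstream of the CDS are handled correctly.
    offset = (sorf.start - cds.start) % 3
    return f"+{offset + 1}"


def classify_sorf(sorf: Interval, cds: Interval) -> str:
    """Assign a positional class following the definitions in the text."""
    frameshifted = frame_label(sorf, cds) != "+1"

    if sorf.end <= cds.start:
        return "uORF"    # lies entirely within the 5' UTR
    if sorf.start >= cds.end:
        return "dORF"    # lies entirely within the 3' UTR
    if sorf.start < cds.start and frameshifted:
        return "o.uORF"  # starts in the 5' leader, overlaps the CDS out of frame
    if cds.start < sorf.start and sorf.end < cds.end and frameshifted:
        return "nested sORF"  # internal to the CDS in the +2 or +3 frame
    # In-frame overlaps (e.g. N-terminal extensions) are outside the classes above.
    return "unclassified"


if __name__ == "__main__":
    cds = Interval(100, 400)  # made-up coordinates for illustration only
    for sorf in (Interval(10, 70), Interval(59, 140),
                 Interval(152, 242), Interval(420, 480)):
        print(sorf, classify_sorf(sorf, cds), frame_label(sorf, cds))
```

Coordinates alone are sufficient here because each class in the text is defined by where translation starts and stops relative to the CDS and by whether the frame is shared. Evidence of translation or function, which the text treats as the real criterion for a microprotein, is outside the scope of this sketch.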