1 | Introduction:
Open reading frames shorter than 100 codons (small ORFs, or sORFs; also
termed smORFs) were excluded by the FANTOM genome annotation consortium
to filter out the high rate of false-positive ORFs detected below this
size in eukaryotic long noncoding RNAs. A similar cutoff of 50 codons
was applied in prokaryotic gene annotation. These cutoffs were
expedient, since known genes are vastly outnumbered by the background
ORF-like sequences within a genome, most of which are not expressed,
but they resulted in systematic under-detection of functional sORFs.
Many expressed sORFs have now been discovered: recent
studies have converged on hundreds of previously unannotated sORFs in
bacteria and thousands in humans, and multiple CRISPR screens have
suggested that hundreds of human sORFs are required for cell survival
and proliferation. The emerging relevance of sORFs to infectious
disease, the microbiome, and human disease opens the possibility of new
therapeutic strategies; accordingly, consortium efforts to incorporate
translated sORFs into genome annotations are underway.
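To illustrate how a minimum-length cutoff of the kind described above behaves, the following Python sketch scans the three forward reading frames of a toy sequence for ATG-initiated ORFs and then applies a codon threshold. It is not the procedure used by FANTOM or by any annotation pipeline. The function name find_orfs, the min_codons parameter, the decision to count sense codons without the stop codon, and the toy sequence are all assumptions made for illustration; the reverse strand and non-ATG start codons are ignored.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 0) -> List[Tuple[int, int, int]]:
    """Return (start, end, n_codons) for ATG-initiated forward-strand ORFs.

    start is the 0-based index of the A of the ATG, end is the index just
    past the stop codon, and n_codons counts sense codons (start codon
    included, stop codon excluded; an assumed convention). For each stop
    codon only the most upstream in-frame ATG is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


# Toy sequence (hypothetical): a 21-codon sORF, a 2-nt spacer that shifts
# the frame, then a 121-codon ORF.
short_orf = "ATG" + "GCT" * 20 + "TGA"
long_orf = "ATG" + "GCT" * 120 + "TAA"
toy = short_orf + "CC" + long_orf

all_orfs = find_orfs(toy)                   # both ORFs are reported
kept = find_orfs(toy, min_codons=100)       # the sORF is filtered out
print(all_orfs)
print(kept)
```

With no cutoff both ORFs are returned; with min_codons=100 only the 121-codon ORF survives, so the short ORF would go unreported whether or not it is translated.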
Early discoveries of functional sORF-encoded polypeptides, such as
humanin in human, tarsal-less/polished rice in Drosophila and SgrT in
bacteria, occurred individually. As a result, the global nature of sORF
translation was not recognized until the seminal demonstration by
Ingolia et al. of widespread occupancy by translating ribosomes outside
canonical reading frames, and the subsequent confirmation by mass
spectrometry of a large number of unannotated sORF translation
products. The products of sORF translation have been termed
small proteins, microproteins, micropeptides, sORF-encoded polypeptides
(SEPs) and, evocatively, ghost proteins; we use the term microprotein
throughout this review. Longer unannotated proteins have also been
identified and are in some cases referred to as alternative proteins,
particularly when they overlap canonical proteins, but they are not
specifically discussed here. For the
purpose of this review, our definition of a eukaryotic microprotein will
extend to previously unannotated proteins below 130 amino acids, as many
previously undetected ORFs of this length have been reported in human
cells. Prokaryotic microproteins are typically categorized as less than
or equal to 50 amino acids in length; however, our definition in this
work will extend to 70 amino acids since many unannotated microproteins
of this size have been detected in multiple bacterial species.
Multiple classes and regions of RNA, both coding and noncoding, have
been shown to harbor sORFs in prokaryotes and eukaryotes (Figure 1).
Functional sORFs have been discovered in small and long noncoding RNAs
(ncRNAs and lncRNAs), antisense lncRNAs, microRNA precursors, and
circular RNAs [7,43] in bacteria, plants and other eukaryotes.
Interestingly, an increasing number of genes have been shown to exert
functions both at the level of the RNA and of the encoded microprotein,
such as sgrST, azuCR, Spot42/SpfP, and some microRNAs
(miRNAs). Additional classes of sORFs have been identified in
multicistronic mRNAs alongside canonical protein coding sequences (CDS)
in both prokaryotes and, surprisingly, eukaryotes. sORFs in 5′
untranslated regions (UTRs) relative to an annotated CDS are referred to
as upstream ORFs (uORFs). Importantly, while eukaryotic uORFs have long
been regarded as cis-translational regulators that generally decrease
translation efficiency of the downstream CDS, in some instances, uORFs
encode microproteins with independent cellular functions in trans, such
as MIEF1-MP, which regulates mitochondrial protein translation, and
ASDURF, which is a previously unidentified component of the
prefoldin-like module of the PAQosome. Some sORFs that initiate in the
5′ leader extend into and overlap the CDS in an alternative reading
frame; these can be termed overlapping uORFs (o.uORFs). One example is
human alt-RPL36, which overlaps the ribosomal protein L36 coding
sequence and regulates the phospholipid transporter TMEM24. Because
o.uORFs are translated in a different reading frame, the amino acid
sequence of their polypeptide products is completely different from
that of the downstream, overlapping annotated protein. At the other end
of the mRNA, the 3′ UTR
has also been found to encode microproteins from downstream ORFs
(dORFs), which may also regulate CDS translation. An emerging class of
frameshifted sORFs occurs internally within a protein CDS. These nested
sORFs lie completely within the main ORF, with translation initiating
downstream of the main ORF start codon and terminating upstream of the
main ORF stop codon. Nested sORFs can occur in the +2 or +3
(same-strand, frameshifted) reading frames (Figure 2); examples include
E. coli GndA and human alt-FUS. Surprisingly, these findings suggest
that mammalian mRNAs may be multicistronic or dual
coding. While prokaryotes are known to express polycistronic mRNAs from
operons, and compact viral genomes have long been known to contain
overlapping open reading frames, eukaryotic transcripts were long
thought to be monocistronic as a consequence of the scanning model of
translation initiation. Importantly, microproteins
and longer alternative proteins encoded in each of these classes of
sORFs have been shown to be functional. In summary, coding and noncoding
regions of both prokaryotic and eukaryotic genomes encode functional
sORFs in loci that are denser and more complex than previously presumed.
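The positional classes described in this section (uORF, o.uORF, nested sORF and dORF) can be summarized as a simple decision rule on transcript coordinates. The Python sketch below is one possible encoding of those definitions, not a published classifier. The Orf type, the function names, the 0-based half-open coordinate convention, the numbering of frames as +1 (same frame as the CDS), +2 and +3, and the catch-all label "other" are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """An ORF in transcript coordinates (0-based, half-open; hypothetical type)."""
    start: int  # index of the first nucleotide of the start codon
    end: int    # index just past the last nucleotide of the stop codon

    def __post_init__(self):
        if self.end <= self.start or (self.end - self.start) % 3 != 0:
            raise ValueError("ORF length must be a positive multiple of 3")


def relative_frame(sorf: Orf, cds: Orf) -> int:
    """Return 1 if sorf shares the CDS frame, else 2 or 3 (assumed convention)."""
    return (sorf.start - cds.start) % 3 + 1


def classify_sorf(sorf: Orf, cds: Orf) -> str:
    """Classify a same-strand sORF by its position relative to an annotated CDS."""
    if sorf.end <= cds.start:
        return "uORF"    # lies entirely within the 5' UTR
    if sorf.start >= cds.end:
        return "dORF"    # lies entirely within the 3' UTR
    shifted = relative_frame(sorf, cds) != 1
    if shifted and sorf.start < cds.start < sorf.end:
        return "o.uORF"  # starts in the 5' leader, overlaps the CDS out of frame
    if shifted and cds.start < sorf.start and sorf.end < cds.end:
        return "nested"  # starts and stops inside the CDS, in the +2 or +3 frame
    return "other"       # e.g. in-frame overlaps, not named in the text


# Hypothetical coordinates on a single transcript.
cds = Orf(100, 400)
examples = [Orf(10, 70), Orf(80, 140), Orf(152, 242), Orf(420, 480)]
labels = [classify_sorf(s, cds) for s in examples]
for s, label in zip(examples, labels):
    print(s, "frame +%d" % relative_frame(s, cds), label)
```

In-frame overlaps are deliberately left as "other", because an ORF that shares the CDS frame yields an extended or truncated form of the annotated protein rather than a distinct amino acid sequence.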