The ability to sequence genomes and characterize their products has begun to reveal the central role for regulatory RNAs in biology, especially in complex organisms. It is now evident that the human genome contains not only protein-coding genes, but also tens of thousands of non–protein coding genes that express small and long ncRNAs (non-coding RNAs). Rapid progress in characterizing these ncRNAs has identified a diverse range of subclasses, which vary widely in size, sequence and mechanism-of-action, but share a common functional theme of regulating gene expression. ncRNAs play a crucial role in many cellular pathways, including the differentiation and development of cells and organs and, when mis-regulated, in a number of diseases. Increasing evidence suggests that these RNAs are a major area of evolutionary innovation and play an important role in determining phenotypic diversity in animals.
- non-coding RNA
- regulatory RNA
- regulation of gene expression
- small RNA
Recent advances in molecular biology, led by the large-scale sequencing of genomes and the characterization of transcriptomes, have revealed that animal genomes are far more complex and intricate than previously anticipated, containing a great diversity of sequences that can be effectors of genetic information, i.e. genes. Central to this has been the discovery that many genes do not encode proteins, but rather produce ncRNAs (non-coding RNAs), sometimes referred to as genomic ‘dark matter’. Unlike in prokaryotes and most unicellular eukaryotes, such as yeast, only a small fraction of the mammalian genome encodes protein, yet the vast majority of the mammalian genome is transcribed, producing tens of thousands of small and long ncRNAs [1,2].
The properties of RNA molecules, including their ability to form higher-order structures, to specifically hybridize with other RNAs or DNA, and to assemble RNA–protein complexes, make them effective and versatile regulatory molecules that can direct relatively generic effector proteins to sequence-specific targets. The functional versatility of RNA has led to the ‘RNA world hypothesis’, which postulates that the dual catalytic and informational storage properties of RNA provided the molecular platform for the evolution of early life. While proteins and DNA now fulfil most catalytic and information storage roles respectively, RNA continues to have a variety of functional roles, including as a regulator, in all kingdoms of life. The emergence of multicellular life, however, has required increasingly complex regulatory circuits to orchestrate the development and organization of specialized tissues and organs. Hence, although prokaryotes and unicellular eukaryotes do contain regulatory ncRNAs, the role of ncRNAs as regulators of gene expression has seemingly expanded in the genomes of multicellular organisms.
The explosion of ncRNA research has revealed that there is an abundance of small and long ncRNAs involved in regulating almost all steps of gene expression, including, but not limited to, chromatin modification, transcriptional control, mRNA degradation, translational efficiency and splicing [6,7]. Hence, ncRNAs function in a wide range of cellular processes, play crucial roles in development and disease, and may even play a central role in the evolution of different species and complex organisms [8–10].
The animal genome and the non-coding universe
For half a century, protein-centric convictions predicated on the central dogma of molecular biology dominated the discipline. However, the completion of multiple genome sequencing projects has revealed that protein-coding sequences encompass only a small fraction of animal genomes and less than 2% of the genome in humans and other mammals [11,12]. Thus either animal genomes are largely composed of ‘junk’ DNA, or they contain another form of genetic information that has thus far been overlooked. Several lines of evidence support the latter, including the positive correlation between the proportion of the genome that is ‘junk’/non-protein-coding and developmental complexity, the presence of extensive conserved non-coding sequences outside protein-coding regions, and the pervasive differential transcription of the vast majority of the genome [1,14].
Characterization of the mammalian transcriptome has revealed that RNA transcripts are produced from a considerably greater proportion of the genome than the ∼40% covered (including introns and exons) by known genes [1,15,16]. This pervasive transcription of the genome, defined as occurring when “the majority of (the genome's) bases are associated with at least one primary transcript”, has been identified by a number of independent techniques, including genome-wide tiling arrays, large-scale cloning and sequencing of cDNAs, and next-generation RNA sequencing. For example, results from the ENCODE project, which aimed to identify all functional elements within the human genome, demonstrated that at least 75% of bases were transcribed. A number of studies have shown that although the vast majority of the genome is transcribed at some level, most transcription, including novel unannotated transcription, clusters around known genes [16,18–20]. Such analyses led Kapranov et al. to propose “a model of genome organization where protein-coding genes are at the center of a complex network of overlapping sense and antisense (long) RNA transcription, with interleaved (small) RNAs” (Figure 1).
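The definition of pervasive transcription quoted above amounts to a coverage calculation: merge all transcript intervals and ask what fraction of bases falls inside at least one of them. The following is a minimal sketch of that arithmetic, not the method used by ENCODE or any cited study. The function names, the half-open coordinate convention and the toy intervals are assumptions made for illustration, and strand is ignored.

```python
from typing import Iterable, List, Tuple

# Assumed convention: half-open [start, end) coordinates on one chromosome.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so that no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def transcribed_fraction(transcripts: Iterable[Interval], genome_length: int) -> float:
    """Fraction of bases covered by at least one primary transcript."""
    covered = sum(end - start for start, end in merge_intervals(transcripts))
    return covered / genome_length


# Toy example: a hypothetical 1000-base "genome" with four overlapping transcripts.
toy_transcripts = [(0, 300), (250, 500), (600, 900), (850, 950)]
fraction = transcribed_fraction(toy_transcripts, 1000)
print(f"{fraction:.2f}")  # 0.85
```

Under this sketch, a genome would count as pervasively transcribed when the returned fraction exceeds 0.5; overlapping sense and antisense transcripts contribute to the covered total only once.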
The net result of pervasive transcription is a complex and interleaved transcriptome containing the products of approximately 20000 coding genes, along with at least as many, and possibly a much greater number of, transcripts that do not encode proteins and could function at the RNA level. These can be separated into two major groups (the small and long ncRNAs) on the basis of size and mechanism of synthesis.
Small RNAs are generally defined as ncRNAs shorter than 200 nt in length, and are usually produced by the post-transcriptional processing of longer transcripts by endogenous RNases (i.e. RNA cleavage enzymes). Based primarily on their size, mode of biogenesis and function, small RNAs can be divided into various subclasses. The three subclasses of small RNAs that have thus far received the most attention are those that participate in RNAi (RNA interference) pathways, namely, miRNAs (microRNAs), siRNAs (small interfering RNAs) and piRNAs (piwi-interacting RNAs), which produce mature RNAs ∼20–30 nt in length. Nucleotide sequence complementarity lies at the heart of the widespread and potent regulatory control that these small RNAs exert on their targets. In all known RNAi pathways, this regulatory control is mediated by binding of the small RNA to a complex of proteins, chief among them being the Argonaute proteins. There are also slightly larger small RNA species (∼100 nt) that have important cellular roles, including snoRNAs (small nucleolar RNAs), which guide RNA base modification [and can also be processed into other classes of regulatory small RNA, including miRNAs and sdRNAs (sno-derived RNAs)]; snRNAs (small nuclear RNAs), which mediate RNA splicing and are important components of the spliceosome; Y RNAs, which appear to regulate the Ro autoantigen; and vault RNAs, components of the vault ribonucleoprotein complex. Most small ncRNAs function as part of RNA–protein complexes to regulate gene expression, with the small RNA often acting to specify the target for regulation through nucleotide sequence complementarity (Figures 2A and 2B).
miRNAs are ∼22 nt long and bind to short regions of complementary sequence, usually located in the 3′ UTR (untranslated region), of target mRNAs. miRNAs can bring about the translational repression or degradation of target transcripts, depending on the extent of complementarity between them (Figure 2A). The ability of these small RNAs to modulate gene expression was first identified 20 years ago with the discovery of lin-4 in the nematode worm Caenorhabditis elegans [28,29]. Since then, the miRNA field has evolved rapidly and there are now over 1500 miRNAs annotated in the human genome. Mature miRNAs are produced as part of a two-step enzymatic process from a longer pri-miRNA (primary miRNA) that is processed into a pre-miRNA (precursor miRNA) in the nucleus. The pre-miRNA is then transported to the cytoplasm where it is cleaved to release mature miRNAs. The mature miRNA is then loaded on to the RISC (RNA-induced silencing complex), which is composed of the Argonaute2 protein, the miRNA and other auxiliary proteins. The significance of miRNAs in biological processes is highlighted by the fact that one miRNA can potentially target, and hence control the expression of, many hundreds of mRNAs [32,33]. miRNAs can also regulate transcription through mechanisms that are not yet fully understood in mammalian cells. It therefore seems likely that there is more to miRNA function than just the inhibition of translation. There are increasing reports describing the presence of mature miRNAs in the nucleus [35,36], suggesting that they may also be directly or indirectly involved in transcriptional gene silencing.
siRNAs are ∼21 nt long and function mainly by directing the degradation of transcripts to which they have perfect sequence complementarity. The precursor transcripts of endogenous siRNAs include dsRNAs (double-stranded RNAs), pseudogenes or transcripts with very long stem-loop structures [38,39]. Endogenous siRNAs have been proposed to protect eukaryotic cells from dsRNA viruses and are also important for silencing transposons and other ‘selfish’ genomic elements [40,41]. Since siRNAs use the same enzymatic machinery as miRNAs to function, synthetic siRNAs can be introduced into cells to ‘knock-down’ any given gene, a feature that has been exploited in scientific research and in a number of recently developed therapeutics.
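Target recognition by perfect sequence complementarity can be illustrated as a string search: a site in the transcript is perfectly complementary to a small RNA when it equals the reverse complement of that RNA. The sketch below is a toy illustration only. The function names and the 21 nt guide sequence are invented, and the sketch ignores everything about real siRNA targeting other than Watson–Crick pairing.

```python
from typing import List

# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (5'->3')."""
    return rna.translate(COMPLEMENT)[::-1]


def perfect_target_sites(guide: str, transcript: str) -> List[int]:
    """0-based start positions in `transcript` that pair perfectly with `guide`."""
    site = reverse_complement(guide)
    positions: List[int] = []
    start = transcript.find(site)
    while start != -1:
        positions.append(start)
        start = transcript.find(site, start + 1)
    return positions


# Hypothetical 21 nt guide and a toy transcript built to contain one perfect site.
guide = "AUGGCUAACGUUCGAUCCGAA"
transcript = "GGGAAA" + reverse_complement(guide) + "CCCUUU"
print(perfect_target_sites(guide, transcript))  # [6]

# A single mismatch in the site abolishes the perfect match in this model.
mutated = transcript[:10] + ("A" if transcript[10] != "A" else "C") + transcript[11:]
print(perfect_target_sites(guide, mutated))  # []
```

Requiring an exact match is a deliberate simplification that mirrors the perfect complementarity described for siRNAs; the partial complementarity typical of miRNA binding would need a more permissive scoring scheme.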
piRNAs are a related class of small (28–32 nt) RNAs that are found only in animals and principally expressed in spermatids. piRNAs are found in clusters in the genome, and appear to arise from long single-stranded precursor transcripts [44,45]. Unlike miRNAs and siRNAs, which bind to the Ago subclade of the Argonaute proteins, piRNAs associate with the Piwi subclade of Argonaute proteins. They have an important role in suppressing the expression of repetitive elements by guiding DNA methylation, and have been shown to be involved in gametogenesis (Figure 2B) [46,47], as well as in neuronal plasticity in the sea hare, Aplysia.
Apart from these major categories, other small ncRNAs have been described that originate from regions adjacent to transcription start-sites, such as tiRNAs (transcription initiation RNAs). These RNAs are ∼17–18 nt long and are abundant at active promoters as well as at loci with evidence of bidirectional transcription, and have been shown to influence the epigenetic state of the genomic region from which they are derived. RNAs of similar size [spliRNAs (splice site RNAs)] are also associated with splice sites. Other, but less well-defined, small RNAs found at or near transcription start-sites that have been reported include PASRs (promoter-associated small RNAs), PROMPTs (promoter upstream transcripts), TSSa-RNAs (transcription start-site-associated RNAs) and TASRs (gene termini-associated small RNAs).
Consistent with their role in gene regulation, small RNAs are involved in many cellular processes and their dysfunction is implicated in a number of physiological and developmental defects. For example, aberrant expression of miRNAs is implicated in a wide variety of diseases ranging from disorders of the heart to immune diseases and cancer. Additionally, a snoRNA has been demonstrated to play a central role in Prader–Willi syndrome, whereas another has been associated with autism.
lncRNAs (long ncRNAs)
lncRNAs are generally defined as ranging in size from ∼200 nt to over 100 kb in length [57,58]. Although the 200 nt cut-off for lncRNAs is quite arbitrary, it has the advantage of excluding most transcripts accepted to be members of small RNA classes. Unlike small ncRNAs, lncRNAs cannot be easily divided into different subclasses on the basis of sequence characteristics and mode-of-action, and this inability to classify lncRNAs into different subtypes has contributed to their current arbitrary definition. We have previously suggested a more flexible definition that lncRNAs are “noncoding RNAs that may have a function as either primary or spliced transcripts, which are independent of processing into known classes of small RNAs, such as miRNAs, piRNAs and snoRNAs, while also excluding structural RNAs from classical housekeeping families”, such as rRNAs.
The presence of lncRNAs with important functions has been known for some time, with the characterization of lncRNAs such as Xist (X-inactive specific transcript, which controls the silencing of one X chromosome in female mammals) in the 1990s [60,61]. However, the first database of eukaryotic lncRNAs had fewer than 12 entries by the end of the millennium. The identification of thousands of putative lncRNA transcripts from genome-wide transcriptome analysis [15,63,64], along with prominent examples of functional lncRNAs [65–67], demonstrated that lncRNAs such as Xist were not rare genomic oddities, but were instead the first characterized examples of a large class of novel genes. By the end of 2010 over 100 lncRNAs had been functionally characterized as part of a surge of research into the lncRNA world. The subset of lncRNAs transcribed from intergenic regions (i.e. genomic loci some distance from and not overlapping protein-coding genes) has received the most recent attention; these transcripts are known as lincRNAs (long/large intergenic ncRNAs). Along with those from intergenic loci, lncRNAs are transcribed from many other regions of the genome including promoters, enhancers, introns, UTRs, as overlapping or non-coding isoforms of coding genes, antisense to other transcripts and from pseudogenes [15,68–70].
Although lncRNAs are often of similar length to mRNAs, there are a number of differences between the two beyond the absence of a functional ORF (open reading frame) in lncRNAs. Analysis of lncRNA expression has shown they have lower expression levels and are more likely to be expressed in highly tissue- and cell-specific patterns [63,71–73]. Compared with most mRNAs and many small ncRNAs, lncRNAs are less highly conserved, although they do show evidence of conservation in their promoters, primary sequences and splice sites [15,71,73–75]. Furthermore, many lncRNAs consist of a single exon and those that are spliced have fewer exons than protein-coding genes [63,72,73]. lncRNAs also commonly contain transposable elements and other repeats. Sequences from such genetic elements can be ‘domesticated’ during evolution and contribute to lncRNA function by promoting their expression and providing functional motifs.
lncRNAs carry out a diverse range of functions in the cell (Figure 2C). Although few are reported to function catalytically, many carry out RNA–protein, RNA–DNA and RNA–RNA interactions. Similar to many small ncRNAs, lncRNAs can regulate gene expression via RNA–protein (ribonucleoprotein) complexes (Figure 2C). A common function of lncRNAs appears to be directing the activity of chromatin-modifying complexes and transcription factors by specifying their genomic DNA targets and activating or inhibiting their function [67,78–83]. In these and other contexts, lncRNAs have the ability to act as scaffolds, nucleating the assembly of larger complexes or cellular structures [84–86]. Other reported lncRNA functions include acting as miRNA sponges to ‘soak up’ miRNAs, relieving the repression of mRNAs and so controlling mRNA expression levels and mRNA translation [87,88].
lncRNAs can function both locally and in trans. An example of the former is Airn, which silences the expression of neighbouring genes to regulate imprinting of the Igf2r (insulin-like growth factor 2 receptor) locus [57,65]. Trans-acting lncRNAs include HOTAIR, which is expressed from the HOXC locus and acts to silence gene expression at many genomic locations, including the HOXD locus, by recruiting repressive chromatin modification complexes [67,89].
Given their range of functions, a number of lncRNAs are also implicated or involved in disease states, including functioning as oncogenes or tumour suppressors [87,90,91], as well as being linked to other complex diseases such as myocardial infarction and Alzheimer's disease.
Regulatory RNAs in prokaryotes
The versatility of RNA as a regulator has also been used by prokaryotes, which contain numerous ncRNAs. Similar to eukaryotes, prokaryotic ncRNAs are being found to play an increasingly important role in regulating gene expression. Despite these similarities, few ncRNA classes are shared between prokaryotes and eukaryotes, with the exception of snoRNAs, which are present in archaea (although not in bacteria).
Many prokaryotic sRNAs (small RNAs; abbreviation only used in prokaryotes) and antisense RNAs function as ncRNAs. sRNAs are generally defined as transcripts <500 nt and can be expressed from any region of the genome. Antisense RNAs include both sRNAs and longer RNAs that are antisense to coding genes, creating some overlap between the sRNA and antisense RNA classes. ncRNAs are common in prokaryotic genomes, with approximately 170 non-coding sRNAs predicted in the archaeon Methanosarcina mazei, ∼50 in the bacterium Listeria monocytogenes and 165 in Pseudomonas aeruginosa. In comparison, Pseudomonas was found to contain 384 antisense RNAs, whereas Helicobacter pylori was reported to have antisense transcription across most of the genome, covering 46% of ORFs.
Prokaryotic regulatory RNAs function by a variety of mechanisms (reviewed in [94,100]). Cis-antisense ncRNAs commonly act to repress the expression of the sense coding gene. Repression can occur at the level of transcription or RNA turnover, or by inhibiting translation, and the extensive nucleotide complementarity between the antisense and sense transcripts is important for many of these mechanisms. Examples include ncRNAs regulating the copy number of mobile elements such as plasmids and repressing the translation of toxic proteins, such as the SymR antisense sRNA in Escherichia coli that represses the synthesis of the SymE toxin protein [101,102].
Trans-encoded sRNAs generally act to repress translation or destabilize target RNA(s), demonstrating some functional similarity to eukaryotic miRNAs. With more limited complementarity than cis-antisense RNAs, trans RNAs often require the RNA chaperone protein Hfq to bind their targets. An example is the association of four sRNAs with Hfq in Vibrio cholerae to control quorum sensing by destabilizing the mRNA of the quorum-sensing master regulator. Some sRNAs also bind directly to proteins by mimicking other nucleic acid sequences. For example, the 6S sRNA mimics the structure of an open promoter to bind RNA polymerase and regulate transcription [94,104].
Another important class of prokaryotic ncRNAs is derived from CRISPRs (clustered regularly interspaced short palindromic repeats). First discovered in E. coli, CRISPRs are now known to exist in most bacteria and archaea. CRISPR loci contain short direct repeats interspersed with spacer regions derived from invading mobile elements. CRISPRs are transcribed and processed to generate small crRNAs (CRISPR RNAs), which function to protect the cell from invading bacteriophages and conjugative plasmids [106,107].
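The repeat–spacer architecture of a CRISPR locus can be pictured as a simple data structure: copies of one direct repeat delimit a series of spacers, each of which can be compared with the sequence of an invading element. The sketch below is illustrative only. The function names and every sequence in it are invented, it assumes identical repeat copies, and it models nothing beyond a verbatim sequence match on one strand.

```python
from typing import List


def extract_spacers(locus: str, repeat: str) -> List[str]:
    """Return the spacer sequences lying between consecutive copies of `repeat`."""
    parts = locus.split(repeat)
    # parts[0] precedes the first repeat and parts[-1] follows the last one;
    # only the pieces in between are spacers.
    return [p for p in parts[1:-1] if p]


def matching_spacers(spacers: List[str], invader: str) -> List[str]:
    """Spacers whose sequence occurs verbatim in the invading element."""
    return [s for s in spacers if s in invader]


# All sequences below are invented for illustration.
REPEAT = "GTTCCCCGC"
SPACER_1 = "ATGACCTTAGGCATCA"
SPACER_2 = "CCGTATTGACGGATAC"
locus = "AAAA" + REPEAT + SPACER_1 + REPEAT + SPACER_2 + REPEAT + "TTTT"

spacers = extract_spacers(locus, REPEAT)
phage = "GGGG" + SPACER_2 + "CACACA"  # hypothetical invader carrying SPACER_2
hits = matching_spacers(spacers, phage)
print(spacers)  # ['ATGACCTTAGGCATCA', 'CCGTATTGACGGATAC']
print(hits)     # ['CCGTATTGACGGATAC']
```

In this toy model, a non-empty `hits` list stands in for a locus that already holds a record of the invader; how crRNAs then act on the invading element is outside the scope of the sketch.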
The role of ncRNAs in evolutionary innovation
The sequencing and initial annotation of mammalian genomes provided two large surprises: the large fraction of the genome composed of sequences derived from transposable elements and the much lower than expected number of protein-coding genes [11,12]. In fact, the number of recognized human protein-coding genes (20687) is similar to that in the nematode worm C. elegans (20517) and that in a basal metazoan, the sponge Amphimedon queenslandica (18500–30000). Furthermore, much of the protein-coding ‘toolkit’ that controls multicellular processes in more complex animals is also present in the sponge. There is widespread use of alternative splicing in human genes, which can diversify the proteome without an increase in gene number, suggesting that it is one mechanism to explain differences in complexity. However, alternative splicing itself requires regulation and hence it has been hypothesized that increases in gene regulatory complexity underlie much of morphological complexity [5,8,10].
Moreover, given the relatively stable protein-coding complement, it is clear that most evolutionary adaptation occurs in regulatory sequences, which are fast evolving and show little conservation over long evolutionary distances [111–113]. The discovery that most of this non-coding DNA is dynamically transcribed to generate tens of thousands of ncRNAs [1,15,16] provides a hitherto unexpected mechanism to explain this increase in regulatory complexity.
ncRNAs, with their potential to bind DNA, RNA and protein in a sequence- or structure-specific manner, are versatile and effective regulatory molecules. By providing specificity to generic protein complexes, ncRNAs can act as guides to selectively target effector proteins to different loci and thereby regulate the transcription or translation of many genes [90,114].
Lastly, the pervasive transcription of different ncRNAs in the genome provides a large dynamic pool of transcripts for selection to act upon, as most ncRNAs are subject to more flexible structure–function constraints than protein-coding RNAs [17,111,115]. For instance, many ncRNAs function via the formation of stable secondary and tertiary structures, which can accommodate compensatory nucleotide substitutions, e.g. A:U base pairs to G:U or G:C, without disrupting their structural (and thus functional) integrity. Moreover, regulatory sequences are also subject to positive selection for adaptive radiation. The overarching conclusion is that regulatory ncRNAs represent a vast hidden layer of evolutionarily plastic cis- and trans-acting regulatory information that directs the epigenetic pathways that underpin animal development and diversity [8–10].
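The idea of a compensatory substitution can be made concrete by checking whether the two strands of a stem still pair after a change. The following is a minimal sketch under simple assumptions: the function name and the four-base stem are invented, only Watson–Crick and G:U wobble pairs are allowed, and loops, bulges and thermodynamic stability are ignored.

```python
# Allowed base pairs: Watson-Crick pairs plus the G:U wobble pair.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_is_paired(five_prime: str, three_prime: str) -> bool:
    """True if every position of the antiparallel stem forms an allowed pair.

    Both strands are written 5'->3', so base i of `five_prime` pairs with
    base len - 1 - i of `three_prime`.
    """
    if len(five_prime) != len(three_prime):
        return False
    return all(
        (base, three_prime[-1 - i]) in ALLOWED_PAIRS
        for i, base in enumerate(five_prime)
    )


# Hypothetical stem: the second base of the 5' strand (A) pairs with a U.
print(stem_is_paired("GAAC", "GUUC"))  # True: original A:U pair
print(stem_is_paired("GGAC", "GUUC"))  # True: A:U -> G:U, stem preserved
print(stem_is_paired("GGAC", "GUCC"))  # True: G:U -> G:C, stem preserved
print(stem_is_paired("GCAC", "GUUC"))  # False: C:U does not pair
```

The three substitutions that keep the stem intact illustrate why structured ncRNAs can tolerate primary sequence change, whereas the final case shows a substitution that would disrupt the structure unless compensated on the opposite strand.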
The last decade has revolutionized our understanding of genomes and what constitutes a gene. It has become increasingly apparent that many cellular functions are mediated by RNA, a realization that has far-reaching implications for understanding human biology and treating human disease. The following chapters outline the state-of-the-art in the characterization of various types of ncRNAs, although the continuing rapid pace of discovery and the unknown function of so many ncRNAs make it clear that much remains to be done before this poorly charted sphere of biology is fully explored.
• Non-coding RNA genes are abundant in the genome, with similar numbers of protein-coding and non-coding genes in humans.
• Non-coding RNAs are structurally diverse, ranging from less than 20 nt to over 100 kb in length.
• The properties of RNA molecules allow them to function both through sequence complementarity to other RNAs or DNA and by forming structures that can interact with proteins and/or nucleic acids.
• Many well-characterized subclasses of small non-coding RNAs are known, with each member of a subclass having a similar functional mechanism, whereas the subclasses and mechanism-of-action of long non-coding RNAs are much less well understood.
• Most functionally characterized non-coding RNAs (whether small or long) function in the regulation of gene expression.
• Non-coding RNAs play essential roles in many biological processes and are crucial for development and disease, and perhaps even the evolution of organisms.
The authors acknowledge the support of the Australian NHMRC (National Health and Medical Research Council) [NHMRC Australia Fellowship number 631668 (to J.S.M.)], the Australian Research Council [DECRA Fellowship (to R.J.T.)] and the University of Queensland [University of Queensland International Research Tuition Award and University of Queensland Research Scholarship (to A.C.)].
- © The Authors Journal compilation © 2013 Biochemical Society