It is now clear that eukaryotic cells produce many thousands of non-coding RNAs. The least well-studied of these are longer than 200 nt and are known as lncRNAs (long non-coding RNAs). These loci are of particular interest as their biological relevance remains uncertain. Sequencing projects have identified thousands of these loci in a variety of species, from flies to humans. Genome-wide scans for functionality, such as evolutionary and expression analyses, suggest that many of these molecules have functional roles to play in the cell. Nevertheless, only a handful of lncRNAs have been experimentally investigated, and most of these appear to possess roles in regulating gene expression at a variety of different levels. Several lncRNAs have also been implicated in cancer. This evidence suggests that lncRNAs represent a new class of non-coding gene whose importance should become clearer upon further experimental investigation.
- dosage compensation
- evolutionary conservation
Transcription (so-called ‘dark matter’) outside the boundaries of currently annotated protein-coding gene exons has frequently been detected [2,3]. The biological significance of these transcripts, however, remains far from clear. Scientific opinion is divided between those advocating that all are functional and those proposing that, without strong experimental evidence, they should be considered functionally inert; other opinions lie between these polar extremes [4–6]. Transcripts that exceed 200 nt in length and are apparently non-coding may be categorized as lncRNAs [long ncRNAs (non-coding RNAs); also previously designated ‘large’ ncRNAs]. A subset of these which do not overlap known protein-coding gene loci are known as lincRNAs (long intergenic ncRNAs). These have been preferred for investigation because their transcripts and functions are more likely to be independent of known protein-coding genes. It is this subset of the larger class of lncRNAs that we shall mainly discuss in the present chapter. Two examples of mouse lncRNA loci are shown in Figure 1. Most such loci generate apparently non-coding transcripts and can be complex, with transcripts produced whose genomic sequences overlap on both strands. Non-coding transcript maps are also extensive: the majority of nucleotides have been suggested to be transcribed at some point during normal development in, for example, Drosophila melanogaster.
Metazoan genomes are currently predicted to contain thousands of these loci, from 1119 in the fruitfly to more than 8000 in the human genome [10,11]. Such loci can be described as genes since they show some of the transcriptional, chromatin and evolutionary features of protein-coding genes. Nevertheless, this is not meant to imply that each is functional. For example, some transcripts (RNA molecules) may not themselves transact a function, even if the act of their transcription is functional, for example by transcriptional interference.
Like the majority of mRNAs, many lncRNAs are thought to be polyadenylated and transcribed by RNA Pol II (RNA polymerase II). Recent mouse and human lincRNA sets have been defined using chromatin immunoprecipitation experiments targeting H3K4me3 (histone H3 Lys4 trimethylation) and H3K36me3 (histone H3 Lys36 trimethylation) modifications [12,13], which are markers for RNA Pol II activity. Such lncRNAs may be spliced, and tend to be expressed at low levels and in a tissue-specific manner, with many thought not to be exported from the nucleus. Nevertheless, there is also evidence for abundant lncRNA transcripts that are not polyadenylated. Little is yet known about such genes, and thus they are expected to be the subject of extensive study in future years.
In the present chapter we start by describing how lncRNAs have been identified and computationally categorized, before then presenting evidence for molecular and cellular functions for a selection of lncRNAs and their possible involvement in disease processes.
Definition of intergenic transcripts
The first step in defining a set of lincRNAs is to identify transcripts that map to genomic regions lying outside the boundaries of currently annotated protein-coding gene models. The function of a mapped transcript that overlaps, on either strand, a protein-coding gene can be difficult to distinguish experimentally, using targeted knockout or knockdown approaches, from the function of this protein-coding gene. As a consequence, experimental investigation of lncRNAs has more often been focused on those that map to intergenic sequence. Determining whether a lncRNA locus is entirely intergenic is made more problematic because lncRNA transcripts are often incomplete and because they can emanate from a protein-coding gene's promoter or enhancer on either strand. To increase the likelihood that the functions of a lncRNA locus are independent of those of its adjacent protein-coding genes, some studies have restricted their analyses to lincRNAs whose loci lie beyond a minimum distance from the nearest gene model or to those whose orthologous sequences in related species are also non-protein-coding.
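The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used by any of the studies cited: the `Interval` type, the function names and the `min_distance` parameter are all hypothetical, coordinates are assumed to be 0-based and half-open, and strand is deliberately ignored because overlap with a gene model on either strand disqualifies a transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A genomic interval; coordinates are 0-based and half-open (assumed)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base, on either strand."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def distance(a: Interval, b: Interval) -> float:
    """Gap in bp between two intervals; 0 if they overlap or abut,
    infinity if they lie on different chromosomes."""
    if a.chrom != b.chrom:
        return float("inf")
    return max(0, a.start - b.end, b.start - a.end)


def intergenic(transcripts, genes, min_distance=0):
    """Keep transcripts that overlap no gene model and whose nearest
    gene model lies at least min_distance bp away (hypothetical filter)."""
    kept = []
    for t in transcripts:
        if any(overlaps(t, g) for g in genes):
            continue
        nearest = min((distance(t, g) for g in genes), default=float("inf"))
        if nearest >= min_distance:
            kept.append(t)
    return kept
```

With `min_distance=0` only transcripts that overlap a gene model are discarded; raising it mimics the stricter sets that require lincRNA loci to lie some distance from the nearest annotated gene. A real pipeline would use an interval index rather than this quadratic scan.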
Intergenic transcripts can be detected experimentally using tiling microarrays. Results from such experiments have been controversial, and early lincRNA collections thus relied primarily on sequenced cDNA and EST (expressed sequence tag) clones. More recently, these catalogues have been superseded by lincRNAs derived from whole transcriptome sequencing (RNA-Seq). This approach generates millions of short (35–100 nt) sequence reads in parallel, and has confirmed that large amounts of intergenic sequence are transcribed into lincRNAs. The high-throughput and relatively unbiased nature of this technique permits detailed assessment of the contribution of lincRNAs to the transcriptomes for a variety of tissues and/or species under different conditions.
Discrimination of coding from non-coding transcripts
Once a set of intergenic transcripts has been defined it is critical to separate those that have protein-coding potential from others that are true ncRNAs. It is relatively straightforward to assign transcripts with open reading frames exceeding 100 codons as being protein-coding. Nevertheless, not all remaining transcripts will be non-coding, as they will also include transcripts encoding shorter polypeptides. To distinguish non-coding from coding transcripts more accurately, more sophisticated approaches have been developed. For example, the Coding Potential Calculator considers six features of a transcript, including the proportion of the transcript covered by the candidate peptide-encoding region, and the polypeptide's sequence similarity to known proteins. An evolutionary approach, adopted by PhyloCSF, predicts ncRNAs when their between-species sequence differences exhibit no bias as to whether they do or do not disrupt putatively encoded peptides.
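The simplest of these screens, the open reading frame length threshold, can be illustrated as follows. This is a rough sketch under stated assumptions, not the algorithm of the Coding Potential Calculator or of PhyloCSF: the function names are hypothetical, only the three forward frames of the transcript are scanned, an ORF is taken to run from an ATG to the first in-frame stop codon, and its length is counted without the stop codon. The coverage value corresponds to just one of the features mentioned above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq: str) -> int:
    """Length in codons (start codon included, stop codon excluded) of the
    longest complete ORF in the three forward frames of a transcript."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def orf_screen(seq: str, min_codons: int = 100) -> dict:
    """Flag a transcript as putatively coding if its longest ORF exceeds
    min_codons, and report the fraction of the transcript that ORF covers."""
    n = longest_orf_codons(seq)
    coverage = 3 * n / len(seq) if seq else 0.0
    return {"orf_codons": n, "coverage": coverage, "putative_coding": n > min_codons}
```

A transcript that passes this screen is not thereby shown to be non-coding: as noted above, those encoding polypeptides shorter than the threshold would be missed, which is why such a filter is normally combined with other evidence.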
Rather than relying on predictions, there will always be a preference for the protein-coding capability of a transcript to be determined experimentally. Large proteomic databases are now available for several species, and these can be used to investigate whether the RNA molecule is translated into protein. In vitro translation assays have been developed, but their results do not necessarily reflect in vivo biology. Associations between the candidate lincRNA and the ribosome can also be tested, with the expectation that a true lincRNA will not be translated and therefore would not be associated with this cellular organelle. Nevertheless, a study has reported that approximately half of a set of putative lincRNAs are ribosome-associated, leaving in doubt whether this test is accurate in discriminating coding from non-coding transcripts, or whether half of these transcripts are, instead, protein-coding. Experimental determination of an RNA sequence-dependent or -independent function for a transcript will be necessary for its assignment as a lincRNA. However, this may not always be sufficient because some transcripts will possess both RNA- and coding-sequence-dependent functions.
A computational or experimental method that discriminates accurately between coding and non-coding transcripts is thus currently lacking. A good compromise is to rely on in silico screens for protein-coding potential of putative lincRNAs, but to remain vigilant that these will contain false positive predictions, especially for genes encoding short polypeptides. A study of an individual lincRNA locus should seek to determine whether its mature transcribed RNA molecule is indeed the biologically relevant moiety or, instead, whether it is the act of transcription or the action of any short polypeptide encoded by the mRNA which is required for its function.
Genome-wide indicators of lincRNA functionality
Although many genomes that have been studied contain a considerable number of lincRNA loci, the proportion and number of these that are biologically functional have proved particularly controversial. Unlike for protein-coding genes, because the functional mechanisms of most non-coding transcripts or transcript regions are unknown, point mutation or deletion experiments are difficult to design, and their results are difficult to interpret. In addition, techniques such as RNAi (RNA interference), which reduce the abundance of a transcript, may result in no observable outcome when it is the act of transcription, rather than the mature transcript, that mediates functionality, or if only very low levels of the RNA transcript are required for functionality.
lincRNAs may fold into secondary structures that could be required for the mature RNA molecules to exert their functions. The ENCODE pilot study, which studied 1% of the human genome, predicted between 1500 and 1800 such structured regions. However, there was little overlap in structure predictions produced using different methods, which suggests that they contain a large number of false-positive predictions.
An alternative, and complementary, approach to discriminate between functional and non-functional lincRNAs is to analyse their evolutionary signatures of functional constraint when comparing sequences among related species. This approach assumes that when a locus is functional then deleterious mutations in its sequence which disrupt this function will be preferentially purged from the population. Functional sequences will therefore show better conservation between species than neutrally evolving sequence. Nevertheless, some of the best-studied lincRNAs (such as Xist, see below) are known to be poorly conserved between species. On a genomic scale, a set of 3122 lincRNAs defined by cDNA sequences was shown to be evolutionarily conserved relative to genomically neighbouring, presumed neutral, sequence. lincRNA promoters are particularly highly constrained, suggesting that it could be the act of transcription itself that is more often important for these lincRNAs' functionality [16,23]. A similar pattern was observed in a second set of 1675 mouse lincRNAs which was defined using chromatin markers for active promoters (H3K4me3) and actively transcribed exonic sequence (H3K36me3), and also in a collection of 1119 lincRNAs in D. melanogaster that were discovered using RNA-Seq transcriptome evidence.
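The logic of comparing a lincRNA with neighbouring, presumed neutral, sequence can be sketched as a ratio of substitution rates. The code below is an illustrative simplification, not the method of the studies cited: the function names are hypothetical, the inputs are assumed to be pre-computed pairwise alignments between two species, and the raw fraction of mismatched columns is used in place of a model-corrected substitution rate.

```python
def substitution_rate(aligned_a: str, aligned_b: str) -> float:
    """Fraction of ungapped, unambiguous alignment columns at which the
    two sequences differ (an uncorrected estimate of divergence)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x in "-N" or y in "-N":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable columns in alignment")
    return differences / compared


def relative_rate(locus_alignment, neutral_alignment) -> float:
    """Substitution rate of a locus divided by that of flanking, presumed
    neutral, sequence; values below 1 are consistent with constraint."""
    neutral = substitution_rate(*neutral_alignment)
    if neutral == 0:
        raise ValueError("neutral sequence shows no substitutions")
    return substitution_rate(*locus_alignment) / neutral
```

For example, a locus differing at 1 of 10 aligned sites, flanked by neutral sequence differing at 3 of 10, gives a relative rate of about 0.33. A single locus is far too short for such a ratio to be meaningful by itself; the genome-wide analyses described above aggregate over hundreds or thousands of loci.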
Analysis of expression patterns
Instead of considering sequence conservation, other approaches have considered the conservation of transcription to be an indicator of these transcripts' functionality. A detailed analysis of four mouse lincRNAs revealed that their brain expression patterns can be conserved between diverse vertebrates, such as chicken and opossum, a marsupial. lincRNAs can also be identified between more diverse species where their position, if not the primary sequence, is conserved. Comparisons between Drosophila and mouse have revealed an excess of these positionally equivalent lincRNAs. This was also seen when comparing lincRNAs in zebrafish and humans, where the mutant phenotype of two of these zebrafish lincRNAs could even be rescued by the mature form of the positionally conserved mouse or human sequence. Few lincRNAs are, however, so deeply conserved in their expression profiles. Of a sample of eight mouse lincRNAs validated by Northern blotting, five were also found to be expressed in rat, but none were found to be expressed in any of the human tissues or cell lines tested.
Another proposed indicator of functionality is when lincRNAs exhibit differential gene expression levels in different tissues or at different timepoints. It is argued that variations in gene expression levels reflect transcriptional regulation. For example, a survey of in situ hybridization images from the adult mouse brain collected by the Allen Mouse Brain Atlas revealed 849 lincRNAs that are expressed in the brain, 513 of which showed distinct regional patterns of expression. Nevertheless, differences in transcript abundance might reflect inconsequential transcriptional events resulting from tissue- or developmental-stage-specific transcription factor binding and thus might not be considered to provide strong indicators of functionality.
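One way to put a number on how tissue-restricted a transcript's expression is, offered here only as an illustration, is the tau specificity index, a commonly used summary statistic that the studies discussed above did not necessarily employ. The function name and the input format (one non-negative expression value per tissue) are assumptions made for this sketch.

```python
def tau(expression) -> float:
    """Tau tissue-specificity index: 0 for uniform expression across
    tissues, 1 for expression confined to a single tissue."""
    values = [max(0.0, float(v)) for v in expression]
    if len(values) < 2:
        raise ValueError("need expression values for at least two tissues")
    peak = max(values)
    if peak == 0:
        raise ValueError("transcript is not expressed in any tissue")
    # Sum of each tissue's shortfall relative to the peak, scaled to [0, 1].
    return sum(1 - v / peak for v in values) / (len(values) - 1)
```

A transcript detected at equal levels in every tissue scores 0, and one detected in a single tissue scores 1. As the paragraph above cautions, a high score shows only that expression is restricted, not that the transcript is functional.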
lncRNA molecular mechanisms
Despite the difficulties inherent in their experimental investigation (see above), most of the few lncRNAs, including lincRNAs, that have been studied in detail have demonstrated roles in the regulation of gene expression. This regulation can be exerted at any one of a number of different levels, as illustrated in Figure 2.
Individual lincRNAs that aid in the regulation of chromosome-wide gene expression have been identified in both Drosophila and mammals. Specific lincRNAs are involved in dosage compensation which equalizes the dosage of gene expression from the X chromosome between females with two X chromosomes and males with only a single X chromosome. In Drosophila, this is achieved by hypertranscription from the single X chromosome in males, whereas, in mouse, transcription from one X chromosome is mostly inactivated in female cells.
In Drosophila, transcription from the male X chromosome is regulated by the MSL (male-specific lethal) complex, a complex containing several proteins [MSL1, MSL2, MSL3, MOF (Males absent on the first) and MLE (Maleless)] and two lincRNAs [RNA on X1 (roX1) and RNA on X2 (roX2)], which are both transcribed from the X chromosome. Mutant analysis has suggested that the complex binds 30–40 ‘entry’ sites on the X chromosome and then spreads in cis to coat the entire chromosome, leading to H4K16 acetylation at actively transcribed gene loci, a more diffuse chromosome morphology and hypertranscription. roX1 and roX2 are functionally redundant, despite sharing little sequence similarity and displaying distinct embryonic expression profiles. These observations can be reconciled, at least in part, by experiments showing that most of the sequence of these lincRNAs is not required for normal function.
The equivalent mechanism in mice is quite different from that in Drosophila. Eutherian XCI (X chromosome inactivation) is thought to require the coating of the X chromosome by RNA produced from only one lincRNA locus, the 15 kb Xist (X-inactive specific transcript). This lincRNA is transcribed from the inactive X chromosome and spreads strictly in cis to coat the chromosome and prevent transcription. Whether this requires a complex of proteins, as in Drosophila, or whether the act of coating by this lincRNA is sufficient to modify the X chromosome chromatin environment, remains unknown. The sequence of Xist is only 60% conserved across eutherian mammals, but the gene structure is conserved between mouse and humans, with several short well-conserved regions. Xist expression is regulated by an antisense-encoded lincRNA, named Tsix, whose promoter is 13 kb downstream from Xist.
lncRNAs can also regulate the expression of genes that are physically linked on the chromosome, such as those present in imprinted gene clusters. Imprinted genes are expressed from only one allele in a diploid animal, and this expression depends on whether the allele is inherited from the mother or the father. Imprinted genes generally occur in clusters, suggesting that they are regulated as a single domain where lncRNA expression from one allele is often associated with repression of the protein-coding gene on that allele. One such example is the mouse lncRNA Airn, whose locus overlaps the Igf2r gene (Figure 1A). Airn is normally transcribed exclusively from the paternal allele, where it prevents expression of a gene cluster containing Igf2r, Slc22a2 and Slc22a3 from that allele, despite being antisense only to Igf2r. Disrupting the Airn promoter causes Igf2r to be expressed from the paternal as well as the maternal allele. The mechanism causing this remains unclear, although it appears to involve H3K9me3 recruitment at the Slc22a3 promoter. The Airn locus is poorly conserved across related species, suggesting that this type of lncRNA regulation may be lineage-specific.
Cis- compared with trans-regulation
Recently, the question of how lincRNAs regulate individual protein-coding genes has attracted interest from several research groups. lincRNAs are thought to regulate gene expression either in cis, where the lincRNA acts only on the same chromosome from which it is transcribed, or in trans, where the lincRNA can regulate genes located on other chromosomes.
For example, in mouse, the Dlx5 and Dlx6 protein-coding genes are up-regulated in cell culture by a lncRNA known as Evf2 (Figure 1B), whose transcribed locus overlaps an intergenic enhancer lying between Dlx5 and Dlx6. The lncRNA and the two protein-coding transcripts share a similar expression pattern in the ventral forebrain, and Evf2 can drive reporter expression in two neural cell lines in a dose-dependent manner. In mice expressing a truncated Evf2 transcript, however, Dlx5 and Dlx6 appear to be up-regulated, in contrast with what was observed in cell culture, and these mice show a reduced number of GABAergic neurons in the early postnatal hippocampus. This effect on the GABAergic neuron counts is unlikely to be mediated through Dlx5 and Dlx6, suggesting that Evf2 may also possess a trans-acting function.
Attempts have been made to identify examples of potential cis-regulation at a genomic scale. Mouse lincRNAs tend to be found transcribed near to genes annotated with specific Gene Ontology terms and, in particular, to genes also involved in the regulation of transcription [12,42]. The expression of several of these lincRNA-protein-coding gene pairs has been subsequently investigated experimentally by in situ hybridization. They were shown to have similar expression patterns in the developing mouse brain, which suggests that the lincRNA may positively regulate the expression of its adjacent protein-coding gene. Cis-encoded RNAs have also been identified which are transcribed through gene enhancers and which may function in a positive manner to regulate target gene expression. Recent experiments on a set of these eRNAs (enhancer RNAs) in human cells have suggested that the mature RNA molecule can be essential. siRNA (small interfering RNA)-mediated knockdown of seven (of 12 tested) putative eRNAs resulted in a significant disruption of transcription of a genomically proximal protein-coding gene. At the human growth hormone locus, by contrast, it appears to be the act of transcription through the eRNA which is important, because its enhancing effect remains even when the complete RNA sequence is replaced.
Another study has, however, suggested that this type of cis-relationship may have little or no functional relevance, as siRNA constructs targeting a set of 147 lincRNAs in mouse embryonic stem cells revealed mostly trans-acting effects of disrupting the expression of these lincRNAs. This type of experiment, however, targets the mature RNA molecule and so preferentially reveals the trans-acting functions of these lincRNAs. Whether lincRNAs act to regulate protein-coding gene transcription largely in cis or in trans remains a focus of much current research.
A lincRNA has been identified which, rather than regulating the initiation of transcription, negatively regulates elongation of transcription beyond the gene promoter. 7SK, a 331 nt lincRNA, is thought to be involved in this process by sequestering proteins involved in transcriptional elongation, such as P-TEFb (positive transcription elongation factor b) components, away from transcribed sites. Instead, they are stored in restricted domains, known as speckles, within the nucleus. Recently, a possible Drosophila orthologue of 7SK has also been identified, suggesting that this mechanism may be deeply conserved across evolution. The downstream genes that are affected by this remain to be identified.
lincRNAs have also been discovered that are involved in gene regulation beyond the act of their transcription. In D. melanogaster, three non-coding heat-shock response (hsr-ω) transcripts are induced from the 93D locus by heat shock, by CO2 exposure and following the release of ecdysone in third instar larvae. This 10–20 kb locus is functionally conserved in all Drosophilid species. One short transcript is cytoplasmic, whereas the other two remain at the locus from which they are transcribed and within nuclear ‘omega’ speckles thought to be storage sites for RNA-processing proteins. Although they are up-regulated in response to stress, these transcripts appear also to play a housekeeping role in the developing animal, as only 20–25% of trans-heterozygote mutant embryos hatch.
Protein function can also be regulated by lincRNAs, as exemplified by the ncRNA repressor of the nuclear factor of activated T-cells (NRON). This lincRNA contains three exons and is alternatively spliced to produce transcripts varying in length from 800 nt to 3.7 kb. Through its interaction with the nuclear import factor KPNB1, NRON contributes to preventing the protein product of NFAT (nuclear factor of activated T-cells) from entering the nucleus which, in turn, prevents NFAT from promoting transcription of its target genes.
lincRNAs have been implicated in diverse human diseases
lincRNAs are now known to be involved in a diverse array of cellular processes, examples of which were discussed above. This suggests that, when deleterious alleles occur within their loci, an abnormal phenotype may arise. In humans, this might be manifested in disease. Indeed, lincRNAs may be involved in various nervous system diseases through their association with protein-coding genes which are important in diseases such as Fragile X syndrome and Alzheimer's disease (reviewed in ). Several lincRNAs have also been implicated in cancer, which might be expected as their principal role appears to be in regulating gene expression, dysregulation of which is a hallmark of tumorigenesis. In the following section, we discuss two specific loci as examples of disease-associated lncRNAs, ANRIL (CDKN2B antisense RNA 1) and MALAT-1 (metastasis associated in lung adenocarcinoma transcript-1), which are proposed as being involved in the pathogenesis of different, but overlapping, types of cancer (Figure 3).
ANRIL is a 3834 nt RNA molecule made up of 19 exons that is transcribed from within the 9p21.3 locus. This locus is associated with susceptibility to a variety of complex diseases, including coronary artery disease, ischaemic stroke, aortic aneurysm, Type II diabetes, glioma, several carcinomas, malignant melanoma and acute lymphoblastic leukaemia. The 9p21.3 locus also contains two cyclin-dependent kinase inhibitor genes, CDKN2A and CDKN2B.
Among several SNVs (single nucleotide variants) that have been associated with disease, many are more strongly associated with ANRIL than with CDKN2A or CDKN2B expression. ANRIL interacts with at least two protein complexes (Polycomb repressive complex 1 and 2) which negatively regulate the expression of CDKN2A and CDKN2B respectively [53,54]. Furthermore, the regulation of CDKN2B may act in cis, as suggested by the stronger response of an exogenous reporter, relative to the endogenous gene, when inserted near to an antisense construct. It has been speculated that this regulation of other genes by ANRIL may contribute to cellular aging and thereby to its involvement in several seemingly unrelated diseases.
MALAT-1 expression is found in a broad range of tissues and is significantly greater in metastatic NSCLC (non-small cell lung cancer) than in tumours which do not metastasize. High expression of MALAT-1 is also related to a reduced survival rate of patients with stage I NSCLC. Since this initial study, MALAT-1 has been found to be similarly up-regulated in other carcinomas, including breast, pancreas and colon cancer, which suggests that MALAT-1 may have a general importance in carcinogenesis. MALAT-1 transcripts are retained in the nucleus, where they act in trans to influence alternative splicing of hundreds of transcripts by regulating the localization to nuclear speckles of multiple pre-mRNA splicing factors such as the serine/arginine-rich splicing factor SRSF1.
Despite only gaining widespread recognition relatively recently, it is becoming increasingly clear that large numbers of lncRNAs are transcribed from virtually all eukaryotic genomes sequenced to date. Detailed studies of individual loci have revealed that these genes can contribute to a wide range of cellular and molecular phenotypes. Many of the lncRNAs identified to date have been implicated in the regulation of gene expression, yet even the few examples that could be presented in this brief review show the diversity of cellular processes with which lncRNAs have been associated.
Increasing access to relatively low-cost next-generation sequencing will allow identification of large numbers of hitherto unseen lincRNA loci expressed in a variety of tissues (owing to their high tissue-specificity), in many different species and under a variety of physiological conditions. The future challenge of lincRNA biology thus lies less in the initial identification of lincRNAs than in the determination of their relative contributions to different cellular processes and to lineage-specific biology.
• Thousands of lncRNA (long non-coding RNA) loci have already been identified in a variety of species, suggesting that they are important components of metazoan genomes.
• Expressed sequence tag and RNA-Seq analysis can be used to identify lncRNAs as novel transcripts with little or no protein-coding potential.
• Genome-wide evidence for lncRNA function has been provided by analyses of their evolutionary constraint and, to a lesser extent, differential expression profiles.
• Most lincRNAs (long intergenic non-coding RNAs) which have been studied in detail appear to be involved in the regulation of transcript abundance at some level, from chromatin modification to nuclear localization.
• Future work is likely to focus on larger-scale experimental analysis of lncRNA function.
Work of the authors is supported by the UK Medical Research Council and by an ERC Advanced Grant.
- © The Authors Journal compilation © 2013 Biochemical Society