The Human Genome Project achieved the major milestone of producing an almost complete sequence of the approximately 3 billion nucleotides contained in the human haploid genome. Since the publication of the first consensus genomes, advances in sequencing technology have enabled the reporting of in excess of 100,000 complete genomes and the characterisation of millions of genetic variants in millions of individuals. These developments have advanced our understanding of gene function and regulation, have led to the direct-to-consumer use of genetics, and have enabled the discovery of the genetic contributors to thousands of common and rare diseases.
The Human Genome Project has been an international effort from its beginnings and had critical predecessors in the human gene mapping (HGM) meetings, which focused on identifying the chromosomal location of normal and disease-causing genetic variants. The community established by the HGM meetings provided an infrastructure that enabled more comprehensive sequence-based maps to be developed in their wake. The international effort continues, with some countries, including the United Kingdom, conducting population-based studies as part of the International Genome and the 1000 Genomes Projects, whose objective is to apply whole genome or exome sequencing to tens of thousands of individuals. These collaborative international efforts have created the framework for one of the greatest successes of the Human Genome Project: information relevant to the human DNA sequence and its variation is held in public trust, with open access for the scientific community through accessible databases and analytic platforms such as GenBank, which is the National Institutes of Health genetic sequence database.
One major hub for clinicians is the website Online Mendelian Inheritance in Man (OMIM), which contains both clinical and scientific descriptors of genetic findings and is searchable by disease, phenotype, or genetic variant. It is a comprehensive catalogue, updated daily, of now more than 15,000 genes and diseases with some degree of association, including the ever-growing total of more than 4000 validated single gene disorders. The online interactive version, www.ncbi.nlm.nih.gov/omim/, a compendium of human genes and phenotypes, also provides links to multiple other online resources that provide information about genetic variants, including their associations with disease, genetic conservation across species, the differences in variant prevalence between populations of different ancestries, predicted variations in protein structures relative to genetic variations, proteins, and classification of genes in thousands of different biological pathways (http://omim.org/help/external). The Clinical Sequencing Exploratory Research Consortium, funded by the National Human Genome Research Institute and the National Cancer Institute, is among the newest collaborative efforts accumulating whole genome and exome sequences and studying their role within the practice of medicine. The consortium is using multidisciplinary approaches to explore the analytic and clinical validity and utility of sequencing, as well as its ethical, legal, and social implications.
There are 2 technologies that are cost-effective for medical uses and can examine the entire genome with resolution at the level of a single base pair. These are single-nucleotide polymorphism (SNP) arrays (microarrays or chips) and next-generation sequencing. We use both of these technologies at Genomic Medicine UK.
A. Single Nucleotide Polymorphisms
The human genome’s 3.2 billion bases include many that are polymorphic. A polymorphism is a base that differs from the usual one at a defined position and yet occurs with a frequency of >1% within a given population. For any 2 individuals who are not closely related, there are approximately 2.5 × 10^6 SNPs that vary between them (and between each individual and the canonical human reference), or about 1 polymorphism for every 1,000 bases in the genome on average. Within a single ethnic population, there is about 1 common SNP per 3,000-7,000 bases, where common means a greater than 10% chance that any 1 patient will be polymorphic (heterozygous) at that position. A few hundred thousand to a few million of these common SNPs can be included on a DNA hybridisation array and examined simultaneously in a single laboratory test.
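The figures above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the conversion from "10% chance of being heterozygous" to a minor allele frequency assumes Hardy-Weinberg proportions (heterozygote frequency 2p(1-p)), which the text does not state.

```python
import math

GENOME_BASES = 3.2e9      # genome size quoted in the text
SNPS_BETWEEN_TWO = 2.5e6  # SNPs differing between two unrelated individuals (from the text)


def bases_per_snp(genome_bases, snp_count):
    """Average spacing, in bases, between SNPs."""
    return genome_bases / snp_count


def heterozygosity(maf):
    """Expected heterozygote frequency 2p(1-p); assumes Hardy-Weinberg proportions."""
    return 2.0 * maf * (1.0 - maf)


def min_maf_for_heterozygosity(target):
    """Smallest minor allele frequency p with 2p(1-p) equal to target (target <= 0.5)."""
    if not 0.0 <= target <= 0.5:
        raise ValueError("target heterozygosity must lie between 0 and 0.5")
    # Smaller root of 2p^2 - 2p + target = 0
    return (1.0 - math.sqrt(1.0 - 2.0 * target)) / 2.0


spacing = bases_per_snp(GENOME_BASES, SNPS_BETWEEN_TWO)
threshold = min_maf_for_heterozygosity(0.10)
print(f"about 1 SNP per {spacing:.0f} bases")
print(f"a 10% heterozygosity threshold corresponds to a minor allele frequency of about {threshold:.3f}")
```

Under these assumptions the spacing works out to 1,280 bases, consistent with the "about 1 per 1,000" figure, and a SNP counts as common once its minor allele frequency exceeds roughly 5%.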
Single nucleotide polymorphisms (SNPs) were the first commonly characterised contributors to human genetic variation in the molecular age. Prior to that, the only known genetic variations in humans were visible chromosomal variants and amino acid variants found in proteins, and these were limited in number and prohibitively expensive to characterise at a population level. SNPs are specific nucleotide sites in the human genome where it is possible to have two (or even three or four) different nucleotides at a specific position on a chromosome. For example, there might be either a thymine or a guanine at a specific site. These sites in the genome where variations occur are common, with up to 1% of the 3 billion bp of the human DNA sequence being potentially variable between any two individuals, resulting in tens of millions of SNPs across the genome. Most variation is found across all human populations, although some variants appear to be highly population-specific or ancestry-specific. Chip-based DNA genotyping allows more than 1 million SNPs to be genotyped simultaneously in one individual at an affordable cost. Known SNPs are catalogued in the online public-domain resource the Single Nucleotide Polymorphism database (http://www.ncbi.nlm.nih.gov/snp).
Combinations of SNPs are commonly inherited together in the same region of DNA, forming haplotypes. Genome-wide haplotypes can be constructed by linkage disequilibrium (LD) analysis. LD is a statistical measure of the extent to which particular alleles or SNPs at two loci are associated with each other in the population; it occurs when haplotype combinations of alleles or SNPs at different loci occur more frequently than would be expected from a random association. SNPs and alleles of interest are presumably inherited together if they are physically close to each other (usually <50 kilobases [kb]), producing strong LD. Therefore SNPs that are in LD with a disease phenotype or response-to-drug phenotype can mark the position on the chromosome where a susceptibility gene is located, even though the SNP itself may not be the cause of the phenotype. By studying millions of SNPs in hundreds of individuals from geographically diverse populations, the international HapMap consortium created genome-wide maps of haplotypes, which are one of the major open resources for SNP-based genomic analysis.
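The idea of haplotypes occurring "more frequently than would be expected from a random association" can be made concrete with the standard pairwise LD statistics D, |D'| and r^2. The sketch below is a minimal illustration, not the method used by any particular consortium or tool: the function name and the toy haplotype counts are hypothetical, and it assumes two biallelic loci with phased haplotype counts already available.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) for two biallelic loci from phased haplotype counts.

    n_AB is the number of haplotypes carrying allele A at locus 1 and
    allele B at locus 2, and so on for the other three combinations.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")

    p_AB = n_AB / total           # observed frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at locus 2

    variance_term = p_A * (1 - p_A) * p_B * (1 - p_B)
    if variance_term == 0:
        raise ValueError("both loci must be polymorphic")

    # D: observed haplotype frequency minus the frequency expected
    # if alleles at the two loci were associated at random.
    d = p_AB - p_A * p_B

    # Largest magnitude D could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))

    d_prime = abs(d) / d_max      # d_max > 0 whenever both loci are polymorphic
    r_squared = d * d / variance_term
    return d, d_prime, r_squared


# Toy counts: alleles A and B almost always travel together.
d, d_prime, r2 = linkage_disequilibrium(n_AB=48, n_Ab=2, n_aB=2, n_ab=48)
print(f"D = {d:.3f}, |D'| = {d_prime:.3f}, r^2 = {r2:.3f}")
```

Values of r^2 near 1 indicate that genotyping one SNP largely predicts the other, which is why a SNP in strong LD with an untyped causal variant can mark its chromosomal position; values near 0 indicate random association.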
B. Next-Generation Sequencing
Next-generation sequencing (NGS) has been applied to whole-genome sequencing (WGS), whole-exome sequencing (WES), and RNA profiling (RNA-Seq). High-throughput sequencing (i.e., NGS) technologies parallelise the sequencing process across millions of reactions, producing thousands or millions of sequences concurrently. A fundamental feature of NGS is that clonally amplified DNA templates, or single DNA molecules, are sequenced in a massively parallel fashion in a flow cell. Please see the diagram below: