Over twelve years ago, in June 2000, scientists unveiled the Human Genome Project's "working draft" of the human genome. At that stage the sequence still had many gaps: it covered 90% of the genome, contained around 250,000 gaps and included many errors in the nucleotide sequence (International Human Genome Sequencing Consortium, 2001). The importance of sequencing the human genome lay in its potential to drive improvements in several areas, including identifying genes linked to rare and common diseases, predicting an individual's risk of disease and responsiveness to drugs, and developing "made to measure" drugs. Beyond this medical importance, the project also provided insights into human evolution and history (Lander, 2001). Now, more than twelve years later, while there have been some major advances, even Francis Collins, formerly the leader of the publicly funded sequencing effort, recently commented: "the consequences for clinical medicine . . . have thus far been modest . . . the Human Genome Project has not yet directly affected the health care of most individuals" (Collins, 2010). So what have we learned over the past twelve years? What are the implications and applications of this knowledge? And finally, what's next?
Advances in our understanding
Our knowledge of the contents of the human genome in 2000 was still quite limited. Since then there have been major advances. To study how genetic variants contribute to phenotypic diversity, large-scale studies were undertaken to identify and catalogue nucleotides that differ among individuals. Initial studies focused largely on understanding the range of patterns and frequencies of single-nucleotide polymorphisms (SNPs) (Altshuler, 2005; Frazer, 2007). Detailed maps of genetic markers of human variation, mostly SNPs, have helped to associate known SNPs with disease predisposition. Today, the vast majority of human variants with frequency >5% have been discovered, and 95% of heterozygous SNPs in an individual are represented in current databases (Lander, 2010). A further breakthrough was the discovery of the haplotype structure of the human genome: genetic variants in a region are tightly correlated in structures called haplotypes, reflecting linkage disequilibrium, with haplotypes separated by hotspots of recombination (Lander, 2010). These tight correlations suggest that a limited set of around 500,000–1,000,000 SNPs could capture about 90% of the genetic variation in the population. Comprehensive catalogues of variants have been built by organisations and projects including The SNP Consortium (Thorisson, 2003), the International HapMap Project, the Human Genome Diversity Project (HGDP) (Cavalli-Sforza, 2005) and ENCODE (The ENCODE Project Consortium), and this work continues with the 1000 Genomes Project.
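The idea that a small set of "tag" SNPs can stand in for the rest of a haplotype block can be illustrated with a minimal sketch. The function names, thresholds and toy haplotype data below are hypothetical, purely for illustration; real tag-SNP selection (as in HapMap) uses far larger panels and more sophisticated algorithms.

```python
# Minimal sketch: because SNPs within a haplotype block are tightly
# correlated, a greedy pass can pick "tag" SNPs such that every SNP is
# strongly correlated (r^2 >= threshold) with at least one chosen tag.

def r_squared(a, b):
    """Squared Pearson correlation between two 0/1 allele vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov * cov / (var_a * var_b)

def greedy_tag_snps(haplotypes, threshold=0.8):
    """haplotypes: list of equal-length 0/1 lists (one per chromosome).
    Returns a sorted list of tag-SNP indices covering all SNPs."""
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]
    uncovered = set(range(n_snps))
    tags = []
    while uncovered:
        # pick the SNP that "tags" the most still-uncovered SNPs
        best, best_cover = None, set()
        for j in sorted(uncovered):
            cover = {k for k in uncovered
                     if r_squared(columns[j], columns[k]) >= threshold}
            if len(cover) > len(best_cover):
                best, best_cover = j, cover
        tags.append(best)
        uncovered -= best_cover
    return sorted(tags)

# Toy data: two perfectly correlated blocks (SNPs 0-1 and SNPs 2-3),
# so one tag per block suffices.
haps = [
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
print(greedy_tag_snps(haps))  # → [0, 2]
```

Here two tags recover all four SNPs; at genome scale this is why roughly half a million to a million well-chosen SNPs can capture most common variation.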
In 2007 ENCODE (Encyclopaedia of DNA Elements) published its pilot data, showing that nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions (Weinstock, 2007). In 2010, HapMap3 was published (International HapMap3, 2010), which genotyped 1.6 million common SNPs in 1,184 reference individuals from 11 global populations. The 1000 Genomes Project has built on this, its aim being to characterize over 95% of variants that are in genomic regions accessible to current high-throughput sequencing technologies and that have an allele frequency of 1% or higher (The 1000 Genomes Project, 2010 – see Appendix). While sequencing has advanced, it is still expensive to "deeply sequence" all samples; the 1000 Genomes Project therefore uses "light" sequencing of 2,500 individuals, which provides efficient detection of most of the variants in a region. The pilot study, published in 2010, described the location, allele frequency and local haplotype structure of around 15 million SNPs, 1 million short insertions and deletions, and 20,000 structural variants. The authors concluded that over 95% of the currently accessible variants found in any individual are present in this data set and that, on average, each person carries approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. In 2012, the Project went on to describe the genomes of 1,092 individuals from 14 populations, providing a validated haplotype map of 38 million SNPs, 1.4 million short insertions and deletions, and more than 14,000 larger deletions (The 1000 Genomes Project Consortium, 2012). A key finding has been the demonstration that some of the rarest DNA variants tend to cluster in relatively restricted geographic areas.
Data from the 1000 Genomes Project are already being widely used to screen variants discovered in exome data from individuals with genetic disorders and in cancer genome projects (Cancer Genome Atlas Research Network, 2011).
Implications and Applications
Advances in genomics have brought significant changes in medicine. When the HGP was launched, fewer than 100 disease genes had been identified (Lander, 2010). Today, studies have identified more than 2,850 genes underlying Mendelian disorders, more than 1,100 loci affecting common polygenic disorders and more than 150 targets of somatic mutation in cancer (Lander, 2010). There are now a number of genetic tests that can predict disease risk, severity or risk of recurrence, as well as pharmacogenomic tests to predict response to drugs.
The field of cancer research has been a clear beneficiary. In 2000 there were around 80 cancer genes implicated in solid tumors (Lander, 2010); by the end of the decade over 230 had been identified (Lander, 2010). For example, BRAF mutations have been shown to occur in >50% of melanomas (Davies, 2002), PIK3CA mutations have been discovered in >25% of colorectal cancers (Samuels, 2004), and EGFR mutations in 10–15% of lung cancers predict responsiveness to gefitinib and erlotinib, drugs that had previously shown only limited overall efficacy (Paez, 2004). In colorectal cancer, KRAS mutation has been shown to preclude efficacy of treatment with EGFR antibodies, and KRAS status determination is now mandatory before treatment (ESMO, 2012).
Data from catalogues of genetic variants are now being used in genome-wide association studies (GWAS) to identify specific disease risk loci, uncover underlying cellular pathways and suggest new therapeutic approaches. These studies compare the frequency of variants in case and control populations and have shown that most traits can be influenced by a large number of loci, and that most of the common variants at these loci have only a moderate effect. Association studies have identified more than a thousand genomic regions associated with disease susceptibility and other common traits (Hindorff et al., 2010). For example, 39 loci have been associated with type 2 diabetes.
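The core comparison in a GWAS can be sketched for a single SNP. The function name and allele counts below are hypothetical, for illustration only; real studies test millions of SNPs, correct for multiple testing and population structure, and use dedicated statistical software.

```python
# Minimal sketch of an allelic association test: compare allele counts
# in cases versus controls for one SNP using a 2x2 Pearson chi-square.

def chi_square_2x2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic for a 2x2 allele-count table:
    rows = cases/controls, columns = alternate/reference allele."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_totals = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts: a risk allele seen in 60% of 200 case alleles
# but only 40% of 200 control alleles.
print(chi_square_2x2(120, 80, 80, 120))  # → 16.0
```

A statistic this large (16.0 on 1 degree of freedom) corresponds to a very small p-value, which is the kind of signal GWAS look for, except that genome-wide significance thresholds are far stricter because of the number of SNPs tested.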
One of the key areas where genomics has added value, though, is the development of predictive or diagnostic tests. The Centers for Disease Control and Prevention (CDC) found that over 300 genomic tests had been introduced into clinical practice since 2009 (Gwinn, 2011). Tests are now available in clinical practice for nearly 3,000 genetic disorders (Gene Tests Website, 2012). Many are marketed directly to consumers and can inform them about preventive interventions, such as dietary change and physical activity. Pharmacogenetic association studies have also helped to identify genetic factors underlying responsiveness to particular drugs, for example hypersensitivity to the antiretroviral drug abacavir and the myopathies induced by cholesterol-lowering drugs (Lander, 2010). In clinical trials, stratification of patients based on responsiveness to targeted therapeutics could help improve the overall efficacy of drug discovery and allow the identification of individuals genetically susceptible to adverse reactions (Caskey, 2010).
Ultimately, genomics offers the promise of personalised medicine, whereby an individual’s genetic profile could be used to identify risks and direct approaches to prevent disease, or facilitate the prescribing of medicines that suit a person’s specific needs.
Advances in Scientific Techniques
Over the last decade the drive to understand the human genome has stimulated significant progress in scientific methods and techniques, what Varmus calls the "engines of genomics" (Varmus, 2010). The HGP used essentially the same sequencing method introduced by Sanger in 1977. New methods have since increased rates of DNA sequencing by at least five orders of magnitude. One key advance was the development of genotyping arrays (SNP chips), enabling millions of variants to be assayed simultaneously (Lander, 2010). With today’s so-called ‘massively parallel’ sequencing technology, the per-base cost of DNA sequencing has plummeted 100,000-fold, and the current generation of machines can read over 250 billion bases in a week (Lander, 2010).
The massive scale of data elicited has also necessitated major changes in data handling and analysis (bioinformatics) and computational power.
Figure: Changes in instrument capacity over the past decade, and the timing of major sequencing projects (Source: Mardis, 2010).
Understanding of Human History
Genomic data have also contributed significantly to our understanding of human evolution. Population geneticists, through the International HapMap Project, have catalogued common variants in European, East Asian and African genomes. The data show that the migration of humans out of Africa was more complicated than previously thought, and that human history involved not just successive population splits but also frequent mixing (Lander, 2010). Work has also examined differences from our closest relatives; for example, genome analysis has shown that modern humans interbred with Neanderthals, and that Europeans and Asians have inherited 1–4% of their genome from Neanderthals (Green, 2010).
Despite the advances discussed above, a number of commentators argue that the benefits of the Human Genome Project have not yet lived up to its full promise.
Many "common diseases" are more complex than initially thought, caused by large numbers of rare and unique variants that make one-size-fits-all therapeutic approaches almost impossible. For example, of the 850 sites in the human genome implicated in common diseases, most are found near gene coding regions rather than within them (Lancet, 2010). Additionally, common disease variants explain only a fraction of the genetic risk of disease, and much work is still needed to identify strongly influential haplotypes.
In some cases our understanding may be limited not by the amount of data but by the computational challenges of data analysis, display and integration. New bioinformatics and computational approaches are still required to handle the volume and complexity of the data.
The translation of our improved scientific understanding into the clinical setting has also progressed more slowly than hoped. For example, the identification of mutations predictive of drug responsiveness in the treatment of lung cancer was slow to enter the clinic (Schully et al., 2011).
In reality, the hope of truly "personalised medicine" has so far failed to materialize. Although it may now be feasible and cost effective to sequence an individual’s full genome, the data are still difficult to interpret and to translate into personalised therapy. This will require many more advances in the identification of expression pathways and the development of targeted therapeutics.
Advances in science have also clashed with the worlds of business, law, regulation, ethics and health insurance. For example, there have been disputes over the ownership or patentability of gene sequences, and concerns that an individual’s data could be used to discriminate against them, for example in employment or insurance. The direct-to-consumer marketing of genotypes as markers of disease risk is a major advance, but with the lack of regulation and of external standards for accuracy, these potential breakthroughs also have the potential to mislead customers (Varmus, 2010).
More work is needed to build data sets that help us understand all of the functional elements encoded in the human genome and the underlying regulatory interactions. Enhanced reference catalogues, such as those being built by the 1000 Genomes Project, will enable better associations with specific phenotypes. Technology will continue to advance and may become so simple and inexpensive that it can be used routinely in the clinic. This should help to identify the >1,800 uncloned disorders in the current catalogue and to recognize previously undescribed genetic disorders in patients with unexplained congenital conditions. Ultimately, applications could include the detection of heterozygous carriers (for prenatal counseling), the characterisation of patients’ germline genomes (to detect strongly predictive mutations), the identification of causes of disease of unknown aetiology, and the targeting of specific treatments. Gene therapy could also become commonplace, with the replacement of faulty copies of genes with normal ones becoming a reality.
The Human Genome Project and subsequent work have led to breakthroughs in our understanding of the fundamental nature of the human genome and to major advances in health. However, there is still much work to be done. Studies such as the 1000 Genomes Project take us closer to a complete description of the human genome and will allow more accurate characterisation of disease-associated variants and of the role of inherited DNA variation in human history, evolution and disease. In February 2011, more than 10 years after a draft sequence of the human genome was published, the National Human Genome Research Institute announced its new strategic plan for genomic medicine, from "base pairs to bedside".
Connecting all of the different strands into a comprehensive view that translates into improved health outcomes may, however, require something like an international ‘One Million Genomes Project’. Achieving these goals will rely on new technologies, collaboration across multidisciplinary and international teams, high-throughput data production and analysis, new computational approaches, and attention to societal implications.