NHS Digital Data Release Register - reformatted

Genomics England

🚩 Genomics England received multiple files from the same dataset, in the same month, both with optouts respected and with optouts ignored. Genomics England may not have compared the two datasets, but the identifiers are consistent between datasets for the same recipient, and NHS Digital does not know what their recipients actually do.
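Why consistent identifiers matter can be shown with a minimal sketch. It assumes, purely for illustration, that each release is a list of pseudonymised record identifiers and that the same patient carries the same identifier in both releases. The identifiers and the function name are hypothetical and do not describe anything a recipient is known to have done.

```python
def records_only_in_unfiltered(optouts_respected, optouts_ignored):
    """Return identifiers that appear only in the release that ignored opt-outs."""
    respected = set(optouts_respected)
    return sorted(pid for pid in set(optouts_ignored) if pid not in respected)


# Hypothetical identifiers, consistent across both releases to one recipient.
release_with_optouts_respected = ["P001", "P002", "P004"]
release_with_optouts_ignored = ["P001", "P002", "P003", "P004", "P005"]

likely_opted_out = records_only_in_unfiltered(
    release_with_optouts_respected, release_with_optouts_ignored
)
print(likely_opted_out)  # ['P003', 'P005']
```

Under these assumptions a single set difference is enough to single out the records of people who opted out, which is why receiving both versions in the same month is flagged.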

Project 1 — DARS-NIC-374190-D0N1M

Opt outs honoured: Yes - patient objections upheld (Statutory exemption to flow confidential data without consent)

Sensitive: Sensitive

When: 2020/07 — 2020/07.

Repeats: Ongoing, One-Off

Legal basis: COPI Regs 2020

Categories: Identifiable

Datasets:

  • Secondary Uses Service Payment By Results Spells
  • Bridge file: Hospital Episode Statistics to Mental Health Minimum Data Set

Objectives:

This agreement is seeking approval to request data to support The GenOMICC - COVID Genomics UK (CoG-UK) partnership in researching Whole genome sequencing of patients severely affected by COVID-19. The work programme has sign off and prioritisation from the Chief Medical Officer for England. The goals of this work programme are set out below:

  1. To harness world-leading UK healthcare and genomic infrastructure and systems to undertake prospective host whole genome sequencing at scale. This will elucidate the genetic architecture of host response to SARS-CoV-2 and identify opportunities to improve outcomes in the current pandemic, via international collaboration.
  2. To identify rare and common variants that may affect susceptibility to response, identify novel opportunities for intervention and accelerate recovery.
  3. To collect longitudinal life course datasets from primary care, hospital episodes, intensive care registries and outcomes via an extant partnership with NHS Digital and Health Data Research UK. We will include deep immune “omic” datasets on a subset of patients. This will allow case-control studies that capitalise upon unique UK assets, such as the 100,000 Genomes Project (97,000 people) and UK Biobank data sets (500,000 people) with WGS where more cases may be identified, including those with milder disease and unaffected people.
  4. To use these rich data sets to understand the premorbid, concurrent and consequent sequelae of COVID-19 infection.
  5. In partnership with the CoG-UK Viral Programme to evaluate the combination of viral and host genomics on outcomes to give pre-emptive insights into subsequent outbreaks and potentially future pandemics.
  6. To provide access to these data sets via the Genomics England Trusted Research Environment to international and national academia and industry and facilitate international collaboration on COVID-19.
  7. To link this to national COVID-19 clinical trials infrastructure offering potential for genomics to add value with insights into precision medicine and building a global-leading knowledgebase to enable better UK-wide and international capacity for future pandemic preparedness.
  8. To engage and involve public and patients in setting strategy and priorities that shape the programme and its outputs. This will initially be based upon the 100,000 Genomes Project Participant Panel.

The prospective GenOMICC CoG-UK study: The variable response to COVID-19 suggests that, as with susceptibility to other infections, critical illness and mortality from COVID-19 may be determined by host genetic factors. From the 100,000 Genomes Project and the NIHR BioResource for Rare Disease it is known that rare variants cause immunodeficiency. By undertaking a prospective study design that leverages existing recruitment infrastructure in critical care, together with Genomics England, NHS England, Devolved Nations and PHE infrastructure, this research will be able to apply the most advanced genomic testing to those most severely affected people admitted to hospital or intensive care.

The retrospective GenOMICC CoG-UK study: The retrospective cohorts offer control arms for the study but also new case finding potential, particularly for people who have a milder clinical case. The study proposes to harness the potential of two key national data assets.

Firstly, analysis of the 100,000 Genomes Project data set which provides the genome sequences of 97,000 participants where they can use their longitudinal life course datasets to identify those affected by COVID-19, as well as providing appropriate unaffected controls. This dataset includes 627 families with rare immunodeficiency syndromes, which may allow insights to be accelerated because of co-existence of rare variants.
Secondly, the UK Biobank cohort will provide 120,000 WGS this year, building to 500,000 WGS over the next 18 months from people, currently aged circa 55-85 years old, which is skewed towards the at-risk age groups for COVID-19 but will provide additional cases and controls, including mildly affected individuals. Over the next five years these datasets will be enriched further by consented individuals from the NHS Genomic Medicine Service, and the new Genomics England programmes proposed in their strategic plan under consideration by Government. Furthermore, the Accelerating Detection of Disease Cohort will enrol up to 5 million people with genome-wide variants available with longitudinal life course data sets that could particularly add value to host response-variant associations and polygenic risk scores. Genomics England Background: Genomics England was established by the Department of Health to deliver the 100,000 Genomes Project. This followed the announcement in December 2012 by the Prime Minister of a programme of whole genome sequencing (WGS) as part of the UK Government’s Life Sciences Strategy. The principal objective of the 100,000 Genomes Project was to sequence 100,000 genomes from participants with cancer and rare disorders, and to link the sequence data to a standardised, extensible account of diagnosis, treatment, and outcomes gathered at recruitment, but primarily through the ongoing collection of medical records. Data will be released into the Genomics England Trusted Research Environment where it will be linked with associated clinical data as well as additional data sources, which will include NHS Digital data requested through this agreement, COVID-19 testing feeds and viral genomics from Public Health England and data feeds from the Intensive Care National Audit Registry (ICNARC). 
The Department of Health and Social Care have granted approval for Genomics England to procure a new, rapidly deployable Research Environment for the COVID-19 programme from existing core funding, which will provide a secure and collaborative workspace that enables researchers to perform COVID-19 genomic data analysis. The new Research Environment (COVID-RE) will provide an intuitive, integrated and collaborative user experience that enables effective COVID-19 research outcomes across a wide range of academic and Biotech/Pharma researchers with varying levels of technical competency. This offers a major upgrade to our current environment and will comprise user-centric, contemporary bioinformatic workflows, support open-source tooling, and enable shared workspaces between Genomics England and Partners. COVID-RE must serve the immediate COVID-19 research effort and may also advance the transformation of Genomics England’s platform infrastructure, which is a key enabler to the research community.

There are two key value streams provided by the COVID-RE. Firstly, the ‘raw data to analytics-ready data’ stream, which must permit the collection of data from multiple locations, in unstructured, semi-structured and fully structured forms, and transform them into the appropriate data model and data store based on their ongoing use and availability. Secondly, the ‘discovery to insight’ stream, which supports researchers by providing the capability and the framework for their user journey, from understanding the data available to them through to providing the necessary analytics and publishing tools.

The newly established COVID-RE will provide the following:

  • Seamless interface with cloud storage and compute capabilities.
  • A unified data platform containing datastores appropriate for all the required clinical and genomic data types.
  • Data integration capability to deliver analytics-ready datasets into domain-specific data stores across batch and streaming integration patterns.
  • Standards-based access services providing secure, fast, flexible, robust and auditable access to TRE data assets.
  • Applications to support the research user journey from an exploration of data sets, through cohort building, analysis and publication - providing tools appropriate to a variety of user requirements (ways of working) and programming competency.
  • Applications and workflows can be delivered natively within the platform; however, COVID-RE must enable access to container-based applications and, via a set of hardened APIs, to a relevant external service (as long as security is maintained).

Details of the sublicence model via the Genomics England Trusted Research Environment are supplied within the processing activities section of this agreement.

The additional clinical data is key for researchers to be able to understand and infer clinically relevant and actionable findings. The aim is to provide a high quality, diverse clinical dataset, detailing each participant’s journey and to understand pre-existing conditions and also the early behaviours in the disease course. Genomics England currently have an agreement to receive NHS Digital data for the extant 100,000 Genomes Project participants (DARS-NIC-12784) and have seen first-hand the depth and quality of the data and how it has aided researchers. To that end, the significant gap in the current data collection for the COVID-19 project can be addressed by the non-standard NHS Digital data feed. The SUS APC feed would provide researchers with data that are as close to real time as possible, which in the current climate is vital to avoid delays. The associated COVID-19 data-sets (including NHS111 and CV19 Testing Data in particular, which will be requested under a future version of this agreement) will add detail to the patient journey, flagging, for instance, how participants were monitoring their symptoms and whether there was an associated poor prognosis.
The group of survivors eligible for recruitment for this study are generally healthy individuals who have suffered critical illness. It is anticipated this cohort will grow to approximately 36,000 participants. The data would be available for analysis alongside the extant Genomics England data set of 100,000 Genomes Project participants and would be made available to approved researchers worldwide as per existing governance procedures. This 100,000 Genomes Project data set will act as an appropriate matched control group. Data on the 100,000 Genomes Cohort will be provided to NHS Digital for linkage (as they will act as a control cohort), as well as new additions for the GenOMICC Study.

The detail below covers the background of the study, and the section on susceptibility to infection sets out why Genomics England are investigating. The GenOMICC study originally focused on smaller cohorts of participants. However, in collaboration with the CoG-UK group, this is now expanded to whole genome sequencing of approximately 36,000 affected individuals, as set out above.

GenOMICC Study Background: Susceptibility to infection is profoundly heritable (Sorensen et al. 1988). Patients who develop life-threatening illness following infection with usually innocuous pathogens, such as influenza (Miller et al. 2010), are genetically different from the rest of the population (Albright et al. 2008). Understanding the genetic mechanisms of susceptibility may yield new therapeutic targets (Baillie 2014) that can be used to make susceptible patients more similar to individuals who are resistant to, or tolerant of, specific pathogens. The genetic mechanisms of susceptibility to infection are likely to be highly pathogen-specific and may even have opposing roles in different infections (as for CCR5 variants in HIV (Huang et al. 1996) and WNV (Glass et al. 2006) infection). Pathogen-specific interventions (e.g.
small molecules to inhibit an enzyme or receptor that is dysfunctional in resistant individuals) would therefore be protective to the host in a similar way to antibiotics, with the advantage that it is conceptually more difficult for any one pathogen to evolve resistance to such a therapy. A second, more challenging problem arises in patients who become critically ill following infection. The patterns of immune-mediated organ dysfunction, immunoparesis, and death are very similar in severe infections and sterile systemic injuries (such as burns, haemorrhage, pancreatitis and trauma). Ultimately, death is a consequence of the host response to injury (Angus and Poll 2013), through final common pathways of organ failure that are clinically and biochemically evident, and unrelated to the original precipitant. Broadly, the severity of critical illness follows directly from the severity and duration of the initial insult. In bacterial sepsis, early antibiotics are the mainstay of therapy; in influenza, early antivirals; in haemorrhage, early resuscitation; in trauma, urgent action to prevent secondary injury. There are no therapies with which to modulate the host response to systemic injury. There is a lack of direct evidence of heritability for outcomes of critical illness, due in part to difficulties in defining and quantifying the heterogeneous multi-organ dysfunction syndrome (MODS), and in part due to the rapid pace of change in critical care medicine, making it impossible to tackle this question in long term outcome studies. However, clinical and biological evidence support the hypothesis that the pathogenesis of MODS is immune in origin (Angus and Poll 2013). Hence, predictions can be made from the extensive knowledge of other immune conditions. Whether or not MODS is considered to be an autoimmune or infectious condition is moot: these conditions share a great deal of similarity in genetic predispositions, cell types and mechanisms of pathogenesis. 
It is therefore very likely that propensity to survive MODS has a heritable component, and there is some direct evidence in support of this hypothesis (Rautanen et al. 2015). If this is the case, then the identity of the specific variants that contribute to outcome could potentially be utilised to design therapies to promote survival after the onset of MODS. This study aims to identify genetic predisposition to specific syndromes of critical illness. Specifically, susceptibility to life-threatening infections caused by an identified pathogen, and susceptibility to death following the onset of organ failure due to sepsis or sterile injury. In order to maximise the probability of identifying host genetic loci associated with susceptibility, Genomics England will restrict some analyses to younger individuals in good general health and lacking in known predisposing factors. The same principle was used to determine an upper age limit for inclusion for some analyses. With advancing age, there is an increase in undiagnosed co-morbidity, frailty, and susceptibility to serious complications of infection or critical injury. There is therefore an increase in the probability of susceptibility to, and mortality from, critical illness that is consequent upon non-genetic factors. Withdrawal: Participants are freely able to decline participation in this study or to withdraw from participation at any point without suffering any implied or explicit disadvantage. All patients will be treated according to standard practice regardless of whether they participate. The following options of withdrawal will be made available to participants: 1. Partial withdrawal. Data WILL continue to be updated and used for research, but no further contact will be made with the participant 2. Full withdrawal. 
  • no further contact will be made with the participant;
  • data will not be updated from health records;
  • data will not be removed from research that is underway or has already been done, and an audit record will be maintained to confirm participation.

Consent version and lifecourse follow-up: The first COVID positive patient was recruited to the GenOMICC study in March 2020. All patients (c.1500) currently recruited to the GenOMICC study are on consent and Protocol version 1.08 (which allows for COVID research but not longitudinal life course follow up). Protocol v2.1 (submitted as part of this DARS request) was REC approved on 23rd March 2020; IRAS IDs are: 269326 & 189676 (https://www.hra.nhs.uk/covid-19-research/approved-covid-19-research/269326/). Genomics England will attempt to reconsent all patients recruited under version 1.08 onto the newly amended materials (version 2.1, submitted as part of this DARS application), which allow for longitudinal lifecourse follow-up. For any patients on the v1.08 Protocol for whom Genomics England are not able to seek consent, Genomics England will not be requesting their data for linkage. For patients prospectively recruited onto the v2.1 Protocol and consent materials, Genomics England will be requesting their data for linkage and follow-up.

Datasets requested from NHS Digital: Consideration has been given to assessing and ensuring that access to the data sets requested is in line with the COVID-19 related purpose, in terms of the restrictions set out in Reg 3(1) COPI. Below are details of the insight each of the data sets will provide at a high level.

Hospital Episode Statistics: Outpatients, Admitted Patient Care, Critical Care and Accident & Emergency/ECDS. These datasets provide the core clinical data for participants and are vital to the provision of a detailed medical history for participants.

Diagnostic Imaging Dataset. This provides invaluable, detailed information to build on participants' phenotypes, e.g.
tumour size and spread in cancer, adding to the understanding of patients' histories on individual and cohort level and their relationship with genomic alterations.

Secondary Uses Service datasets. The minimal latency in availability of these datasets is highly desirable for the research objectives set out in this project.

Mental Health Data sets: The 100,000 Genomes Project includes recruitment of participants with psychiatric diseases and others with mental health phenotypes: intellectual disability and seizures are some of the most prevalent conditions within the Project. To date nearly 10% of project participants have a mental health record. Mental health data are therefore vital in ensuring that a complete and relevant medical history is available for all participants.

Cancer Registration Data sets: To see the incidence of cancer within the cohort.

Mortality data are essential for performing survival analyses and as a metric for success of medical care: this is crucial information for research in combination with other medical history. Cause of death information is vital in order to determine if mortality is related to the primary disease of a participant or to highlight unforeseen trends. Knowledge of participant death is also vital for the correct analysis of medical timeline data and for the management of participant cohorts.

COVID datasets: These datasets, including SGSS (Second Generation Surveillance System Data Set) and CHESS (COVID-19 Hospitalization in England Surveillance System), will be crucial to identify early prognostic features in those affected with Coronavirus.

Assessment of the datasets has been undertaken and NHS Digital are satisfied that they are necessary for the COVID-19 work being undertaken.
Genomics England have confirmed that all research which is approved from the GenOMICC study using the data for the COVID-19 specific purposes will be published at https://www.genomicsengland.co.uk/about-gecip/research-2/

Genomics England Industry access: Genomics England works with industry through its Discovery Forum. The Forum provides a platform for collaboration and engagement between Genomics England, industry partners, academia, the NHS and the wider UK genomics landscape. Industry partners comprise pharmaceutical, biotech and diagnostic companies, and those specialising in laboratory and data analysis. These companies have joined the Forum to work in a pre-competitive environment with access to a selection of genomic and associated clinical data. Ultimately, the Discovery Forum aims to help turn research findings into treatments, diagnostics and benefits for patients as soon as possible. As the Discovery Forum is a collaborative venture, no fees are levied on participating organisations to access the COVID data; however, they are charged based on storage and compute, such as running their own bioinformatics pipeline. All members of the Forum are obliged to publish all findings and research at the point at which intellectual property for any product is protected. Participants in the 100,000 Genomes Project have been asked explicitly to give consent for commercial companies to access their de-identified genome and health data.

The Forum was created in July 2017 and allows industrial partners to report back to Genomics England on what aspects of the data are proving to be most useful to their research studies, what data is missing and how the data should be collected and developed further so that it captures what industry needs, in a format that is compatible with their research and data systems.
These partners act as a 'critical friend' and have made many helpful suggestions to increase the likelihood of successful research in the future for all those using Genomics England's landmark data set.

The lawful basis for processing Participant Data under the General Data Protection Regulation (GDPR) used by Genomics England is legitimate interests as set out under Article 6(1)(f) of the GDPR. It is necessary for Genomics England to process Participant Data for its legitimate interests in carrying out medical research and in providing reports used by clinicians in their care of Participants. The processing is necessary to support and enable Genomics England's legitimate interests in enabling new medical research on using genomics in health care, and on the causes, diagnosis and treatment of COVID-19.

Patients and the public will be at the heart of this programme. Initially the researchers will involve the extant 35-strong Genomics England Participant Panel and then we will add others who have been affected by COVID-19 at a later point. These participants and members of the public will be represented on all committees and working groups and will also meet separately.

The beneficiaries are:

  • Participants: the work Genomics England do will ultimately influence their care;
  • researchers and industry: by giving them access to a unique ground-breaking resource of genomic data combined with life-course clinical data;
  • the wider public: by accelerating the uptake of genomic medicine, making it available to patients in the UK.
The lawful basis for the release and use of the confidential data being shared under this version of the agreement is Regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002 (COPI) to require NHS Digital to share confidential patient information with organisations entitled to process this under COPI for COVID-19 purposes. The application of this has been based on the information provided in the "Whole genome sequencing of patients severely affected by COVID-19" funding proposal from The GenOMICC - COVID Genomics UK (CoG-UK) partnership, which was supported by the CMO of England and the CFO of the DHSC.

Expected Benefits:

Access to the data will enable the research to discover new rare and common variants alongside new multi-omic biomarkers that underpin host response to infection, allow investigation of the impact of viral genomic features on outcomes and allow creation of a polygenic risk score, which may detect risk of severe response to similar viruses. The prospective component could allow nested clinical trials or case-control resources to add value to this study by detecting variants which stratify response or predict outcomes. Although the 100,000 Genomes Project and the Genomic Medicine Service may include participants biased to specific disease ascertainment, the scale of these resources and the presence of parents helps compensate for this problem.

Specifically, the short term (6 months) and medium term benefits and outcomes anticipated from this programme of research are:

  • Variants enable Polygenic Risk Score to predict greatest risk and avoid ITU
  • Pre-morbid clinical conditions or biomarkers of risk and rapid NHS uptake to avoid ITU
  • Identify novel therapies or precision interventions for rapid national trials
  • Longitudinal life course sequelae of COVID-19 for pandemic planning
  • Patient benefit:
      o Providing improved clinical understanding of disease progression in COVID-19
      o Correlation to disease progression and pre-morbid status
      o Identification of susceptibility genes
      o Develop a biomarker test(s) to predict an individual’s response to SARS-CoV-2 exposure, considering both COVID-19 severity and vulnerability to infection.
      o Identify targets that can be used to inform development of new treatments
  • New scientific insights and discovery:
      o With the consent of patients, creating a database of 35,000 whole genome sequences linked to continually updated long term patient health and personal information for analysis by researchers.
      o Correlation of host and viral genomic data
      o Potential to provide improved testing for future pandemics
      o Aid researchers to identify novel targets for vaccines and therapy
      o Identification of highly penetrant rare variants in genes and pathways relating to viral susceptibility or immunodeficiency.
      o Genome wide association studies (GWAS) using common variants to identify genes and pathways associated with viral response. These analyses will be aligned with other COVID-19 research consortia.
      o Rare variant burden analysis to identify genes and pathways enriched in rare variants associated with viral response
  • Accelerating the uptake of genomic medicine in the NHS: working with NHSE and other partners to deliver a scalable WGS and informatics platform to enable these services to be made widely available for NHS patients. WGS could potentially provide the most accurate diagnostic test for COVID-19.
  • Stimulating and enhancing UK industry and investment: by providing access to this unique data resource by industry for the purpose of developing new knowledge, methods of analysis, medicines, diagnostics and devices.
  • Increasing public knowledge and support for genomic medicine: delivering an ethical and transparent programme which has public trust and confidence and working with a range of partners to increase knowledge of genomics.

Yielded Benefits:

The 100,000 Genomes Project has been hugely successful and provided numerous academic and clinical publications and discoveries. The success of the project has been based on the strength of the clinical data provided by NHS Digital. Understanding this significant value is why Genomics England are so keen to add NHS Digital data to its clinical data source for the GenOMICC study. The GenOMICC study is very much in its infancy and whole genome sequencing has only begun in the last month. It is thus too early to demonstrate any significant outcomes.
These outcomes will only gain power and relevance with more prospectively recruited patients and a breadth and depth of clinical data. By providing this to researchers, Genomics England can provide them the necessary tools to explore the genomic data.

Outputs:

Genomics England completed sequencing 100,000 genomes at the end of 2018 (https://www.newscientist.com/article/2187499-uk-dna-project-hits-major-milestone-with-100000-genomes-sequenced/). During 2018, the Genomics England Research Environment was established to allow research access to de-identified genomic and clinical data received from NHS Digital. Thirty disease and cross-cutting GeCIP research domains were requested and approved, with now over 3000 GeCIP members given access to the Research Environment. Genomics England has also created the industry Discovery Forum to provide a platform for collaboration and engagement between Genomics England, industry partners, academia, the NHS and the wider UK genomics landscape. Although the 100,000 Genomes Project has completed recruitment, Genomics England is committed to continue gathering life-long clinical data from the participants and making these available in the Research Environment.

Genomics England will be responsible for the onward workflow, in partnership with Illumina for the delivery of 30X whole genome sequences, subject to passing appropriate sequence QC, into the Genomics England data centre. Alignment and variant calling will be performed alongside the potential application of bespoke immunodeficiency panels as part of the Genomics England bioinformatics pipeline analysis. Genomic data will be released into the Genomics England Trusted Research Environment where it will be linked with associated clinical data.

The GenOMICC study is backed by £28 million from Genomics England, UK Research and Innovation, the Department of Health and Social Care and the National Institute for Health Research. Illumina will sequence all 35,000 genomes and share some of the cost via an in-kind contribution.
A press release on 13/05/20 included a comment from Health and Social Care Secretary Matt Hancock:

“As each day passes, we are learning more about this virus, and understanding how genetic makeup may influence how people react to it is a critical piece of the jigsaw.

“This is a ground-breaking and far-reaching study which will harness the UK’s world-leading genomics science to improve treatments and ultimately save lives across the world.”

To date, nearly 3000 patients have been recruited into the project.

As of March 2020, the Genomics England Research Environment contained 107,694 genomes, of which 33,461 were cancer and 74,233 were rare diseases. The Research Environment also contained clinical data on 89,157 participants (this is because cancer participants have two genomes submitted). The clinical data for 17,246 cancer participants includes clinical data from NHS Digital (HES OP/APC/CC and AE) but also cancer specific data from Public Health England Cancer Registry (NCRAS). The combination of clinical data for all 100,000 participants totals about 5m records.

As the GenOMICC study prospectively recruits participants, the aim will be to use the existing 100,000 participants as age and match-ranked controls for those entered into the study. As Genomics England prospectively enrols more participants into the study, Genomics England plans further releases of genomic and clinical data, including clinical data received from NHS Digital and viral and host genomic data, into the Research Environment on the following dates, in order to continue support for, and to further develop, this ground-breaking resource:

  • 3rd August 2020
  • 7th September 2020
  • 5th October 2020
  • 2nd November 2020

Specific outputs over the period of this agreement are therefore to release updated genomic and clinical data for the 100,000 Genomes Project participants and GenOMICC participants into the Research Environment on the dates shown above.

Processing:

All organisations party to this agreement must comply with the Data Sharing Framework Contract requirements, including those regarding the use (and purposes of that use) by "Personnel" (as defined within the Data Sharing Framework Contract i.e.: employees, agents and contractors of the Data Recipient who may have access to that data). Genomics England provide NHS Digital with a cohort for linkage and they receive data from NHS Digital on a monthly basis. Every month, Genomics England provide an updated cohort to NHS Digital who provide the historical data for the extra cohort members. The cohort is already flagged with NHS Digital so Genomics England will only receive the historical data for the extra cohort members each month. The first stage of processing focuses on quality verification. This ensures that the data set is complete, accurate and complies with the NHS data dictionary or relevant specification. Participant identifiers in the dataset are verified against Genomics England's participant details and any updates required to identifiable data fields, e.g. dates of birth, are highlighted. Finally, the data set is reviewed against recent participant withdrawals so that any withdrawals notified after the data application was made can be removed from the data sets. Following this the data are de-identified, as all subsequent processing can be performed without direct identifiers. Genomics England has compiled lists of identifiable and sensitive fields for each data set in line with details provided by NHS Digital and following internal review of data sets. De-identification is a key facet of the Genomics England resource. De-identified data are uploaded to a secure research environment hosted by Genomics England on a monthly basis, where they are linked to participant genomes and primary clinical data. The second stage of processing involves the selection of a de-identified cohort of participants that fulfil a specific research request. 
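The monthly cohort update and the first processing stage described above (removing late withdrawals, then de-identifying) can be sketched roughly as follows. This is an illustration under stated assumptions, not Genomics England's pipeline: the field names, the identifiers and all three helper functions are hypothetical, and the real lists of identifiable and sensitive fields are compiled per data set and are not given in this entry.

```python
# Hypothetical direct identifiers; the real lists are compiled per data set.
IDENTIFIABLE_FIELDS = {"nhs_number", "date_of_birth", "postcode"}


def new_cohort_members(previous_cohort, updated_cohort):
    """Members added this month, for whom historical data would be supplied."""
    return set(updated_cohort) - set(previous_cohort)


def remove_withdrawn(records, withdrawn_ids):
    """Drop records for participants who withdrew after the request was made."""
    return [r for r in records if r["participant_id"] not in withdrawn_ids]


def de_identify(record):
    """Return a copy of the record without direct identifiers."""
    return {k: v for k, v in record.items() if k not in IDENTIFIABLE_FIELDS}


# Made-up records for illustration only.
records = [
    {"participant_id": "G1", "nhs_number": "0000000001",
     "date_of_birth": "1970-01-01", "diagnosis_code": "U07.1"},
    {"participant_id": "G2", "nhs_number": "0000000002",
     "date_of_birth": "1980-02-02", "diagnosis_code": "J12.8"},
]
withdrawn = {"G2"}

cleaned = [de_identify(r) for r in remove_withdrawn(records, withdrawn)]
print(cleaned)  # [{'participant_id': 'G1', 'diagnosis_code': 'U07.1'}]
print(new_cohort_members({"G1", "G2"}, {"G1", "G2", "G3"}))  # {'G3'}
```

Filtering withdrawals before de-identification mirrors the order given in the text; the participant identifier is kept so that records can still be linked to genomes inside the research environment.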
Researchers are members of a Genomics England Clinical Interpretation Partnership (GeCIP) or the Discovery Forum. Research requests are assessed to ensure that they are included in the approved use purposes set out in the Genomics England Protocol, and fall within the scope of the relevant GeCIP or the Discovery Forum. Researchers declare any data they wish to bring into the research environment and any tools they wish to use for analysis. The third stage of processing is the analysis of the de-identified data sets within the research environment. Researchers perform all the analysis and processing within the environment: they do not extract de-identified data. Results data are placed in a secure folder for anonymisation verification before extraction. There will be no data linkage undertaken with NHS Digital data provided under this agreement that is not already noted in the agreement. The Research Environment (TRE): All research analysis on the Genomics England dataset will only be carried out via a secure analysis environment hosted within the Genomics England data center - the Genomics England Research Environment. Analytical tools and applications are available within the Research Environment. No sequencing or clinical data are made available for download, users cannot copy or paste out of the Research Environment, and there is limited internet access within it (i.e. whitelisted sites). Movement of files into and out of the Research Environment is governed via an 'Airlock' Policy. Academic researcher access to the Research Environment: Academic researchers access the Research Environment by applying to be a member of a Genomics England Clinical Interpretation Partnership (GeCIP) domain. GeCIP membership is open to any individual, student or member of staff, who is affiliated with a host institution which include the following: o UK academic research institutions (e.g. universities, research institutions etc.) 
o NHS trusts or authorities o UK and foreign charitable organisations directly related to the focus of the 100,000 Genomes Project o Foreign universities and research institutions that carry out significant research activity o UK and foreign governmental departments that carry out significant research activity (e.g. MRC, NIH, PHE) o Foreign healthcare organisations (private or public) that undertake significant research activity Membership is not open to those who are self-employed or employed by: o private UK healthcare institutions o commercial companies. To be eligible for data access as a GeCIP member, applicants must meet these requirements: o Their host institution has signed a GeCIP Participation Agreement, which outlines the key principles that members of each institution must adhere to, including the Intellectual Property and Publication Policy. o Their host institution has verified that they are affiliated with that institution. o The applicant's GeCIP domain has submitted a detailed research plan and it has been approved by the Genomics England Access Review Committee (see below). o The GeCIP domain lead has approved the application. Following approval, GeCIP researchers must sign a specific agreement ('GeCIP rules', which is attached) covering their behavior and working practice within the data infrastructure. Data access will not then be granted until a researcher has successfully passed mandatory information governance training. Commercial researcher access to the research environment: Genomics England operates a membership-based forum - the Discovery Forum - which is open to a range of companies world-wide and allows access to the Research Environment. It provides a platform for collaboration between Genomics England, industry partners, academia, the NHS and the wider UK genomics landscape. Each Discovery Forum member signs a Data Access Agreement with Genomics England. 
This states the research purposes which the company is authorised to carry out and stipulates the number of genome sequences that can be accessed. It covers the Company's behavior and working practices: in particular it binds users to Genomics England's Airlock Policy, Information Governance, IT Security and Data Protection Policies. Companies need to nominate named individuals to be their Researchers who must complete information governance training before accessing data. Once the Data Access Agreement is in place, each research project undertaken by the Company within the Research Environment must receive prior ARC approval. Discovery Forum members access the Research Environment in a similar manner to GeCIP Researchers: all research is carried out within the Research Environment, and any movement of results out of the environment occurs only through the Airlock Process. The Access Review Committee: The Access Review Committee (ARC) provides an independent examination of requests for data access, with regard to the acceptable uses of the Genomics England dataset which are outlined in The National Genomic Research Library Protocol, the 100,000 Genomes Project Protocol and the Data Access and Acceptable Uses Policy. The ARC comprises external scientific experts, patient representatives and members of Genomics England's Participant Panel, which is made up of participants and parents/carers involved in the 100,000 Genomes Project. The Airlock Process: The Genomics England Research Environment has been developed with the intention that all data analysis is carried out within it and that the only data to leave it are analytical results. An Airlock process has been established which enables material (data, files, tools etc.) to be moved in or out of the Research Environment in a controlled and supervised manner, facilitating research and discovery while maintaining control of security and access. Removal of results therefore requires an Airlock request. 
The following rules are applied to all Airlock requests: o All relevant details of the files to be transferred must be provided with every request. o All files transferred may be checked by Genomics England to ensure compliance with the relevant policies. Users will be notified of any files rejected along with the reason for the rejection. o All files transferred will be checked for viruses and malware and those failing this test will be rejected. It is the responsibility of the requestors to resolve such issues before re-submitting the file for transfer. o Files requested for transfer are assessed using the following criteria: • whether the request aligns with the user's ARC approval • whether the request can clearly be demonstrated to be aligned with a registered project in the Research Environment • any data security implications • any disclosure risks • the technical feasibility and associated cost of the request • when importing data, its scientific value to the community of researchers within the Research Environment, and when and how it will be shared • when importing data, checks will be performed to ensure that the data importer owns the data and holds the correct consents and approvals. The Airlock process is governed by the Airlock Policy (attached), which defines the process and governance of the Airlock process. A set of Airlock Policy Guidelines presents the rules-of-thumb/principles that will be referenced by both the researcher (during preparation of analysis results) and the output checker (during output-checking). Analysed results are inspected to ensure they cannot be used to disclose the identity of the participant. Checking of statistical output by the Airlock Review Team is governed by a generalizable set of principles that guide individual decisions and ensure flexible evaluation of the Genomics England dataset. 
By using a principles-based approach where each case is assessed individually the security of the dataset is maintained by exporting only 'safe' data. Review of transfer requests resulting in public-sharing/publication of data will be checked more stringently. Any approved Airlock export can only be used for the specific use detailed in the original export. The Research Environment contains External Data (for example Hospital Episodes Statistics [HES]) which is subject to data sharing framework contracts and data sharing agreements between Genomics England and other parties that dictate how the data may be used and what can be exported. Where an export contains External Data, Genomics England will always apply the requirements placed on them as conditions of having access to the data. In some cases, particularly concerning the export of individual-level data, these will be more conservative than those applied to 100,000 Genomes Data alone. The Airlock Review Team is a delegation of the Genomics England Chief Scientist responsible for oversight of all airlock requests in accordance with the Airlock Policy and the group's Terms of Reference. It comprises: o Senior Information Risk Office (SIRO) o Technical Lead o User Community Representative o Bioinformatics Director o Caldicott Guardian o Chief Scientist Sub-licencing: Genomics England has developed the Research Environment to allow registered third parties to access pseudonymised versions of the data that it holds, for the purposes of approved research. The Research Environment contains External Data (for example Hospital Episodes Statistics [HES]) which is subject to data sharing framework contracts and data sharing agreements between Genomics England and other parties that dictate how the data may be accessed. Genomics England will always apply the requirements placed on them as holders of External Data to users of the Research Environment as a condition of having access to the data. 
The data is NOT for onward sharing outside of the Research Environment. Data control: For clarity, the University of Edinburgh is responsible for acquisition of primary clinical data. That relates to data acquired at patient registration. Genomics England, in its provision of whole genome sequencing, is applying to NHS Digital for secondary clinical data to link to the genomic data. With regard to data provided by NHS Digital, Genomics England is the sole data controller. To this end, staff and academics from the University of Edinburgh are required to join a GeCIP to access the secondary clinical data from NHS Digital within the Research Environment. Genomics England provides NHS Digital with linking data in order to receive longitudinal data sets. These data sets are delivered to Genomics England by NHS Digital on a monthly basis, having been approved by the NHS Digital IGARD. Genomics England identifies the linking data and agrees with NHS Digital the scope of the longitudinal data being provided. Genomics England determines the method of de-identification and storage within the research environment and secures this data for use by approved researchers only. Genomics England determines who these researchers are. Genomics England is the Data Controller for longitudinal data sets processed in the Genomics England Research Library. Researchers in academic, educational or commercial organisations: access to de-identified data in the research environment, which will include longitudinal data sets (HES etc) provided by NHS Digital. Access to the Research Environment is only allowed under an access agreement. The individual researchers are Data Controllers when carrying out research within the research environment. Data disseminated under this agreement for COVID-19 purposes will be restricted to the GEL Covid research environment. Only COVID-19 research approved studies will be granted access to the data. 
All research which is granted access for COVID-19 purposes must be employed or engaged for the purposes of the health service as the request for data is to support research that has been set as a priority by the CMO. Research which is approved using the data for the COVID-19 specific purposes will be published here https://www.genomicsengland.co.uk/about-gecip/research-2/ Data Processors: o Only summary level data can be removed from the environment. o Approved researchers will only be able to access Lifebit’s PaaS CloudOS through a virtual desktop. o Secondary data will be ingested into CloudOS. o CloudOS will be hosted within GEL’s London AWS (Amazon Web Services) environment - All data is encrypted in transit and at rest. o CloudOS controls access to the secondary data. o A security and DPIA assessment will be conducted prior to loading live Lifebit: Lifebit has been selected as platform partner to deliver the Research Environment after reviewing several proposals. The UK-based SME offered a proven and innovative technology solution offering a blend of robustness and ease of use. Lifebit CloudOS provides a secure and collaborative workspace to enable researchers to easily perform COVID-19 genomic data analysis. The platform will deliver an intuitive, integrated and collaborative user experience and enables fast, effective COVID-19 research outcomes across a wide range of academic and biotech/pharma researchers with varying levels of technical competency. Data Minimisation: This will be limited to the selected cohorts and additions and deletions will be updated regularly. Cohort Size: Briefly, there will be effectively 2 cohorts, one is the 100K Project (CONTROL COHORT) which is about 92,000 participants. This will be a near enough static list. The second cohort is the covid-19 recruited cohort participants.
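
The first stage of processing described above (removing recently withdrawn participants, then de-identifying the data before upload) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the agreement does not state the de-identification method or the field lists, so the field names, the keyed-hash pseudonym and the function names below are hypothetical and are not Genomics England's actual implementation.

```python
import hashlib
import hmac

# Hypothetical list of identifiable fields; the agreement says such lists
# exist per data set but does not publish them.
IDENTIFIABLE_FIELDS = {"nhs_number", "date_of_birth", "postcode"}


def pseudonym(participant_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a participant ID (assumed method)."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_release(records, withdrawn_ids, secret_key):
    """Drop withdrawn participants, then strip direct identifiers."""
    released = []
    for record in records:
        # Withdrawals notified after the data application are removed first.
        if record["participant_id"] in withdrawn_ids:
            continue
        # Keep only fields that are not direct identifiers.
        clean = {
            key: value
            for key, value in record.items()
            if key not in IDENTIFIABLE_FIELDS and key != "participant_id"
        }
        # Replace the participant ID with a pseudonym so records stay linkable.
        clean["pseudo_id"] = pseudonym(record["participant_id"], secret_key)
        released.append(clean)
    return released
```

The order of operations follows the text: withdrawals are filtered out before de-identification, because once direct identifiers are gone the withdrawal list can no longer be matched against the records without the pseudonym key.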


Project 2 — DARS-NIC-12784-R8W7V

Opt outs honoured: No - consent provided by participants of research study, No - data flow is not identifiable (Reasonable Expectation, Consent (Reasonable Expectation))

Sensitive: Sensitive, and Non Sensitive

When: 2016/04 (or before) — 2020/07.

Repeats: Ongoing, One-Off

Legal basis: Informed Patient consent to permit the receipt, processing and release of data by the HSCIC, Health and Social Care Act 2012 – s261(2)(c)

Categories: Identifiable, Anonymised - ICO code compliant

Datasets:

  • Hospital Episode Statistics Accident and Emergency
  • Hospital Episode Statistics Admitted Patient Care
  • Hospital Episode Statistics Outpatients
  • Hospital Episode Statistics Critical Care
  • MRIS - List Cleaning Report
  • MRIS - Flagging Current Status Report
  • MRIS - Cause of Death Report
  • Mental Health and Learning Disabilities Data Set
  • Mental Health Minimum Data Set
  • Bridge file: Hospital Episode Statistics to Diagnostic Imaging Dataset
  • Diagnostic Imaging Dataset
  • Bridge file: Hospital Episode Statistics to Mental Health Minimum Data Set
  • Patient Reported Outcome Measures (Linkable to HES)
  • MRIS - Cohort Event Notification Report
  • MRIS - Members and Postings Report
  • Mental Health Services Data Set
  • Emergency Care Data Set (ECDS)

Objectives:

The aim is to create a new genomic medicine service for the NHS – transforming the way people are cared for. Patients may be offered a diagnosis where there wasn’t one before. In time, there is the potential of new and more effective treatments. The project will also enable new medical research. Combining genomic sequence data with medical records is a ground-breaking resource. Researchers will study how best to use genomics in healthcare and how best to interpret the data to help patients. The causes, diagnosis and treatment of disease will also be investigated. We also aim to kick-start a UK genomics industry. This is currently the largest national sequencing project of its kind in the world. Genomics England is seeking to obtain information from participants’ medical records that spans their entire lifetime. The DNA sequence, information from patients’ health records and any other information given to the Project will be collected and stored securely by the Project as a resource for use by approved researchers for future scientific and medical purposes during the life and after the death of participants. Diagnoses arising from the sequencing and analysis of the participants’ DNA are already being fed back to participants, and many are receiving a diagnosis for the first time. Genomics England’s legacy will be a genomics service ready for adoption by the NHS, high ethical standards and public support for genomics, new medicines, treatments and diagnostics and a country which hosts the world’s leading genomic companies. It is a bold ambition with benefits for all.

Yielded Benefits:

Over 41,000 genomes sequenced as of December 2017. Participant stories can be found at: https://www.genomicsengland.co.uk/alexs-story/ Genomics England has built upon its commitment to lead on the Government’s technology and innovation agenda by forging partnerships with industry. Examples of this include a new industry collaboration with leading life sciences companies Inivata and Thermo Fisher Scientific to improve understanding of cancer. Public Health England has announced that Whole Genome Sequencing (WGS) is now being used to identify different strains of tuberculosis (TB). This is the first time that WGS has been used as a diagnostic solution for managing a disease on this scale anywhere in the world. The technique, developed in conjunction with the University of Oxford, allows faster and more accurate diagnoses, meaning patients can be treated with precisely the right medication more quickly. Genomics England has now engaged devolved nations and is recruiting participants from Scotland and Wales. Update May 2018: Over 60,000 genomes have now been sequenced and over 12,000 clinical reports have been issued to NHS Genomic Medicine Centres. Thirty disease and cross-cutting research domains have had their plans approved and now have access to 100,000 Genomes Project data. The number of users with access to the Genomics England Research Environment is now over 1,300. Twelve publications have arisen from or refer to the 100,000 Genomes Project during the last year, including: • The 100,000 Genomes Project: bringing whole genome sequencing to the NHS. Clare Turnbull et al. BMJ 2018; doi: https://doi.org/10.1136/bmj.k1687 (24 April 2018) • Identification of rare sequence variation underlying heritable pulmonary arterial hypertension. Nicholas W. Morrell et al. Nature Communications 2018;9; doi:10.1038/s41467-018-03672-4 (12 April 2018) • Introducing genomics into cancer care. 
Sue Hill. Br J Surg 2018;105(2):e14–e15 (17 January 2018) • Missense variants in the X-linked gene PRPS1 cause retinal degeneration in females. Alessia Fiorentino, Kaoru Fujinami, Gavin Arno et al. Hum Mutat 2017; doi:10.1002/humu.23349 (17 October 2017) See https://www.genomicsengland.co.uk/category/updates/ and https://www.genomicsengland.co.uk/about-gecip/publications/ for details of news and publications. Genomics England created the Discovery Forum in July 2017 to build on the work of the GENE Consortium. The Discovery Forum provides a platform for collaboration and engagement between Genomics England, industry partners, academia, the NHS and the wider UK genomics landscape.

Expected Benefits:

The overall benefits realisation for the project is established by the Department of Health (DoH). Each individual research study will have its own specific aims and benefits that underpin the DoH benefits. The 10 key benefits have been drafted as: 1. It is anticipated that many of the circa 20,000 patients with rare diseases who provide their genomes for sequencing as part of the Project will receive a formal diagnosis for the first time. 2. The speed of processing the data from Whole Genome Sequences should be greatly increased with an associated acceleration of diagnosis; something that previously has taken years to identify should, under the Project, be possible in a few months. 3. It is hoped that Genomic diagnosis as a result of the Project will enable clinicians to make cancer treatment more personalised by determining how effective treatments like Herceptin or radiotherapy are likely to be. This will improve the effectiveness of treatments and may provide financial savings. 4. Although not all patients involved in the Project will benefit from a significant improvement in their own condition, for most the benefit will be in knowing that they will be helping people like them in the future. 5. The Project has already identified issues with the current approach for collecting DNA from cancer tumours. A current study within the Project is looking at identifying optimum methods for collecting DNA from cancer tumours. This is something that has previously been incredibly difficult to do at scale and which is essential for high quality Whole Genome Sequencing. 6. As a result of the high standards of ethical practice and transparency underpinning the Project, the case will be made for collecting genomic data, linking it to the phenotypic data and sharing it in a controlled way with academics, researchers and industry. 7. 
The creation of NHS Genomic Medicine Centres will allow engagement and feedback to patients with rare diseases and cancer from the Project and will provide the infrastructure to bring about transformational change in the NHS so that it continues to deliver world-leading healthcare in the future. 8. As a result of the Project, the NHS and Public Health workforce will benefit from additional education in genomic medicine, including 550 places for an MSc in Genomic Medicine over the next 3 years, increased capacity in the scientific workforce, and a legacy of education and training in genomics for the future workforce. 9. The secure dataset of genomic and clinical data which is created as a result of the Project will enable clinicians, researchers and industry to discover new variants with a view to creating new diagnostics and treatments. 10. The Project will kick-start the development of the UK industry in Whole Genome Sequencing. The global genomics market was valued at an estimated £7.6 billion in 2013 and is expected to reach over £13 billion by 2018.

Outputs:

All outputs from research environments will be anonymised. The outputs will relate to the purposes described above for each of the research areas. Proof of concept outputs will be produced during the summer of 2015, with a move to researcher created outputs during the Autumn of 2015 onwards. The specific outputs are defined by the research groups and then verified for being anonymous when an extract is requested.

Processing:

Amendment - Genomics England has engaged the Clinical Trial Service Unit & Epidemiological Studies Unit (CTSU) at the University of Oxford to act as a Data Processor. The Data Processor will provide data handling services related to the acquisition and cleaning of registry based data provided by the HSCIC for the consented Genomics England participants. The University of Oxford will access identifiable record level data in the performance of this function. The scope of data processing activities is limited. All data processing activities will be performed in the Genomics England Data Centre and no data will leave these servers. Genomics England will remain the Data Controller and remain responsible for all aspects of system security and access control. Oxford will not access data remotely or take any data away from the Genomics England data centre. There are three principal stages of processing: 1. Data acquisition, cleansing, quality verification, linkage and de-identification 2. Identification of participant cohorts that meet research scope parameters 3. Data analysis for research using de-identified data The first stage focuses on the acquisition of data and quality verification to ensure it is complete, accurate and complies with the NHS data dictionary and other data standards that apply. The data is provided over a period of time (related to the treatment of participants) and associated with their longitudinal data from other NHS sources. The intention over the course of the Project is to link this data with other data, such as primary, secondary, social and participant provided data. For this application the request is limited to HES Data. The richness of the high quality data sets is crucial to the success of the 100,000 Genomes Project in delivering value to the NHS. 
The evaluation of whole genome sequencing (WGS) data in the context of rich and extended phenotypes derived from electronic health records, such as blood pressure, cholesterol, glucose, and pharmacogenomics, adds significant value. The richness of the Project dataset will allow us to move beyond the primary phenotype of the rare disease, cancer or infectious disease that led to the patient’s enrolment to evaluate the WGS in the context of other continuous traits, diseases and response to therapy. As soon as the data completeness and quality have been confirmed, the data is de-identified, as all subsequent processing can be performed without direct identifiers. This de-identification is a key facet of the 100,000 Genomes Project. The second stage is focused on the confirmation and approval of valid research scope and selecting a de-identified cohort of participants that fulfil the focus of the research request. The Researchers will be members of a Genomics England Clinical Interpretation Partnership (GeCIP) or a GENE Consortium. GeCIP. The overall aim of the Genomics England Clinical Interpretation Partnership (GeCIP) is to create a thriving, sustainable environment for researchers and clinical (NHS) disease experts. The activities of GeCIP will inform NHS feedback to clinicians and the multidisciplinary teams by providing enhanced data interpretation, additional information on pathogenicity of variants, and functional characterisation. GENE Consortium. Genomics England is running an industry trial during the calendar year 2015. 12 pharma, biotech and diagnostics companies have committed to invest monetary and FTE resources to understand how best to realise the value from working with Genomics England, our Bioinformatics Platform Partners and the wider NHS. Across the 100,000 Genomes Project, Genomics England will be at the forefront of life science programmes in the UK and worldwide. 
For example, gene discovery in the 100,000 Genomes Project will create significant opportunities for scientific innovation and place particular emphasis upon national and international collaborations. Where possible, we will work with key international programmes including Development Disorders (DDD) and Orphanet, and complement the work of the International Rare Diseases Consortium (IRDC). All research requests will be assessed to ensure they are included in the approved use purposes set out in the Genomics England Protocol and that they comply with the boundaries of the research group (Genomics England Clinical Interpretation Partnership or GENE Consortium). Each research request will be for a sub-set of the de-identified data, with the specific data requirements specified in the request. The researchers also declare any data they wish to bring into the environment and any tools they wish to use for analysis. The third stage is the research analysis of the de-identified approved data sets in the virtual data centre environments. Researchers perform all the analysis and processing within the environments hosted by Genomics England; they do not extract de-identified data. Researchers will use pre-declared data and tools to perform their analysis. If researchers want to extract any anonymised results data, they must first put any such results in a secure folder for anonymisation verification before they can be extracted. A simplified view of the Genomics England Data Flow is shown below. Note the de-identified export boundaries into the Genomics England Core Research Repository. Genomics England provide the HSCIC with a cohort for linkage and they receive HES data from the HSCIC on a monthly basis. Every quarter Genomics England provide an updated cohort to the HSCIC and the HSCIC provide the historical data for the extra cohort members. The cohort is already flagged with the HSCIC, so Genomics England will only receive the historical data for the extra cohort members each quarter.
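
The cohort update described above, where historical data is supplied only for members added since the cohort was last flagged, amounts to a set difference. The Python sketch below is a hypothetical illustration only: the function name and the use of plain string identifiers are assumptions, and the agreement does not describe how the HSCIC or Genomics England actually compute the update.

```python
def extra_cohort_members(flagged_ids, updated_ids):
    """Return IDs in the updated cohort that are not yet flagged.

    Only these members would need their historical data supplied;
    members already flagged are covered by the regular data flow.
    """
    return sorted(set(updated_ids) - set(flagged_ids))
```

For example, with a flagged cohort of A and B and an updated cohort of A, B and C, only C would be returned as needing historical data.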