Whole genome sequencing to identify cis and trans regulators of the proteome

Summary

The majority of genetic variants causing common human disease are non-coding. Understanding the non-coding genome and identifying regulatory elements and variants is the next major challenge in human genetics. Circulating protein level data are ideal traits to identify such regions due to their biologically-proximal nature.

What are we doing?

We aim to use data on 2924 blood proteins measured in 55,000 individuals with whole genome sequencing data to identify regulatory elements and variants affecting protein levels. This will allow us to develop methodology that can be applied to other traits and diseases, providing key insights into the function of the non-coding regulatory genome.

How are we doing it?

We will extend our cis analysis (within 1 Mb of the gene coding for the protein) to identify trans regions showing evidence of genetic effects on protein levels in ~55,000 individuals with whole genome sequencing (WGS) and protein data in the UK Biobank.
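The cis/trans distinction used above can be illustrated with a minimal Python sketch. It assumes the 1 Mb window is measured from the gene's start and end coordinates on the same chromosome; the function name, argument names, and coordinate convention are hypothetical and are not taken from the project's pipeline.

```python
CIS_WINDOW_BP = 1_000_000  # 1 Mb window, as stated in the text


def classify_variant(var_chrom, var_pos, gene_chrom, gene_start, gene_end,
                     window=CIS_WINDOW_BP):
    """Label a variant 'cis' or 'trans' relative to a protein's coding gene.

    Assumption: 'within 1 Mb of the gene' means within `window` base pairs
    of the gene's start or end, on the same chromosome.
    """
    if var_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= var_pos <= gene_end + window:
        return "cis"
    return "trans"
```

For example, under these assumptions a variant 500 kb upstream of a gene is labelled cis, while a variant on another chromosome, or more than 1 Mb away on the same chromosome, is labelled trans.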

The analysis will comprise two parts.

We will perform single variant association analysis genome-wide to identify both common and rare genetic variants which are associated with protein levels. Step-wise conditional analysis will be performed to identify conditionally independent variants, which will then be tested in joint-effects analyses to ensure identified loci robustly associate with protein levels.
We will then perform aggregate burden testing in both coding and non-coding regions of the genome. We will prioritise further analysis of proteins that showed evidence of cis associations in our preliminary analysis. Because these analyses are costly to run genome-wide, we will perform genome-wide aggregate testing, which identifies trans associations with protein levels, only for those proteins where coding variants in the relevant gene are associated with protein levels.
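The step-wise conditional analysis in the first part can be sketched as a forward-selection loop. This is a simplified illustration, not the project's implementation. It assumes ordinary least squares on an in-memory genotype matrix, a normal approximation for p-values (reasonable only for large samples), and a genome-wide threshold of 5e-8. The function names and the threshold are assumptions, and covariates, relatedness, and rare-variant handling are omitted.

```python
import math

import numpy as np


def _variant_pvalue(y, conditioned, g):
    """Two-sided p-value for genotype g in an OLS fit of y on
    [intercept, already-selected variants, g] (normal approximation)."""
    n = len(y)
    X = np.column_stack([np.ones(n), conditioned, g])
    beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    if rank < X.shape[1]:
        return 1.0  # g is collinear with variants already in the model
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = beta[-1] / math.sqrt(cov[-1, -1])
    return math.erfc(abs(z) / math.sqrt(2))


def stepwise_conditional(y, G, p_threshold=5e-8):
    """Return column indices of G selected by forward step-wise conditioning.

    At each step, every remaining variant is tested conditional on the
    variants selected so far; the most significant one is added if it
    passes the threshold, otherwise the search stops.
    """
    n_variants = G.shape[1]
    selected = []
    while len(selected) < n_variants:
        conditioned = G[:, selected]
        best_j, best_p = None, 1.0
        for j in range(n_variants):
            if j in selected:
                continue
            p = _variant_pvalue(y, conditioned, G[:, j])
            if p < best_p:
                best_j, best_p = j, p
        if best_j is None or best_p >= p_threshold:
            break
        selected.append(best_j)
    return selected
```

A variant that is only associated because it is correlated with an already-selected variant loses its signal once that variant is in the model, so the loop returns a set of conditionally independent signals. A joint model containing all selected variants could then be fitted to check that each one remains associated.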

What happens next?

This project will provide a substantial advance in our understanding of the role of non-coding variants in human disease. It will allow us to develop efficient and cost-effective approaches to analysing whole genome sequence data. Such approaches are needed if we are to make major advances in understanding disease mechanisms using whole genome sequencing.

Other collaborators

Prof Michael Weedon (University of Exeter)