12th Annual IGSS Conference • October 28-29, 2021

Integrating Genetics and the Social Sciences 2021

Consequences of natural and mortality selection for genetic discovery in the UK Biobank

Felix Tropf, CREST/ENSAE

After the initial enthusiasm about the availability of large-scale genetic and phenotypic data across Europe and the US, more and more research focuses on the representativeness and selection bias of these studies. First, in parallel to psychological research, most samples are based on WEIRD (western, educated, industrialized, rich, and democratic) populations (Henrich et al., 2010). Our current genetic knowledge from GWASs is based on only around 24 of the approached target population and is highly selective in several ways, overrepresenting individuals with lower genetic risk of mental health problems and of high BMI, nonsmokers, and individuals with higher education and from less economically deprived areas (Batty, Gale, Kivimäki, Deary, & Bell, 2020; Fry et al., 2017; Howe et al., 2021; Pirastu et al., 2021; Tyrrell et al., 2021). In fact, whether someone participates in a study or completes a survey is a heritable behavioural trait (Abdellaoui & Verweij, 2021) showing significant genetic correlations with educational attainment and physical and mental health (Adams et al., 2021). Follow-up studies in the UK Biobank are also selective: participation correlates with the genetic predisposition for, for example, intelligence, Alzheimer's disease, neuroticism and schizophrenia (Tyrrell et al., 2021). In contrast to sociologists and especially demographers, who focus strongly on the representativeness of data sources for the purpose of external validity, geneticists and epidemiologists at times even argue for nonrepresentative samples (Elwood, 2013; Rothman, Gallacher, & Hatch, 2013), since internal validity is their core concern and associations are expected to be causal and universal. However, participation bias can produce statistical artefacts such as spurious (genetic) correlations, for instance through collider bias (Munafò, Tilling, Taylor, Evans, & Smith, 2018).
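The collider bias mentioned above can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not an analysis from the study: the two traits, the participation threshold and all variable names are hypothetical, and the traits are constructed to be independent in the full target population.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two hypothetical traits, independent in the full target population.
education = rng.standard_normal(n)
health = rng.standard_normal(n)

# Assumed participation model: higher values of either trait make
# participation more likely (participation is the collider).
liability = education + health + rng.standard_normal(n)
participates = liability > 1.0

# Correlation in the full population versus among participants only.
r_population = np.corrcoef(education, health)[0, 1]
r_participants = np.corrcoef(education[participates], health[participates])[0, 1]

print(f"full population: r = {r_population:.3f}")
print(f"participants:    r = {r_participants:.3f}")
```

In this setup the correlation is close to zero in the full population but clearly negative among participants, although neither trait influences the other. Conditioning on participation alone is enough to induce the association.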
Others have shown how the prediction accuracy of polygenic scores depends on the age or sex composition of the GWAS discovery sample and on study design (Mostafavi et al., 2020). Some have argued that exposure-disease correlations are still generalizable (Fry et al., 2017); however, we now know that genetic associations are modifiable (Lee et al., 2018), particularly for behavioral and lifestyle outcomes (Tropf et al., 2017). These modifiers bias the GWAS estimates toward values that hold mainly for the overrepresented group. The issue extends to prediction studies, whose samples can differ from the discovery data in the distribution of moderating traits or even in the prevalence of a disease (Keyes & Westreich, 2019). The current study investigates a general pattern of selectivity in the UK Biobank, namely genetic selection by birth cohort. Several mechanisms might make allele frequencies of single nucleotide polymorphisms (SNPs) heterogeneous across birth cohorts, among them assortative mating, migration, natural selection and mortality selection. While all of these might affect genetic discovery and genetic correlation analyses in the same way as other representativeness issues, only mortality selection would be due to data collection issues. The main goals of our study are to quantify the potential bias that genetic selection by birth cohort introduces into genetic discovery studies, to provide tools for correction, and to attribute the phenomenon to the different mechanisms potentially causing it.

2. Analytical strategy

First, we run a GWAS of birth cohort to identify the SNPs with the strongest changes in allele frequency over time in the UK Biobank. Based on those SNPs, we simulate a phenotype with zero heritability and test whether a GWAS nonetheless finds SNPs associated with this phenotype.
SNPs that are still associated indicate the bias to be expected in a standard GWAS of any phenotype due to genetic selection by birth cohort. We furthermore simulate a phenotype with moderate heritability that is homogeneous across birth cohorts. We split the sample at the median birth cohort and investigate the genetic correlation (rG) across time. This rG is simulated to be 1, and the estimate is tested against this value. Heterogeneity (rG < 1) indicates bias in genetic correlations under a study design that does not consider selection. We will use genetic structural equation modelling (GSEM) to correct GWAS results for genetic birth cohort selection bias and to investigate the role of mortality (Timmers et al., 2019) and natural selection (Mathieson et al., 2020) in causing this selection. Finally, genetic principal components are typically perceived as a "mirror for geography" (Novembre et al., 2008). We will investigate to what extent they also vary over time and correct for this.
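The logic of the first two steps can be sketched for a single SNP. This is a toy illustration, not the authors' simulation design: the abstract does not specify how the zero-heritability phenotype is generated, so the sketch assumes one possible mechanism, namely that the phenotype follows a secular trend across birth cohorts. The effect sizes, the birth-year range and the helper `ols_slope_z` are all hypothetical, and the correction shown is plain covariate adjustment rather than the GSEM approach proposed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical birth years, standardized to a cohort score.
birth_year = rng.integers(1936, 1971, size=n)
cohort = (birth_year - birth_year.mean()) / birth_year.std()

# Assumed selection: the allele frequency of one SNP drifts with birth cohort.
allele_freq = np.clip(0.30 + 0.03 * cohort, 0.01, 0.99)
genotype = rng.binomial(2, allele_freq).astype(float)

# Zero-heritability phenotype: depends on cohort and noise, never on genotype.
phenotype = 0.5 * cohort + rng.standard_normal(n)


def ols_slope_z(y, x, covariate=None):
    """z statistic for the slope of x in an OLS regression of y (hypothetical helper)."""
    columns = [np.ones(len(y)), x]
    if covariate is not None:
        columns.append(covariate)
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (len(y) - design.shape[1])
    cov_beta = sigma2 * np.linalg.inv(design.T @ design)
    return beta[1] / np.sqrt(cov_beta[1, 1])


z_cohort_gwas = ols_slope_z(genotype, cohort)            # step 1: SNP ~ birth cohort
z_naive = ols_slope_z(phenotype, genotype)               # step 2: unadjusted SNP effect
z_adjusted = ols_slope_z(phenotype, genotype, cohort)    # SNP effect adjusted for cohort

print(f"SNP ~ cohort:             z = {z_cohort_gwas:.1f}")
print(f"phenotype ~ SNP:          z = {z_naive:.1f}")
print(f"phenotype ~ SNP + cohort: z = {z_adjusted:.1f}")
```

Under these assumptions the unadjusted test flags the SNP even though the phenotype has no genetic component, while the cohort-adjusted test statistic should behave like noise. In real data the picture is less clean, because cohort-related selection need not be captured by a single linear covariate.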
