Case-Control Allele Frequency Estimation (CCAFE)

Methods based on summary statistics in genetics can be quite powerful, but their utility can be limited. For instance, many post-hoc analyses of disease studies require case and control allele frequencies (AFs), which are not always published. We present two frameworks to derive case and control AFs from GWAS summary statistics using the odds ratio, the case and control sample sizes, and either the total (case and control aggregated) AF or the standard error (SE). In simulations and real data, derivations of case and control AFs using the total AF are highly accurate across all settings (e.g., minor AF, condition prevalence). Conversely, derivations using the SE underestimate common variant AFs (e.g., minor allele frequency > 0.3) in the presence of covariates. We develop an adjustment using gnomAD AFs as a proxy for true AFs, which reduces the bias when using the SE. While estimating case and control AFs from the total AF is preferred due to its high accuracy, estimation from the SE can be used more broadly, since the SE can be derived from p-values and beta estimates, which are commonly provided. The methods provided here expand the utility of publicly available genetic summary statistics and promote the reusability of genomic data. The R package CCAFE, with implementations of both methods, is freely available on Bioconductor and GitHub.
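To make the total-AF route concrete, here is a minimal Python sketch worked out from the standard definitions of the allelic odds ratio and the sample-size-weighted total AF. It is not the CCAFE package's code or API. The function names and argument names are hypothetical, and the sketch assumes the odds ratio is an unadjusted allelic odds ratio. Writing p1 and p0 for the case and control AFs, OR = [p1/(1-p1)] / [p0/(1-p0)] and AF_total = (N_case p1 + N_control p0) / (N_case + N_control). Substituting the first into the second gives a quadratic in p0 with exactly one root in [0, 1]. The second helper recovers the SE from a beta and a p-value, assuming a two-sided Wald test.

```python
import math
from statistics import NormalDist


def case_control_af_from_total(odds_ratio, af_total, n_case, n_control):
    """Return (af_case, af_control) implied by the OR and the total AF.

    Hypothetical helper: solves a*p0^2 + b*p0 + c = 0 for the control AF p0,
    where the coefficients follow from the two definitions in the lead-in.
    """
    n = n_case + n_control
    a = n_control * (odds_ratio - 1.0)
    b = n_case * odds_ratio + n_control - n * af_total * (odds_ratio - 1.0)
    c = -n * af_total
    disc = b * b - 4.0 * a * c
    # Rationalised form of (-b + sqrt(disc)) / (2a): it selects the root in
    # [0, 1] and also covers OR == 1, where a == 0 and the equation is linear.
    af_control = -2.0 * c / (b + math.sqrt(disc))
    # The case AF follows from the weighted-average definition of the total AF.
    af_case = (n * af_total - n_control * af_control) / n_case
    return af_case, af_control


def se_from_beta_and_p(beta, p_value):
    """Recover the SE of beta, assuming a two-sided Wald test (z = beta / SE).

    Undefined for p_value == 1 (z == 0) and for p_value outside (0, 1].
    """
    z = -NormalDist().inv_cdf(p_value / 2.0)
    return abs(beta) / z
```

For example, with 1,000 cases and 3,000 controls, a case AF of 0.3 and a control AF of 0.2 correspond to a total AF of 0.225 and an odds ratio of about 1.714; passing those two values back through `case_control_af_from_total` returns the original pair. Covariate-adjusted odds ratios will not satisfy the allelic definition exactly, so results from adjusted summary statistics should be read as approximations.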

Hayley Stoneman
PhD Student in Human Medical Genetics and Genomics
Audrey E. Hendricks
Associate Professor of Statistics

I am committed to increasing opportunities for all people to learn about statistics, machine learning, and science. I am motivated to ask novel research questions and ensure the research is robust and accurate. My research interests include developing and applying statistical/machine learning methods across genomics and biomedical informatics to better understand and inform health and disease.
