| Title: | Inferring Age-Dependent Disease Topic from Diagnosis Data |
|---|---|
| Description: | We propose an age-dependent topic modelling (ATM) model, providing a low-rank representation of longitudinal records of hundreds of distinct diseases in large electronic health record data sets. The model assigns to each individual topic weights for several disease topics; each disease topic reflects a set of diseases that tend to co-occur as a function of age, quantified by age-dependent topic loadings for each disease. The model assumes that for each disease diagnosis, a topic is sampled based on the individual’s topic weights (which sum to 1 across topics, for a given individual), and a disease is sampled based on the individual’s age and the age-dependent topic loadings (which sum to 1 across diseases, for a given topic at a given age). The model generalises the Latent Dirichlet Allocation (LDA) model by allowing topic loadings for each topic to vary with age. References: Jiang (2023) <doi:10.1038/s41588-023-01522-8>. |
| Authors: | Xilin Jiang [aut, cre] (ORCID: <https://orcid.org/0000-0001-6773-9182>) |
| Maintainer: | Xilin Jiang <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.0 |
| Built: | 2026-05-29 09:33:04 UTC |
| Source: | https://github.com/cran/AgeTopicModels |
imputing missing age if you can't find some of them The function does two stage imputation: i. if the individual has other age label – use the mean, min, or max of other age labels for the missing ones. ii. if the individual has no age label – use the mean, min, max for all the diagnosis codes iii. if there is no age info available for any of this code, we will impute it as the mean of all age codes in the data
age_imputation(rec_data_missing_age, method = "mean")age_imputation(rec_data_missing_age, method = "mean")
rec_data_missing_age |
a data frame with missing age info |
method |
use one of the three choices "mean", "min", "max" |
a data frame that is imputed and ready for wrapper_ATM
rec_data_missing_age <- HES_age_example rec_data_missing_age$age_diag[1:10000] <- NA rec_data_imputed <- age_imputation(rec_data_missing_age, method= "mean") cor(rec_data_imputed$age_diag[1:10000], HES_age_example$age_diag[1:10000]) rec_data_imputed <- age_imputation(rec_data_missing_age, method= "min") cor(rec_data_imputed$age_diag[1:10000], HES_age_example$age_diag[1:10000]) rec_data_imputed <- age_imputation(rec_data_missing_age, method= "max") cor(rec_data_imputed$age_diag[1:10000], HES_age_example$age_diag[1:10000])rec_data_missing_age <- HES_age_example rec_data_missing_age$age_diag[1:10000] <- NA rec_data_imputed <- age_imputation(rec_data_missing_age, method= "mean") cor(rec_data_imputed$age_diag[1:10000], HES_age_example$age_diag[1:10000]) rec_data_imputed <- age_imputation(rec_data_missing_age, method= "min") cor(rec_data_imputed$age_diag[1:10000], HES_age_example$age_diag[1:10000]) rec_data_imputed <- age_imputation(rec_data_missing_age, method= "max") cor(rec_data_imputed$age_diag[1:10000], HES_age_example$age_diag[1:10000])
A helper table with disease metadata to support mapping between PheCodes and ICD-10.
disease_info_phecode_icd10disease_info_phecode_icd10
A data frame/tibble. Common columns include:
PheCode as character.
Alternative PheCode column name (if present).
ancestor PheCode range (character).
Human-readable phenotype/label (character), if available.
ancestor PheCode name (character).
head(disease_info_phecode_icd10)head(disease_info_phecode_icd10)
Disease matrix reformatting for ATM
diseasematrix2longdata(disease_matrix)diseasematrix2longdata(disease_matrix)
disease_matrix |
a disease matrix with the first column name "eid", other column are disease names. Disease should be coded as 0,1. |
a data frame which can be feed into wrapper_ATM
disease_matrix <- longdata2diseasematrix(HES_age_example) diseasematrix2longdata(disease_matrix)disease_matrix <- longdata2diseasematrix(HES_age_example) diseasematrix2longdata(disease_matrix)
A realistic sized simulated Hospital Episode Statistics (HES) data with participant IDs and ages at diagnosis, used in examples and tests. You would expect the run time of AgeTopicModels on these data is similar to what you face in real life
HES_age_exampleHES_age_example
A data frame/tibble with example rows. Typical columns include:
Participant identifier (integer or character).
Age at diagnosis (numeric).
ICD-10 diagnosis code (character).
head(HES_age_example)head(HES_age_example)
A realistic sized simulated HES diagnoses with participant IDs and ICD-10 codes.
HES_icd10_exampleHES_icd10_example
A data frame/tibble with example rows. Typical columns include:
Participant identifier (integer or character).
ICD-10 diagnosis code (character).
ICD-10 diagnosis age point (double).
head(HES_icd10_example)head(HES_icd10_example)
Mapping the disease code from icd10 to phecode; the mapping are based on https://phewascatalog.org/phecodes; The input if using ICD-10 should be a string f numbers and capital letters only. For example, "I25.1" should be "I251".
icd2phecode(rec_data)icd2phecode(rec_data)
rec_data |
input data which use ICD10 encoding; please refer to the internal example data |
a data frame where most entries are mapped from ICD10 code to phecode
phecode_data <- icd2phecode(HES_icd10_example)phecode_data <- icd2phecode(HES_icd10_example)
Mapping individuals to fixed topic loadings.
loading2weights(data, ds_list = UKB_349_disease, topics = UKB_HES_10topics)loading2weights(data, ds_list = UKB_349_disease, topics = UKB_HES_10topics)
data |
the set of diseases, formatted same way as HES_age_example |
ds_list |
a list of diseases that correspond to the topic loadings that patients are mapped to formatted as UKB_349_disease; default is set to be UKB_349_disease. |
topics |
The topics that are used to map patients. Default is set to be UKB_HES_10topics, which are the inferred topics from 349 Phecodes from the UK Biobank HES data. Details of these topics are available in the paper "Age-dependent topic modelling of comorbidities in UK Biobank identifies disease subtypes with differential genetic risk". |
a list with two dataframes: the topic_weights dataframe has the first column being the individual id, the other columns are the patient topic weights mapped to the topic loadings; The second dataframe column incidence_weight_sum is eid and the cumulative topic weights across all disease diagnoses.
set.seed(1) new_weights <- loading2weights(HES_age_example[1:1000,])set.seed(1) new_weights <- loading2weights(HES_age_example[1:1000,])
Title
longdata2diseasematrix(rec_data)longdata2diseasematrix(rec_data)
rec_data |
A diagnosis data frame with three columns; format data as HES_age_example; first column is individual ids (eid), second column is the disease code (diag_icd10); third column is the age at diagnosis (age_diag). Note for each individual, we only keep the first onset of each diseases. Therefore, if there are multiple incidences of the same disease within each individual, the rest will be ignored. |
a disease matrix with first column being the individual ids, columns follows are diseases with 0,1 coding.
disease_matrix <- longdata2diseasematrix(HES_age_example)disease_matrix <- longdata2diseasematrix(HES_age_example)
Mapping table between ICD-10 codes and PheCodes.
phecode_icd10phecode_icd10
A data frame/tibble. Common columns include:
ICD-10 code (character).
PheCode (character).
ancestor PheCode range (character).
ancestor PheCode name (character).
head(phecode_icd10)head(phecode_icd10)
Mapping table between ICD-10-CM codes and PheCodes.
phecode_icd10cmphecode_icd10cm
A data frame/tibble. Common columns include:
ICD-10-CM code (character).
PheCode (character).
ancestor PheCode range (character).
ancestor PheCode name (character).
head(phecode_icd10cm)head(phecode_icd10cm)
Title plot the topic loadings across age.
plot_age_topics( disease_names, trajs, plot_title = "", start_age = 30, top_ds = 10 )plot_age_topics( disease_names, trajs, plot_title = "", start_age = 30, top_ds = 10 )
disease_names |
the list of disease names, ordered as the topic. |
trajs |
one disease topic, which should be a matrix of age-by-disease. |
plot_title |
the title of the figure. |
start_age |
starting age of the matrix, default 30. |
top_ds |
How many disease to show, default is 10. This will filter the disease by the average topic laodings across age and pick the top. |
a ggplot object of the topic loading.
disease_list <- UKB_349_disease %>% dplyr::left_join(disease_info_phecode_icd10, by = c("diag_icd10"="phecode" )) %>% dplyr::pull(phenotype) topic_id <- 1 # plot the first topic plot_age_topics(disease_names = disease_list, trajs = UKB_HES_10topics[30:80,,topic_id], plot_title = paste0("topic ", topic_id), top_ds = 7)disease_list <- UKB_349_disease %>% dplyr::left_join(disease_info_phecode_icd10, by = c("diag_icd10"="phecode" )) %>% dplyr::pull(phenotype) topic_id <- 1 # plot the first topic plot_age_topics(disease_names = disease_list, trajs = UKB_HES_10topics[30:80,,topic_id], plot_title = paste0("topic ", topic_id), top_ds = 7)
Title plot topic loadings for LFA.
plot_lfa_topics(disease_names, beta, plot_title = "")plot_lfa_topics(disease_names, beta, plot_title = "")
disease_names |
the list of disease names, ordered as the topic. |
beta |
disease topics, which should be a matrix of K-by-disease. |
plot_title |
the title of the figure. |
a ggplot object of the topic loading.
disease_list <- UKB_349_disease$diag_icd10[1:50] topics <- matrix(rnorm(10*length(UKB_349_disease)), nrow = length(UKB_349_disease), ncol = 10) plot_lfa_topics(disease_names = disease_list, beta = topics, plot_title = "Example noisy topics presentation")disease_list <- UKB_349_disease$diag_icd10[1:50] topics <- matrix(rnorm(10*length(UKB_349_disease)), nrow = length(UKB_349_disease), ncol = 10) plot_lfa_topics(disease_names = disease_list, beta = topics, plot_title = "Example noisy topics presentation")
Title Compute prediction odds ratio for a testing data set using pre-training ATM topic loading. Note only diseases listed in the ds_list will be used. The prediction odds ratio is the odds predicted by ATM versus a naive prediction using disease probability.
prediction_OR(testing_data, ds_list, topic_loadings, max_predict = NULL)prediction_OR(testing_data, ds_list, topic_loadings, max_predict = NULL)
testing_data |
A data set of the same format as HES_age_example; Note: for cross-validation, split the training and testing based on individuals (eid) instead of diagnosis to avoid using training data for testing. Note the test data that has diagnosis age outside the topic loading is disgarded, as we don't recommend extrapolate topic loadings outside the training data. |
ds_list |
The order of disease code that appears in the topic loadings. This is a required input as the testing data could miss some of the records. The first column should be the disease code, second column being the occurrence (to serve as the baseline for prediction odds ratio). See AgeTopicModels::UKB_349_disease as an example. |
topic_loadings |
A three dimension array of topic loading in the format of AgeTopicModels::UKB_HES_10topics; |
max_predict |
The logic of prediction is using 1,..N-1 records to predict the Nth diagnosis; we perform this prediction in turn, starting from using first disease to predict sencond.... for the max_predict^th disease, we will just predict all diseases afterwards, using only 1,..(max_predict-1) diseseas to learn the topic weights; default is set to be 11 (using 1,...10 disease to predict). |
The returned object has four components: OR_top1, OR_top2, OR_top5 is the prediction odds ratio using top 1%, top 2%, or top 5% of ATM predicted diseases as the target set; the fourth component prediction_precision is as list, with first element saves the prediction probability for 1%, 2%, 5% and 10%; additional variables saves the percentile of target disease in the ATM predicted set; for example 0.03 means the target disease ranked at 3% of the diseases ordered by ATM predicted probability.
set.seed(1) testing_data <- HES_age_example %>% dplyr::slice(1:1000) new_output <- prediction_OR(testing_data, ds_list = UKB_349_disease, topic_loadings = UKB_HES_10topics, max_predict = 5)set.seed(1) testing_data <- HES_age_example %>% dplyr::slice(1:1000) new_output <- prediction_OR(testing_data, ds_list = UKB_349_disease, topic_loadings = UKB_HES_10topics, max_predict = 5)
A lookup table mapping ICD-10 codes to concise human-readable labels.
short_icd10short_icd10
A data frame/tibble. Common columns include:
ICD-10 code (character).
phecode of parent node (character).
ancestor PheCode range (character).
ancestor PheCode name (character).
number of distinct patient in UKB
head(short_icd10)head(short_icd10)
A lookup table mapping ICD-10-CM codes to concise human-readable labels.
short_icd10cmshort_icd10cm
A data frame/tibble. Common columns include:
ICD-10 code (character).
phecode of parent node (character).
ancestor PheCode range (character).
ancestor PheCode name (character).
number of distinct patient in UKB
head(short_icd10cm)head(short_icd10cm)
Second step of the two-step simulation. Consumes outputs from
simulate_topics() and generates disease outcomes under several
genetic/topic-effect configurations.
simulate_genetic_disease_from_topic( para, genetics_population, causal_disease, disease_number, ds_per_idv = 6.1, itr_effect = 0, topic2disease = 2, v2t = 20, liability_thre = 0.8 )simulate_genetic_disease_from_topic( para, genetics_population, causal_disease, disease_number, ds_per_idv = 6.1, itr_effect = 0, topic2disease = 2, v2t = 20, liability_thre = 0.8 )
para |
Simulated topic parameters; the first element returned by |
genetics_population |
Simulated genotypes; the second element returned by |
causal_disease |
Simulated causal disease; the third element returned by |
disease_number |
Number of additional diseases to simulate from the topic.
The total number of diseases will be |
ds_per_idv |
Mean number of diseases per individual (default |
itr_effect |
Interaction effect size to simulate (default |
topic2disease |
Topic-to-disease effect size (default |
v2t |
Number of variants that affect topic 1 (must match the value used in |
liability_thre |
Liability threshold for simulating disease: the proportion set to healthy.
For example, |
Five configurations across three SNP sets:
SNP -> disease -> topic: SNP IDs 1-20; disease ID para$D + 1; topic ID 1.
SNP * topic -> disease: SNP IDs 41-60; disease ID para$D + 2; topic ID 1.
SNP -> topic -> disease; SNP -> disease: SNP IDs 21-(20 + v2t); disease ID para$D + 3; topic ID 1.
SNP -> topic -> disease; SNP + SNP^2 -> disease: SNP IDs 21-(20 + v2t); disease ID para$D + 4; topic ID 1.
SNP -> topic + topic^2 -> disease; SNP -> disease: SNP IDs 21-(20 + v2t); disease ID para$D + 5; topic ID 1.
A list with four elements:
rec_data: Simulated disease records (primary output).
ds_list: Auxiliary data objects used in the simulation.
interact_disease: Binary disease outcomes for configuration 2.
pleiotropy_disease: Binary disease outcomes for configuration 3.
set.seed(1) # Minimal, fast example rslts <- simulate_topics(topic_number = 2, pop_sz = 1000, disease2topic = 0.1, v2t = 20) para_sim <- rslts[[1]] genetics_population <- rslts[[2]] causal_disease <- rslts[[3]] reslt_ds <- simulate_genetic_disease_from_topic( para_sim, genetics_population, causal_disease, disease_number = 20, itr_effect = 1, topic2disease = 2, v2t = 20 ) rec_data <- reslt_ds[[1]]set.seed(1) # Minimal, fast example rslts <- simulate_topics(topic_number = 2, pop_sz = 1000, disease2topic = 0.1, v2t = 20) para_sim <- rslts[[1]] genetics_population <- rslts[[2]] causal_disease <- rslts[[3]] reslt_ds <- simulate_genetic_disease_from_topic( para_sim, genetics_population, causal_disease, disease_number = 20, itr_effect = 1, topic2disease = 2, v2t = 20 ) rec_data <- reslt_ds[[1]]
First step of a two-step procedure to simulate a genetic-disease-topic structure. In this step, all genetic effects act on topic 1.
simulate_topics( topic_number, num_snp = 100, pop_sz = 10000, disease2topic = 0, v2t = 20, snp2t = 0.04, snp2d = 0.15, liability_thre = 0.8 )simulate_topics( topic_number, num_snp = 100, pop_sz = 10000, disease2topic = 0, v2t = 20, snp2t = 0.04, snp2d = 0.15, liability_thre = 0.8 )
topic_number |
Number of topics to simulate. |
num_snp |
Total number of SNPs (default 100; must be >= 60). |
pop_sz |
Number of individuals (default 10,000). |
disease2topic |
Disease-to-topic effect size (default 0). |
v2t |
Number of variants affecting topic 1 (0-20; default 20). |
snp2t |
SNP-to-topic effect size (default 0.04; informed by UKB). |
snp2d |
SNP-to-disease effect size (default 0.15). |
liability_thre |
Liability threshold: proportion set to healthy.
For example, |
Five configurations across three SNP sets:
SNP -> disease -> topic: SNP IDs 1-20; disease ID para$D + 1; topic ID 1.
SNP * topic -> disease: SNP IDs 41-60; disease ID para$D + 2; topic ID 1.
SNP -> topic -> disease; SNP -> disease: SNP IDs 21-(20 + v2t); disease ID para$D + 3; topic ID 1.
SNP -> topic -> disease; SNP + SNP^2 -> disease: SNP IDs 21-(20 + v2t); disease ID para$D + 4; topic ID 1.
SNP -> topic + topic^2 -> disease; SNP -> disease: SNP IDs 21-(20 + v2t); disease ID para$D + 5; topic ID 1.
A list of length 3:
para: Topic parameters.
genetics_population: Simulated genotype matrix.
causal_disease: One simulated binary disease caused by loading on topic 1.
simulate_genetic_disease_from_topic()
set.seed(1) disease2topic <- 0 v2t_small <- 20 # Step 1: simulate topics (fast) rslts <- simulate_topics( topic_number = 2, pop_sz = 1000, disease2topic = disease2topic, v2t = v2t_small ) para_sim <- rslts[[1]] genetics_population <- rslts[[2]] causal_disease <- rslts[[3]] # Step 2 (optional): generate diseases from topics reslt_ds <- simulate_genetic_disease_from_topic( para_sim, genetics_population, causal_disease, disease_number = 20, itr_effect = 1, topic2disease = 2, v2t = 20 ) rec_data <- reslt_ds[[1]]set.seed(1) disease2topic <- 0 v2t_small <- 20 # Step 1: simulate topics (fast) rslts <- simulate_topics( topic_number = 2, pop_sz = 1000, disease2topic = disease2topic, v2t = v2t_small ) para_sim <- rslts[[1]] genetics_population <- rslts[[2]] causal_disease <- rslts[[3]] # Step 2 (optional): generate diseases from topics reslt_ds <- simulate_genetic_disease_from_topic( para_sim, genetics_population, causal_disease, disease_number = 20, itr_effect = 1, topic2disease = 2, v2t = 20 ) rec_data <- reslt_ds[[1]]
A small mapping table used by functions such as icd2phecode
SNOMED_ICD10CMSNOMED_ICD10CM
A data frame/tibble. Common columns include:
SNOMED CT concept identifier (character).
ICD-10 code (character), and/or
ICD-10-CM code (character).
SNOMED readable explanation
ICD10 occurence in UKB
head(SNOMED_ICD10CM)head(SNOMED_ICD10CM)
A character vector or table listing the set of disease phenotypes used in examples/vignettes.
UKB_349_diseaseUKB_349_disease
A data frame/tibble containing disease identifiers/names. Columns include:
Phecode (character).
number of distinct patient in UKB
@examples head(UKB_349_disease)
An illustrative result object/table from a 10-topic model fit to UKB HES-like data; used for examples, plotting, and tests.
UKB_HES_10topicsUKB_HES_10topics
An array for UKB topic loadings. Dimention is age, disease, topics. the ordering of disease is the same as UKB_349_disease.
head(UKB_HES_10topics)head(UKB_HES_10topics)
Run ATM on diagnosis data to infer topic loadings and topic weights. Note one run of ATM on 100K individuals would take ~30min (defualt is 5 runs and pick the best fit); if the data set is small and the goal is to infer patient-level topic weights (i.e. assign comorbidity profiles to individuals based on the disedases), please use loading2weights.
wrapper_ATM( rec_data, topic_num = 10, degree_free_num = 3, CVB_num = 5, save_data = FALSE )wrapper_ATM( rec_data, topic_num = 10, degree_free_num = 3, CVB_num = 5, save_data = FALSE )
rec_data |
A diagnosis data frame with three columns; format data as HES_age_example; first column is individual ids (eid), second column is the disease code (diag_icd10); third column is the age at diagnosis (age_diag). Note for each individual, we only keep the first onset of each diseases. Therefore, if there are multiple incidences of the same disease within each individual, the rest will be ignored. If there is no age variation in the third column, LDA (no age information) will be run instead of ATM. |
topic_num |
Number of topics to infer. Default is 10 but we highly recommend running multiple choices of this number. |
degree_free_num |
control the parametric for of topic loadings: Degrees of freedom (d.f.) from 2 to 7 represent linear, quadratic polynomial, cubic polynomial, spline with one knot, spline with two knots, and spline with three knots. Default is set to 3. |
CVB_num |
Number of runs with random initialization. The final output will be the run with highest ELBO value. |
save_data |
A flag which determine whether full model data will be saved. If TRUE, a Results/ folder will be created and full model data will be saved. Default is set to be FALSE. |
Return a list object with topic_loadings (of the best run), topic_weights (of the best run), ELBO_convergence (ELBO until convergence), patient_list (list of eid which correspond to rows of topic_weights), ds_list (gives the ordering of diseases in the topic_loadings object), disease_number (number of total diseases), patient_number(total number of patients), topic_number (total number of topic), topic_configuration (control the parametric for of topic loadings: Degrees of freedom (d.f.) from 2 to 7 represent linear, quadratic polynomial, cubic polynomial, spline with one knot, spline with two knots, and spline with three knots. Default is set to 3.), multiple_run_ELBO_compare (ELBO of each runs).
# minimal, always-run example (tiny data/iterations) set.seed(1) inference_results <- wrapper_ATM(HES_age_example[1:500,], topic_num = 2, CVB_num = 1)# minimal, always-run example (tiny data/iterations) set.seed(1) inference_results <- wrapper_ATM(HES_age_example[1:500,], topic_num = 2, CVB_num = 1)
Run LFA on diagnosis data to infer topic loadings and topic weights. Note one run of LFA on 100K individuals would take ~30min (defualt is 5 runs and pick the best fit); if the data set is small and the goal is to infer patient-level topic weights (i.e. assign comorbidity profiles to individuals based on the disedases), please use loading2weights.
wrapper_LFA( rec_data, topic_num, CVB_num = 5, save_data = FALSE, beta_prior_flag = FALSE, topic_weight_prior = NULL )wrapper_LFA( rec_data, topic_num, CVB_num = 5, save_data = FALSE, beta_prior_flag = FALSE, topic_weight_prior = NULL )
rec_data |
A diagnosis data frame with three columns; format data as HES_age_example; first column is individual ids, second column is the disease code; third column is the age at diagnosis. Note for each individual, we only keep the first onset of each diseases. Therefore, if there are multiple incidences of the same disease within each individual, the rest will be ignored. |
topic_num |
Number of topics to infer. |
CVB_num |
Number of runs with random initialization. The final output will be the run with highest ELBO value. |
save_data |
A flag which determine whether full model data will be saved. If TRUE, a Results/ folder will be created and full model data will be saved. Default is set to be FALSE. |
beta_prior_flag |
A flag if true, will use a beta prior on the topic loading. Default is set to be FALSE. |
topic_weight_prior |
prior of individual topic weights, default is set to be a vector of one (non-informative) |
Return a list object with topic_loadings (of the best run), topic_weights (of the best run), ELBO_convergence (ELBO until convergence), patient_list (list of eid which correspond to rows of topic_weights), ds_list (gives the ordering of diseases in the topic_loadings object), disease_number (number of total diseases), patient_number(total number of patients), topic_number (total number of topic), ,multiple_run_ELBO_compare (ELBO of each runs).
HES_age_small_sample <- HES_age_example[1:100,] inference_results <- wrapper_LFA(HES_age_small_sample, topic_num = 3, CVB_num = 1)HES_age_small_sample <- HES_age_example[1:100,] inference_results <- wrapper_LFA(HES_age_small_sample, topic_num = 3, CVB_num = 1)