Predicting health outcomes with intensive longitudinal data collected by mobile health devices: a functional principal component regression approach | BMC Medical Research Methodology


Owing to the ubiquity of smartphones and Bluetooth devices in the consumer market, coupled with fast-developing mobile health technologies, health data have become easy to capture, store, and access [1]. This new type of device data, usually referred to as intensive longitudinal data (ILD) [2], can be measured tens, hundreds, or even thousands of times within a specific time interval, such as an hour, a day, or a month. Compared with traditional clinical measurements taken at a small number of discrete clinic visits or in panel surveys, ILD generated by mobile health devices can capture trends in the data at a more granular level. This abundance of near-real-time data provides tremendous opportunities for disease monitoring, early risk prediction, and prevention in healthcare [3]. Specifically, because self-monitoring between clinic visits is essential for managing chronic diseases such as type 2 diabetes and hypertension [1, 4], many patients use mobile health devices to collect and self-monitor various health indicators and health behaviors on a daily basis over long periods. There is an emerging need to use these intensively collected data to support patients with chronic illnesses in managing their conditions between clinic visits.

While a variety of mobile health technologies can facilitate data collection, there are considerable challenges in managing and analyzing the ILD they generate. In particular, because of the singularity issue that arises when the number of repeated measurements exceeds the number of participants, standard regression models may not allow coefficients to be estimated uniquely [4]. A simple and traditional way to handle ILD is the response feature approach, in which the data are summarized either by a single summary statistic (e.g., the mean or median) or by several repeated summary statistics over certain time windows, such as averaging measurements by week or month [2]. The data can then be analyzed using linear models or linear mixed models. However, this approach results in a loss of information, and there is no clear evidence about which time interval is meaningful to use for the summary statistics. Therefore, a better way to analyze intensive longitudinal data while retaining most of its value is needed.
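
For concreteness, the response feature approach can be sketched in a few lines of R. The long-format data frame `dat` (columns `id`, `day`, `glucose`) and the subject-level outcome data frame `outcome` (columns `id`, `hba1c`) are hypothetical and serve only to illustrate collapsing ILD into summary features before fitting an ordinary linear model.

```r
# Hypothetical long-format ILD: one row per glucose reading
# dat:     columns id, day, glucose
# outcome: columns id, hba1c (one row per subject)

dat$week <- ceiling(dat$day / 7)                                        # assign each reading to a week

weekly       <- aggregate(glucose ~ id + week, data = dat, FUN = mean)  # weekly summary statistics
subject_mean <- aggregate(glucose ~ id, data = weekly, FUN = mean)      # single summary per subject

# Ordinary linear model on the single summary feature
fit <- lm(hba1c ~ glucose, data = merge(outcome, subject_mean, by = "id"))
summary(fit)
```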

A prominent feature of ILD is their continuous-time nature: they can be inherently represented by an underlying curve, a stochastic process, or a function over time. For instance, although a patient with diabetes typically measures their blood glucose level only a few times a week, the underlying value exists at any time within the period, so the measurements can be considered functional data. Functional data analysis (FDA) is a class of statistical approaches specially designed to represent this data structure (an underlying smooth curve) in ILD and to summarize the trend using a small number of variables [5]. Specifically, functional principal component analysis (fPCA), an emerging first-line approach in FDA, has been used to recover complex individual trajectories [6,7,8,9,10,11,12] and to cluster patients based on their distinct trajectory patterns [13,14,15]. Functional regression modelling, which uses functional data as covariates through fPCA, was developed to explore the longitudinal association between ILD and a scalar outcome [16]. While this offers a promising statistical tool for extracting trend information from ILD to assess longitudinal associations and conduct risk prediction [17], applications of the method in mobile health research are scarce because of the complexity and relative unfamiliarity of FDA.

This paper serves as a timely and practical guide to illustrate the use of the functional regression model in assessing the longitudinal relationship between ILD and a health outcome, making risk predictions, and recovering individual trajectories. We provide a brief introduction to the functional regression model and the statistical software available for conducting this analysis. We then provide an illustrative example that demonstrates the functional regression analysis process step by step, using data collected from a mobile health study of patients with type 2 diabetes.

Functional Data Analysis (FDA) and functional regression model

The concept of functional data and the use of functional data analysis for ILD were introduced by Ramsay & Silverman [18, 19]. Although ILD are measured discretely, they can be considered functional data because the true values are continuous over a time interval and are governed by an underlying smooth curve or function. The basic idea of FDA is to extract trend information from the ILD and construct a functional curve for each subject as a linear combination of a small number of basis functions, using a variety of statistical techniques including basis expansion and roughness penalties. Various dimension reduction methods can then be applied to the functional object, with fPCA being one of the most widely used due to its flexibility. fPCA is the extension of standard principal component analysis (PCA) [20] to the functional space. While PCA handles multivariate data as discrete observations, which suits cross-sectional data, fPCA models the data as a stochastic process composed of smooth trajectories rather than discrete data points, which is better suited to longitudinal data [21]. This approach is particularly well suited to our ILD, as it enables us to model the latent trajectory of blood glucose levels across a specific time frame; such modeling offers valuable insight into the dynamic relationship between these levels and health outcomes as time progresses. Conceptually, fPCA captures the variation in functional/longitudinal data using a few functions of time weighted by uncorrelated variables. After reducing the ILD to a linear combination of a few functional principal components, these components can be used as the outcome (functional response model), as predictors (scalar-on-function regression model), or as both (function-on-function regression model). An excellent review of the types of functional regression models based on fPCA is provided in textbooks on functional data analysis [16, 22,23,24]. In this section, we focus on the scalar-on-function regression model [25] to study the association between ILD and a scalar outcome. The model is formulated as

$$Y_i = \alpha + \int X_i(t)\,\beta(t)\,dt + \epsilon_i,$$

(1)

where \(\alpha\) is the intercept, \(\beta(t)\) is the coefficient function of time t, which indicates the importance of each measurement over time with respect to the scalar outcome \(Y\), and \(\epsilon_i\) is the random error, assumed to follow a \(N(0, \sigma^2)\) distribution, \(i=1,\dots,n\). The main difference from regular linear regression is that both the regressor \(X_i(t)\) and the coefficient function \(\beta(t)\) are functions of time t. There are different ways to obtain a unique estimate of \(\beta(t)\), and the fPCA-based method is the most commonly used. The estimation is conducted in two stages.
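
Before describing the two stages, we note that Eq. (1) can also be fit directly with off-the-shelf software. The sketch below assumes an n x T matrix `X` of measurements on a common grid `tt` and an outcome vector `y`, and uses pfr() and lf() from the "refund" package introduced later in this paper; note that lf() estimates the coefficient function with a penalized spline basis by default, rather than the fPCA expansion developed in the next two stages.

```r
library(refund)

# X:  assumed n x T matrix of repeated measurements on a common grid
# tt: vector of the T time points; y: n-vector of scalar outcomes
fit <- pfr(y ~ lf(X, argvals = tt))   # lf() declares the linear functional term (the integral of X_i(t) * beta(t) dt)
summary(fit)

# pfr objects build on mgcv::gam, so plot() displays the estimated beta(t)
plot(fit, xlab = "t", ylab = expression(hat(beta)(t)))
```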

In the first stage, we represent the intensively measured longitudinal data by smooth random functions \(X_i(t)\). The fPCA approach models the covariance of the data as a smooth function of the time points. The dimension of ILD is usually large given the large number of time points, and the correlations between the repeated measurements are high. fPCA uses the Karhunen–Loève expansion to extract orthogonal functions that represent the most prominent trends of variation in the data. For the ith person, assuming the ILD have been centered [16, 26,27,28,29], the underlying trajectory \(X_i(t)\) can be approximated by

$$X_i(t)\approx \sum_{j=1}^{p}\widehat{\zeta}_{ij}\,\widehat{\upsilon}_j(t),$$

(2)

where \(\widehat{\upsilon}_j(t)\) is the jth estimated eigenfunction, or estimated functional principal component (EFPC), of the covariance function of \(X(t)\) among the top \(p\) EFPCs, and \(\widehat{\zeta}_{ij}\) is the corresponding jth estimated random score for the ith person, assumed to follow an independent and identically distributed (i.i.d.) normal distribution. The first component \(\upsilon_1(t)\) represents the most prominent trend of deviation from the mean function, since it explains the largest portion of the variance. The score \(\zeta_{ij}\) associated with each component describes how much \(\upsilon_j(t)\) contributes to the ith person's subject-specific deviation from the population mean function. Throughout the paper, a hat over a parameter indicates an estimate of that parameter or function.
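
As a minimal sketch of this first stage, the code below obtains the quantities in Eq. (2) with fpca.sc() from the "refund" package (the "face" package used later provides an analogous routine for sparse data). The matrix `X` (n subjects by T grid points) and the grid `tt` are the same assumed inputs as above; this is an illustration rather than the original study code.

```r
library(refund)

fpca_fit <- fpca.sc(Y = X, argvals = tt, pve = 0.99)  # retain components explaining 99% of the variance

p      <- fpca_fit$npc          # number of retained components
phi    <- fpca_fit$efunctions   # T x p matrix: estimated eigenfunctions upsilon_j(t)
scores <- fpca_fit$scores       # n x p matrix: estimated subject scores zeta_ij
Xhat   <- fpca_fit$Yhat         # n x T matrix: recovered smooth trajectories X_i(t)
```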

After representing \(X_i(t)\) by a few principal components, in the second stage we proceed to the regression model. It is assumed that the coefficient function \(\beta(t)\) in Eq. (1) can be expanded in the same eigenfunctions, such that

$$\beta(t)=\sum_{j=1}^{p}\beta_j\,\upsilon_j(t).$$

(3)

Replacing \(X_i(t)\) by its expansion in (2), the regression model in Eq. (1) becomes a regular linear regression model, as shown below:

$$Y_i = \alpha + \int \beta(t)\left( \sum_{j=1}^{p} \widehat{\zeta}_{ij}\,\widehat{\upsilon}_j(t) \right)dt + \epsilon_i = \alpha + \sum_{j=1}^{p} \widehat{\zeta}_{ij}\,\beta_j + \epsilon_i,$$

(4)

where \(\widehat{\zeta}_{ij}\) is the functional score estimated in (2), which can be treated as a pseudo-covariate after dimension reduction, \(\alpha\) is the intercept, and \(\beta_j=\int \beta(t)\widehat{\upsilon}_j(t)\,dt\) is the coefficient for the jth component. As in regular linear regression, we obtain the estimated intercept \(\widehat{\alpha}\) and the estimated coefficient \(\widehat{\beta}_j\) for each component by least squares. We then use the estimated coefficients \(\widehat{\beta}_j\) in Eq. (3) to compute the original coefficient function \(\widehat{\beta}(t)\) as follows:

$$\widehat{\beta}(t)=\sum_{j=1}^{p}\widehat{\beta}_j\,\widehat{\upsilon}_j(t).$$

(5)

More detailed modeling and estimation steps can be found in the supplemental materials.
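
Continuing the sketch above, the second stage reduces to ordinary least squares on the estimated scores (Eq. 4), after which the coefficient function is reconstructed on the original grid via Eq. (5). Keep in mind that the scaling convention for the discretized eigenfunctions differs across packages, so a quadrature-weight adjustment may be needed in practice.

```r
# Stage 2: regular linear regression of the scalar outcome on the pseudo-covariates (scores)
stage2    <- lm(y ~ scores)
alpha_hat <- coef(stage2)[1]    # estimated intercept
beta_j    <- coef(stage2)[-1]   # estimated coefficients beta_j, one per component

# Eq. (5): coefficient function evaluated on the grid tt
beta_t_hat <- phi %*% beta_j
plot(tt, beta_t_hat, type = "l", xlab = "t", ylab = expression(hat(beta)(t)))
```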

Commonly used estimation methods for fPCA include smoothing and imputation approaches [5]. Missing data can be handled either by removing records that contain missing values or by applying missing data imputation. When there is a large amount of missing data, or when the repeated measures are noisy or taken at irregular time points, fPCA for sparse functional data can be used. This method borrows information across subjects and produces more stable and accurate estimates [30, 31].

Several statistical software packages are readily available for FDA. The R and MATLAB package “fda” [26] was first developed to implement the basic tools of FDA, and the “refund” R package [32] was built to provide more flexible and advanced functional models, including various functional regression models. In addition, the “face” package [33] was specially designed to conduct fPCA for sparse functional or longitudinal data. More recently, the R package “mfaces” [34] was developed to extend multivariate fPCA to multiple sparse functional variables. In our illustrative example, we implement the fPCA using the “face” package in R.
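
As a minimal sketch, assuming the same hypothetical long-format data frame of irregular readings as above, a call to face.sparse() looks as follows. The only package-specific detail relied on here is that the input data frame has columns named exactly argvals, subj, and y; the components of the fitted object (estimated mean, covariance, eigenfunctions, and subject-level predictions) are documented in the package help and are not spelled out here.

```r
library(face)

# Hypothetical long-format data: one row per glucose reading
dat_face <- data.frame(argvals = dat$day,      # observation time
                       subj    = dat$id,       # subject identifier
                       y       = dat$glucose)  # measured value

# Sparse fPCA: borrows information across subjects to estimate the mean,
# covariance, and eigenfunctions from irregular, noisy measurements
sparse_fit <- face.sparse(dat_face)
```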

An empirical example: functional PCA regression model using intensive mobile health data

As an illustrative example, we built a scalar-on-function regression model using data from an observational study designed to explore the feasibility of using multiple mobile health devices to facilitate patients’ self-management of their type 2 diabetes mellitus [35]. While blood glucose is an important measure for day-to-day management, hemoglobin A1c (HbA1c) reflects the average blood glucose level over the past 2–3 months, offering a more stable and comprehensive view of blood sugar control. Furthermore, HbA1c is the only measure of glycemia that has been studied as a means to predict long-term microvascular and macrovascular diabetes complications. Thus, HbA1c remains the single most important glycemic measure for providers and patients alike. Although HbA1c is the main health indicator for patients with type 2 diabetes mellitus, patients usually need to visit a clinic and have HbA1c checked in a lab every 3–6 months [36]. Between clinic visits, patients were asked to monitor their blood glucose using a glucometer at least weekly. While there is a suggested target range for blood glucose, blood glucose fluctuates widely depending on the time of measurement, diet, and other factors [37]. Although a calculator is available to convert average blood sugar to HbA1c, patients may find it challenging to calculate their average blood sugar accurately. In a recently conducted qualitative study, patients expressed a preference for receiving a projection of their HbA1c every time they input self-measured blood glucose readings from a glucometer [38]. Ideally, it would be convenient to develop a prediction model that could be incorporated into the mobile device to predict HbA1c from all of a patient’s glucometer readings. Additionally, we know that HbA1c reflects red blood cell turnover, which typically occurs every 3–4 months; however, no studies have explored the actual longitudinal relationship between blood glucose and HbA1c. Our hypothesis is that HbA1c should disproportionately reflect blood glucose measures from more recent days. In this example, we demonstrate how to build a scalar-on-function regression model to explore the longitudinal relationship between blood glucose measured intensively over three months and the health outcome HbA1c, to predict HbA1c, and to showcase the ability of fPCA to recover the smooth curve underlying each individual’s intensively measured glucose data over those three months.
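
The following end-to-end sketch shows how such an analysis could be organized in R under assumed data structures: a long-format data frame glucose_long (columns id, day taking values 1–90, and glucose) and a data frame labs (columns id and hba1c measured at the end of the 90-day window). It strings together the two stages illustrated earlier and is not the study’s actual code; in particular, it assumes the fPCA routine tolerates NA entries for days without a reading (otherwise the sparse tools above can be used).

```r
library(refund)

# Reshape irregular daily readings into an n x 90 matrix (NA where no reading was taken)
ids <- sort(unique(glucose_long$id))
tt  <- 1:90
X   <- matrix(NA_real_, nrow = length(ids), ncol = length(tt))
X[cbind(match(glucose_long$id, ids), glucose_long$day)] <- glucose_long$glucose

# Stage 1: fPCA recovers each patient's smooth glucose trajectory and scores
fpca_fit <- fpca.sc(Y = X, argvals = tt, pve = 0.99)

# Stage 2: regression of HbA1c on the functional scores
y   <- labs$hba1c[match(ids, labs$id)]
fit <- lm(y ~ fpca_fit$scores)

# Predicted HbA1c from 90 days of glucometer readings, and recovered trajectories
hba1c_pred   <- fitted(fit)
glucose_Xhat <- fpca_fit$Yhat
```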

Design

The parent study was a single-arm longitudinal observational study. Each patient was provided with a cellular-enabled scale and a smartphone-tethered wrist-worn activity tracker and glucometer. Daily self-measurements of weight, physical activity, and blood glucose were collected over 6 months [35, 39]. Data were aggregated on a research platform.

Study participants

Sixty adult patients with type 2 diabetes mellitus were recruited from the Duke Family Medicine Center. Eligible participants were at least 18 years old, able to speak and read English, diagnosed with type 2 diabetes mellitus, prescribed to monitor their blood sugar at least weekly, on diabetes-related medication, and owned an Android or iOS smartphone.
