
Research Methods & Reporting

Process guide for inferential studies using healthcare data from routine clinical practice to evaluate causal effects of drugs (PRINCIPLED): considerations from the FDA Sentinel Innovation Center

BMJ 2024; 384 doi: https://doi.org/10.1136/bmj-2023-076460 (Published 12 February 2024) Cite this as: BMJ 2024;384:e076460
  1. Rishi J Desai, associate professor1,
  2. Shirley V Wang, associate professor1,
  3. Sushama Kattinakere Sreedhara, research scientist1,
  4. Luke Zabotka, research assistant1,
  5. Farzin Khosrow-Khavar, research fellow1,
  6. Jennifer C Nelson, senior investigator2,
  7. Xu Shi, assistant professor3,
  8. Sengwee Toh, professor4,
  9. Richard Wyss, assistant professor1,
  10. Elisabetta Patorno, associate professor1,
  11. Sarah Dutcher, epidemiologist5,
  12. Jie Li, associate director for RWE5,
  13. Hana Lee, senior staff fellow5,
  14. Robert Ball, deputy director for Office of Surveillance and Epidemiology5,
  15. Gerald Dal Pan, director for Office of Surveillance and Epidemiology5,
  16. Jodi B Segal, professor6,
  17. Samy Suissa, professor7,
  18. Kenneth J Rothman, professor8,
  19. Sander Greenland, professor9,
  20. Miguel A Hernán, professor10,
  21. Patrick J Heagerty, professor11,
  22. Sebastian Schneeweiss, professor1
  1. Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02120, USA
  2. Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
  3. Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
  4. Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, USA
  5. US Food and Drug Administration, Silver Spring, MD, USA
  6. Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  7. Departments of Epidemiology and Biostatistics, and Medicine, McGill University, Montreal, QC, Canada
  8. Boston University School of Public Health, Boston, MA, USA
  9. Department of Epidemiology and Department of Statistics, University of California, Los Angeles, CA, USA
  10. CAUSALab and Departments of Epidemiology and Biostatistics, Harvard T H Chan School of Public Health, Boston, MA, USA
  11. Department of Biostatistics, University of Washington, Seattle, WA, USA
  Correspondence to: R J Desai rdesai{at}bwh.harvard.edu (or @RishiDesai11 on Twitter)
  • Accepted 11 December 2023

This report proposes a stepwise process for systematically considering key choices in study design and data analysis for non-interventional studies, with the central objective of fostering generation of reliable and reproducible evidence. These steps include (1) formulating a well defined causal question via specification of the target trial protocol; (2) describing the emulation of each component of the target trial protocol and identifying fit-for-purpose data; (3) assessing expected precision and conducting diagnostic evaluations; (4) developing a plan for robustness assessments including deterministic sensitivity analyses, quantitative bias analyses, and net bias evaluation; and (5) conducting inferential analyses.

Non-interventional studies, also referred to as observational studies, are conducted using real world data sources, typically including healthcare data that are generated during provision of routine clinical care (including health insurance claims and electronic health records). These studies provide an opportunity to fill in evidence gaps for questions that have not been answered by randomized trials.1 However, generating decision grade evidence from healthcare data requires a robust causal framework to avoid introducing bias. Numerous tools aimed at improving the conduct or reporting of these non-interventional studies are available. Broad guidance documents, such as the best practices for pharmacoepidemiological safety studies from the Food and Drug Administration (FDA)2 and the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP) guide on methodological standards in pharmacoepidemiology,3 discuss the methodology for non-interventional studies. Quality assessment tools such as ROBINS-I4 and the GRACE checklist5 assist with the evaluation of bias in published studies. Reporting tools such as RECORD-PE6 and STaRT-RWE7 provide checklists or structured templates to facilitate transparency in protocol reporting and reproducibility. Finally, the harmonized protocol template HARPER8 is supported by regulators to improve communication of key study parameters in non-interventional studies, and is deposited with protocol registration websites (eg, the Open Science Framework’s OSF.io and the European Medicines Agency’s ENCePP.eu).910 While useful for their specific purposes, these tools are not explicitly intended to guide the design and conduct of non-interventional studies that evaluate drug safety and effectiveness using healthcare data.

Other frameworks such as LEGEND11 and the causal roadmap12 outline broad general principles for evidence generation. However, they provide limited practical guidance on critical aspects of the process of evidence generation, including determining fitness for purpose of the data source, registering study protocols, considering principled adaptations over the course of a study, and planning robustness evaluations. To that end, we present a stepwise process covering the key choices in design and analysis that can influence the validity of such studies. We ground our discussion in the FDA Sentinel system, a national, postmarketing active surveillance system for drug products13 that uses large volumes of healthcare data from insurance claims and electronic health records, as a representative use case. The five step process outlined in this report covers formulating a well defined causal question via specification of the target trial protocol; describing the emulation of each component of the target trial protocol and identifying a fit-for-purpose data source; assessing expected precision and conducting diagnostic evaluations; developing a plan for robustness assessments including deterministic sensitivity analyses, quantitative bias analyses, and net bias evaluation; and conducting inferential analyses.

Summary points

  • Non-interventional studies (also referred to as observational studies) conducted using healthcare data that are generated during provision of routine clinical care (including health insurance claims and electronic health records) provide an opportunity to fill in evidence gaps for questions not answered by randomized trials

  • Despite several assessment and guideline tools available to evaluate the validity of such non-interventional studies, none proposes a practical guide for the conduct and analysis of these studies

  • PRINCIPLED (process guide for inferential studies using healthcare data from routine clinical practice to evaluate causal effects of drugs) is a stepwise process proposed to systematically consider key choices for study design and data analysis for non-interventional studies

  • The process outlined here can inform the conduct of non-interventional studies, facilitate transparent communications between various stakeholders, and could motivate similar considerations for the clinical research community

Overview of the proposed process guide

PRINCIPLED (process guide for inferential studies using healthcare data from routine clinical practice to evaluate causal effects of drugs) is a five step process to help ask and answer a causal question regarding drug treatment effects using healthcare data. We explicitly differentiate between a study planning phase (steps 1-4) where no inference is made, and a study analysis phase (step 5) where inferential analyses are conducted with the intent to draw causal inferences. Figure 1 shows an overview of the proposed steps. Sections below discuss each of the steps in detail. We illustrate the operationalization of each step through an example of the evaluation of sodium-glucose cotransporter-2 (SGLT-2) inhibitors, drugs used for type 2 diabetes treatment, with respect to the known safety concern of genital infections.14 While this process describes a general, iterative approach to resolving issues as they arise during the conduct of non-interventional studies, specific situations could necessitate deliberate deviation from these steps. Even in situations where the process cannot be fully implemented, a reasonable study could still be conducted, but certain trade-offs might need to be made.

Fig 1

Overview of the process guide for inferential studies using healthcare data from routine clinical practice

Step 1: Formulate a causal question via specification of the target trial protocol

Asking the right question in the right manner constitutes the first step in any process for causal inference about treatment effects from observed data.1516 A practical way to ask a causal question in non-interventional studies is to specify a protocol of the target trial—the pragmatic trial that would answer the causal question.1718 Among the key elements of the target trial protocol that need to be specified are eligibility criteria, treatment strategies, primary outcome(s) of interest, treatment assignment, start and end of follow-up, and causal contrast (eg, intention-to-treat or per protocol effect). Precise specification of the target trial protocol is critical because it has direct implications for analysis and interpretation. For instance, the specified eligibility criteria determine the population to which the results would apply. Table 1 summarizes the basic target trial protocol for our case example study.

Table 1

Target trial protocol for case example study evaluating the effect of sodium-glucose cotransporter-2 (SGLT-2) inhibitors on genital infections


Step 2: Describe the emulation of each component of the target trial protocol and identify a fit-for-purpose data source

Specifying the key components of the target trial protocol in step 1 yields a list of the data elements necessary to emulate it. Next, confounders that are necessary to emulate baseline randomization should be identified. Causal diagrams, such as causal directed acyclic graphs, are useful for making decisions about confounder selection when sufficient content knowledge is available.1920 Importantly, adjustment for colliders and instrumental variables should be avoided.21

Once all data elements are outlined, investigators need to describe the emulation of each component of the target trial protocol by providing a precise description of variable definitions, including all codes and algorithms used for eligibility criteria, treatment strategies (including treatment initiation and discontinuation), outcomes, and confounders (step 2a). Data analyses that would be implemented if the data from the target trial were available should also be described in detail. Structured protocol templates such as STaRT-RWE7 and HARPER8 are available to assist with transparent reporting of the study protocol. A design diagram is suggested to summarize visually the longitudinal design aspects of a study.22

Next, investigators need to identify fit-for-purpose data sources that contain all data elements needed for successful emulation of the target trial (step 2b). Target trial specification is an iterative process that depends on the availability of data to support the emulation. If certain data elements are not included in the data source being considered, investigators can consider alternate data sources.

As an example of selection of fit-for-purpose data, we consider the Sentinel system, which contains structured data from health insurance claims representing 844 million person years of observation between 2000 and 2021 across a large network of data providers,23 and is increasingly being enriched with linked data from electronic health records.24 Figure 2 outlines an approach to assessing fitness for purpose that is compatible with FDA draft guidance to industry on real world data.25 Two key considerations are data relevance and data reliability. For determination of relevance, we consider the context of Sentinel, where most of the data come from insurance claims, and ancillary sources (including electronic health records) provide opportunities for augmentation. In this case, relevance determination depends on a series of questions focused on measurement characteristics of four variable types central to the research question of interest in insurance claims data: eligibility criteria, outcome, treatment, and key confounders. If measurement of any of these variables is deemed to be insufficient, augmentation of insurance claims with alternate sources such as linked electronic health records would be needed. We describe below the specific nuances when considering these four key questions.

Fig 2

Determining fit-for-purpose data sources (step 2b of the process guide for inferential studies using healthcare data from routine clinical practice). HbA1c=glycated hemoglobin; EHR=electronic health records. *Quality=accuracy with respect to timing and completeness for treatments; positive predictive value, sensitivity, and specificity for binary outcomes; proportion missing for continuous outcomes; accurate onset for time to event outcomes; and availability of long term follow-up data for latent outcomes

  • Question 1: Can the eligibility criteria be emulated with sufficient accuracy?

    Certain eligibility criteria specified in the target trial protocol (eg, some medical conditions) might not be explicitly identifiable in insurance claims and a previously validated phenotyping algorithm might not be available. In these circumstances, linkage to electronic health records will be needed for development and validation of phenotyping algorithms identifying the health conditions of interest using claims based proxy information.

    For instance, heart failure subtypes of preserved and reduced ejection fraction are not directly identifiable in insurance claims owing to lack of ejection fraction measurements. A probabilistic phenotyping algorithm for identifying these ejection fraction subtypes was developed using Medicare claims linked to electronic health records from the Mass General Brigham healthcare system, and demonstrated overall accuracy of 83% in differentiating between preserved and reduced ejection fraction subtypes.26 This development work facilitated deployment of the algorithm in national Medicare claims data to study drug treatment outcomes in these specific populations of interest.2728 In circumstances where a developed algorithm demonstrates suboptimal performance, limiting the analysis to individuals with linked insurance claims and electronic health records and a pre-treatment measurement of the eligibility criteria might be needed to prevent bias, at the expense of transportability.

  • Question 2: Is the outcome of interest measured with sufficient quality?

    The quality of outcome measurement depends on positive predictive value for binary outcomes, proportion missing for continuous outcomes, and accurate onset for time-to-event outcomes. Serious medical conditions (eg, stroke) are typically recorded adequately in insurance claims,29 but other outcomes are not, including those that require confirmatory laboratory test results (eg, acute pancreatitis30) or contextual information from free text notes (eg, suicidal ideation31). For such outcomes, data augmentation through linkage of insurance claims with electronic health records is required.

    Outcome-identifying algorithms (including those using only claims based information) can be developed, improved, and validated based on chart reviews using linked electronic health records. If an algorithm using only claims based information shows acceptable performance, it can be applied to the larger insurance claims data source. In cases where claims based algorithms are insufficient but electronic health record sources provide sufficient augmentation to identify the outcome, researchers could consider restricting their population to patients with linked claims-electronic health record data. Judgments on the quality required for an algorithm to be considered sufficient for use in inference can be subjective; however, implementing a simplified rule on performance parameters (eg, requiring a positive predictive value of ≥85%) might not be helpful. Whether to proceed with the analysis is a multifaceted decision that considers factors such as the urgency of information needed and the severity of the adverse event. Knowing the measurement characteristics through validation in linked electronic health records, even when they are suboptimal, will enable quantitative bias analysis32; a minimal sketch of such a validation appears after this list. More details on quantitative bias analysis are given below in step 4. In analyses across a network of databases, the transportability of measurement algorithms and the measurement qualities across databases might need to be demonstrated.

  • Question 3: Is the treatment measured with sufficient quality?

    Quality of measurement for a particular treatment refers to the accuracy of recording in insurance claims data with respect to timing and completeness. For many products, such as outpatient prescription drug treatments, insurance claims are generally sufficient to capture treatment through outpatient pharmacy dispensing records. However, one example of a treatment that is often insufficiently recorded in claims is blood transfusion products.33 In such circumstances, alternate data sources that have information on inpatient administrations are needed to answer the research question. If dynamic treatment strategies are being compared, the time-varying clinical factors used to define the strategies over time should also be available.34

  • Question 4: Are key confounders recorded?

    If a strong confounder is not adequately measured in insurance claims, data augmentation with electronic health records or laboratory test results might need to be considered. For example, baseline glycated hemoglobin (HbA1c) test results for a study comparing two glucose-lowering drug treatments with respect to an adverse outcome might require augmentation. Added information on confounders achieved through augmentation might be useful to assess the potential for uncontrolled confounding,35 and for performing additional analyses such as statistical calibration of the study results to incorporate knowledge about unmeasured confounders.36
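
In practice, the measurement quality questions above reduce to estimating validation metrics for a claims based algorithm against a reference standard (eg, chart review) in a linked subset. The following R function is a minimal sketch of that calculation on simulated data; the function name, variables, and performance values are hypothetical illustrations, not code or results from the Sentinel system.

    # Minimal sketch (R): validating a claims based outcome algorithm against a
    # chart review reference standard in a linked claims-EHR subset.
    validate_algorithm <- function(alg_flag, chart_flag) {
      tp <- sum(alg_flag == 1 & chart_flag == 1)   # true positives
      fp <- sum(alg_flag == 1 & chart_flag == 0)   # false positives
      fn <- sum(alg_flag == 0 & chart_flag == 1)   # false negatives
      tn <- sum(alg_flag == 0 & chart_flag == 0)   # true negatives
      c(ppv         = tp / (tp + fp),              # positive predictive value
        sensitivity = tp / (tp + fn),
        specificity = tn / (tn + fp))
    }

    # Simulated linked subset with imperfect measurement (hypothetical values)
    set.seed(1)
    chart <- rbinom(500, 1, 0.10)                      # chart review "truth"
    alg   <- ifelse(chart == 1, rbinom(500, 1, 0.85),  # ~85% sensitivity
                                rbinom(500, 1, 0.02))  # ~98% specificity
    round(validate_algorithm(alg, chart), 2)

Metrics estimated this way, even when suboptimal, supply the bias parameters needed for the quantitative bias analyses described in step 4.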

Data sources meet the basic criteria for relevance, potentially through various augmentation strategies if needed, when they provide explicitly characterized eligibility criteria, primary outcomes, treatment, and key confounders. Additionally, initial feasibility assessment of the number of patients potentially available for the study might be needed to ensure relevance. For example, such assessments could include an initial evaluation of the number of new users of study drug treatments of interest in the data source(s) being considered.

The second aspect of fitness for purpose of a data source is data reliability, which includes assessments of accuracy, completeness, provenance, and traceability of the source data (fig 2).25 Within Sentinel, these evaluations are performed upstream when converting raw data from contributing sources to the Sentinel common data model—which is then used for all subsequent analyses.37 Data sources that meet both relevance and reliability criteria can be considered fit for purpose for the study question of interest.

If emulation of each component of the target trial protocol is not feasible with the data source being considered, investigators can reassess the question in step 1 by specifying a modified target trial protocol that requires a different set of data elements while still asking a causal question of interest. Investigators are encouraged to record all assessments of data relevance and data reliability to trace key design decisions leading to selection of fit-for-purpose data that can support the corresponding trial emulation.

If emulation of each component of the target trial protocol is feasible with the data source being considered, investigators should consider registration of the study protocol at this stage, before proceeding with assessment of expected precision and diagnostic evaluations (step 3). An alternative to protocol registration is publication of the target trial protocol along with the annotated computer code, while making the data available to interested investigators whenever feasible. Pre-registration of protocols and data sharing agreements can serve as a deterrent to data dredging, which is a common concern with analyses of healthcare data.38

For the case example study, demographics (age, sex, race, socioeconomic status markers); variables related to diabetes severity, including microvascular and macrovascular complications; measures related to diabetes control such as HbA1c; comorbid conditions; co-treatments; markers of healthy behavior; and healthcare use were considered confounders owing to their likely association with treatment choice and outcome risk. We describe the emulation of each component of the target trial protocol by providing a precise description of the operationalization of variable definitions, including all codes and algorithms, using the HARPER8 template (web appendix 2). For statistical analysis, we estimated the hazard ratio (averaged over the follow-up period) via a Cox model adjusted for baseline confounding with propensity score stratification and weighting,3940 as in previous studies with low incidence of treatment initiation and rare safety outcomes.41 Other adjustment methods, such as the parametric g formula or inverse probability weighting, might be required when emulating trials with sustained treatment strategies and thus with time-varying treatments.42 We also specified analyses stratified by sex, age, and baseline risk factors for infections as subgroup analyses of interest, to evaluate potential effect measure modification by these characteristics.

Appendix figure 1 answers questions 1-4 to provide clarity on likely fit-for-purpose data for our case example. Briefly, outcome and treatment are well captured in Medicare claims; however, linkage to electronic health records could be important to ascertain clinical factors that are used as eligibility criteria or confounders. In this case example, we used US Medicare fee-for-service claims data from parts A, B, and D, deterministically linked by health insurance claim numbers, date of birth, and sex (linkage success rate 99.2%) to electronic health records from the Mass General Brigham healthcare system in Boston.

Step 3: Assess expected precision and conduct diagnostic evaluations

After clearly specifying all design choices and registering a study protocol, the next important design component is assembling the study population using all eligibility criteria to assess expected precision and to conduct diagnostic evaluations. These evaluations could allow for principled study adaptations, yet little formal guidance exists regarding this activity. We fill this gap by outlining a systematic approach in figure 3.

Fig 3

Assessing expected precision and conducting diagnostic evaluations (step 3 of the process guide for inferential studies using healthcare data from routine clinical practice). PS=propensity score

  • Step 3a: Assess expected precision

    For emerging safety signals where the effect size is likely not known, the decision to proceed with analyses should depend on the importance of the information gained from a public health perspective.43 However, during the planning phase, it might be helpful to gauge the expected precision based on the selected data source and design choices to determine whether adjustments are needed to achieve the desired level of precision.44 Based on the outcome counts and sizes of the two treatment groups, researchers can estimate the variance of the log risk ratio using well known formulas and assumptions about the magnitude of the risk ratio.44 We provide an R function to estimate expected precision based on the sizes of the two treatment groups and combined outcome counts across the groups as supplemental material (web appendix 3).

  • Step 3b: Diagnostic evaluations

    Diagnostic evaluations are key components of non-interventional studies because they can alert researchers to potential violations of the core assumptions of causal inference. For instance, examining the distribution of baseline characteristics in the treatment groups being compared is an important diagnostic to detect positivity violations.45 Evaluating the average length of time during which patients adhere to their assigned treatment strategies, and examining the characteristics of patients who deviate from those strategies, could alert researchers to the possibility of informative censoring, which could threaten exchangeability. Other analysis specific diagnostic criteria might also be helpful. For instance, when using analyses based on propensity scores, evaluating baseline covariate balance after conditioning on the propensity score could serve as a diagnostic for model misspecification.404647 If inverse probability weighting is used to adjust for informative censoring or time-varying confounding, evaluating the distribution of weights over time could serve as a diagnostic for weight model misspecification.48 For analysis specific diagnostics, refining modelling choices could lead to resolution of issues.

If the assessment indicates lower than desirable precision, or diagnostic evaluations indicate violations of core causal inference assumptions that cannot be resolved by refining modelling choices, investigators can consider going back to step 2 and changing some design choices, such as eligibility criteria or choice of the comparator group, before proceeding. This suggestion is analogous to an amendment of the study protocol, which is common in prospective randomized trials in response to extraneous factors such as recruiting challenges.49 Similar to the guidance regarding protocol amendments for prospective trials, reasons for changes in the protocol of non-interventional studies using secondary healthcare data need to be clearly documented, as do any changes in the causal contrasts that result from protocol changes. To maintain analyst blinding with respect to the treatment-outcome association and to preserve study integrity, researchers should also ensure that protocol amendments are not introduced in response to inferential analyses (step 5).

For our case example in step 3a, the expected 95% confidence interval under an assumed null effect on the relative scale (1.0) of SGLT-2 inhibitors on the risk of genital infections was 0.35 to 1.65. This result is imprecise because only 1498 patients, with only 40 outcomes, were eligible for analysis. Because the small sample size is partly due to the inclusion criterion of HbA1c test results before initiation of drug treatment (appendix fig 2), we could go back to step 2 and consider relaxing this inclusion criterion, which would increase the number of eligible individuals to 9339 (293 events) with a 95% confidence interval of 0.73 to 1.27. However, relaxing this criterion assumes that not adjusting for HbA1c in the main analysis does not introduce major confounding bias. Appendix table 1 provides a revised target trial table highlighting the one protocol change prompted by assessment of expected precision.
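
The expected precision calculation in step 3a can be approximated with the standard large sample variance of the log risk ratio. The R function below is a minimal sketch of this idea, not the supplemental function from web appendix 3, and the even split of patients between treatment groups in the example call is an assumption for illustration, so the resulting interval will differ somewhat from the one reported above.

    # Minimal sketch (R): expected 95% confidence interval for a risk ratio,
    # given group sizes, combined outcome count, and an assumed true risk ratio.
    expected_ci <- function(n1, n0, events_total, rr_assumed = 1.0) {
      # Split expected events between groups in proportion to group size and RR
      e1 <- events_total * (n1 * rr_assumed) / (n1 * rr_assumed + n0)
      e0 <- events_total - e1
      var_log_rr <- 1/e1 - 1/n1 + 1/e0 - 1/n0  # large sample variance of log RR
      exp(log(rr_assumed) + c(-1.96, 1.96) * sqrt(var_log_rr))
    }

    # Relaxed eligibility scenario: 9339 patients (hypothetical even split)
    # and 293 combined events, under an assumed null effect
    round(expected_ci(n1 = 4670, n0 = 4669, events_total = 293), 2)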

For step 3b, we used this cohort of 9339 patients meeting eligibility criteria per the amended protocol. We estimated the probability of initiating SGLT-2 inhibitors versus DPP-4 (dipeptidyl peptidase-4) inhibitors given baseline patient characteristics (ie, the propensity score) using multivariable logistic regression models, created 50 strata based on the distribution of propensity scores in patients receiving SGLT-2 inhibitor treatment, and weighted DPP-4 inhibitor initiators proportional to the distribution of SGLT-2 inhibitor initiators in the propensity score stratum into which they fell.39 As diagnostics for propensity score models, we evaluated distributional overlap (appendix fig 3), weight distribution (appendix fig 4), and covariate balance using standardized differences after weighting (appendix tables 2 and 3).4050 SAS macros used to conduct the analysis and generate diagnostic figures are publicly available.51 All SAS codes are also posted on https://dev.sentinelsystem.org/projects/IC/repos/ic_ci2_principled/browse.
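
To make the weighting scheme concrete, the following R sketch implements propensity score estimation, stratification on the propensity score distribution of the treated group, weighting of comparators, and a standardized difference diagnostic, all on simulated data. It mirrors the logic described above but is not the published SAS macro; the covariates and simulated values are hypothetical.

    # Minimal sketch (R): propensity score (PS) stratification with weighting of
    # the comparator group, plus a standardized difference balance diagnostic.
    set.seed(42)
    n   <- 5000
    age <- rnorm(n, 65, 8)
    chf <- rbinom(n, 1, 0.2)
    trt <- rbinom(n, 1, plogis(-0.5 + 0.02 * (age - 65) + 0.4 * chf))  # 1=SGLT-2i

    # Step 1: estimate the PS with multivariable logistic regression
    ps <- predict(glm(trt ~ age + chf, family = binomial), type = "response")

    # Step 2: form 50 strata from the PS distribution among treated patients
    breaks <- quantile(ps[trt == 1], probs = seq(0, 1, length.out = 51))
    breaks[1] <- -Inf; breaks[51] <- Inf
    stratum <- cut(ps, breaks = breaks, labels = FALSE)

    # Step 3: weight comparators proportional to the treated distribution
    # across strata; treated patients keep weight 1
    n1s <- table(factor(stratum[trt == 1], levels = 1:50))
    n0s <- table(factor(stratum[trt == 0], levels = 1:50))
    w   <- ifelse(trt == 1, 1, (n1s / pmax(n0s, 1))[stratum])

    # Diagnostic: weighted standardized difference for baseline covariates
    smd <- function(x, trt, w) {
      m1 <- weighted.mean(x[trt == 1], w[trt == 1])
      m0 <- weighted.mean(x[trt == 0], w[trt == 0])
      (m1 - m0) / sqrt((var(x[trt == 1]) + var(x[trt == 0])) / 2)
    }
    round(c(age = smd(age, trt, w), chf = smd(chf, trt, w)), 3)

In the weighted cohort, the hazard ratio would then be estimated with a weighted Cox model (eg, survival::coxph with a weights argument and a robust variance).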

Step 4: Develop a plan for robustness assessments including deterministic sensitivity analyses, probabilistic sensitivity analyses, and net bias evaluation

Robustness assessments deal with the consistency of evidence with respect to alternative investigator decisions related to study design, measurement, or analysis. As the fourth and final step of study planning, we propose prespecification of robustness assessments. Having assessed precision and conducted diagnostic evaluations, investigators will probably have additional understanding of the potential threats to the study and can make informed judgments about the need for specific robustness evaluations. Such prespecified assessments are most useful if they have a clear rationale regarding the specific types of bias they address. Robustness assessments can be broadly categorized into three types, which are detailed below (fig 4).

Fig 4

Robustness evaluations (step 4 of the process guide for inferential studies using healthcare data from routine clinical practice)

  • Step 4a: Deterministic sensitivity analyses

    Deterministic sensitivity analyses, also known as deterministic quantitative bias analysis, can be viewed as variations of the target trial protocol, where investigators focus on specific design or analytical assumptions and vary them individually to gauge the impact of specific assumptions or design choices on study results. A deterministic sensitivity analysis could focus on highly specific design or measurement choices, such as varying the outcome definition to increase specificity and evaluate the possibility of bias due to outcome misclassification. It could also involve prespecification of alternate statistical analysis methods.

  • Step 4b: Probabilistic sensitivity analyses

    Probabilistic sensitivity analyses, also known as probabilistic quantitative bias analysis, use various probabilistic and simulation approaches to evaluate the impact of various hidden biases on study results, including exposure or outcome misclassification, unmeasured confounders, and selection bias.3552 Monte Carlo simulations evaluating potential bias require realistic ranges for bias parameters, for instance, the sensitivity and specificity of an outcome identifying algorithm based on existing information such as validation studies.53 In those simulations, study results are recalculated for each run and then tabulated to provide empirical estimates of expected variation due to uncertainties in exposure or outcome identification32; a minimal sketch of such a simulation appears after this list. Similar bias modelling approaches are available to evaluate the impact of unmeasured confounders on study results, based on the strength of association between the exposure and the suspected confounder as well as between the outcome and the suspected confounder.35

  • Step 4c: Net bias assessment

    We use the term “net bias assessment” to describe the approaches that allow investigators to detect presence of bias from multiple sources such as uncontrolled confounding, selection bias, and measurement error. We describe two major types of such assessments.

    Firstly, where possible, investigators should a priori identify and include control outcomes or control exposures that are known to have no association (negative controls) or a well established association (positive controls) with either the exposure or outcome of interest. Ideally, these control variables will have a confounding structure or mechanism of measurement error similar to that of the effect targeted for study.5455 Inability to replicate the known effect sizes in these analyses could alert investigators to the presence of bias.

    Secondly, when a well conducted randomized trial of the comparison under investigation exists, albeit with a different primary endpoint or conducted within a more restrictive population, benchmarking or trial calibration might be pursued.5657 If investigators are able to replicate results for the primary outcome of such a trial in their data source by using identical inclusion and exclusion criteria and other design elements, it could increase confidence in results under a modified target trial protocol.
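
To illustrate step 4b, the following R sketch implements a simple Monte Carlo bias analysis for nondifferential outcome misclassification, drawing sensitivity and specificity from plausible ranges and applying the standard count correction in each run. The observed counts and parameter ranges are hypothetical, chosen only to show the mechanics; a full probabilistic bias analysis would typically also resample random error.

    # Minimal sketch (R): probabilistic bias analysis for nondifferential
    # outcome misclassification via Monte Carlo simulation.
    set.seed(7)
    a <- 150; N1 <- 4670   # observed exposed cases / exposed patients (hypothetical)
    b <- 143; N0 <- 4669   # observed comparator cases / comparator patients

    nsim <- 10000
    se <- runif(nsim, 0.75, 0.95)   # assumed plausible sensitivity range
    sp <- runif(nsim, 0.98, 1.00)   # assumed plausible specificity range

    # Correct the misclassified counts in each simulation run
    A <- (a - (1 - sp) * N1) / (se - (1 - sp))
    B <- (b - (1 - sp) * N0) / (se - (1 - sp))
    rr <- (A / N1) / (B / N0)                 # bias corrected risk ratio
    ok <- A > 0 & B > 0 & A < N1 & B < N0     # keep admissible corrections only

    # Tabulate the spread of bias corrected estimates across runs
    round(quantile(rr[ok], c(0.025, 0.5, 0.975)), 2)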

We recommend that investigators add expected precision assessment and diagnostic evaluations along with prespecified robustness assessments as amendments to the registered protocol before moving on to step 5. If assessment of expected precision and diagnostic evaluations, which explicitly do not allow any inferential analyses, lead to any meaningful adaptations in the design or measurement, all such changes should also be documented as amendments to the registered protocol before starting the inferential analyses.

For our case example, we specified a deterministic sensitivity analysis (step 4a) to evaluate the impact of outcome misclassification. We defined the outcome after excluding non-specific codes of balanitis and balanoposthitis in male patients and vaginitis and vulvovaginitis in female patients, focusing solely on candida infections of urogenital sites.

We also specified a quantitative bias analysis (step 4b). To explore the impact of our assumption that HbA1c is not an important confounder, we used HbA1c data in a subset of patients to inform this analysis.58 Information regarding the distribution of HbA1c in our linked subset and the association between the unmeasured confounder (HbA1c) and outcome (infections) based on prior epidemiological research59 were used as inputs to calculate adjusted estimates over a range of bias parameters.
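
A minimal R sketch of this type of external adjustment is given below, using the standard bias factor formula for a binary unmeasured confounder. The observed risk ratio and the range of confounder-outcome risk ratios are hypothetical placeholders, not the inputs or results reported in the appendix.

    # Minimal sketch (R): deterministic external adjustment of a risk ratio for
    # an unmeasured binary confounder.
    adjust_rr <- function(rr_obs, p0, or_ec, rr_cd) {
      odds0 <- p0 / (1 - p0)                        # confounder odds, comparators
      p1    <- or_ec * odds0 / (1 + or_ec * odds0)  # confounder prevalence, exposed
      bias  <- (p1 * (rr_cd - 1) + 1) / (p0 * (rr_cd - 1) + 1)
      rr_obs / bias
    }

    # 14% uncontrolled hyperglycemia (HbA1c >9%) in the reference group, odds
    # ratio of 1.3 for SGLT-2 inhibitor receipt given hyperglycemia, and a
    # hypothetical observed risk ratio of 3.0, over a range of
    # confounder-outcome risk ratios
    sapply(c(1.5, 2, 3, 5), function(rr_cd)
      round(adjust_rr(rr_obs = 3.0, p0 = 0.14, or_ec = 1.3, rr_cd = rr_cd), 2))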

Finally, we specified a net bias analysis (step 4c) by assessing hospital admission for heart failure as a positive control outcome. SGLT-2 inhibitors have an established association with a reduced risk of hospital admission for heart failure, observed consistently across randomized controlled trials including CANVAS, CREDENCE, DAPA-HF, DECLARE-TIMI 58, EMPA-REG OUTCOME, EMPEROR-Reduced, and VERTIS CV.6061 If the set of controlled covariates is sufficient to control confounding (without introducing bias) for both outcomes (genital infection and hospital admission for heart failure), a robust adjusted association between the exposure and the known positive control outcome can provide some reassurance about the observed findings for the genital infection outcome.

Step 5: Inferential analysis

At the end of step 4, all key design elements, measurements, and the data analysis plan are prespecified, and inferential data analysis can proceed. The central idea behind structuring the steps in this sequence, with a clear demarcation between planning and inference, is to avoid design or analysis changes prompted by study results. At the conclusion of the inferential analysis and all prespecified robustness evaluations, investigators are well positioned to make sound inferences about the association under investigation.

For our case example study, results are presented in figure 5, which shows a consistently elevated risk of genital infections after initiating SGLT-2 inhibitors versus DPP-4 inhibitors in patients with diabetes, across all subgroups and all robustness evaluations. Appendix figure 5 summarizes the quantitative bias analysis for uncontrolled confounding by HbA1c over a range of bias parameters, which indicated that the risk of genital infections with SGLT-2 inhibitors remained elevated even in extreme scenarios of uncontrolled confounding. In the net bias analysis, we observed a robust reduction in the risk of the positive control outcome (hospital admission for heart failure), as expected. Overall, results indicating a potentially greater risk of genital infections with SGLT-2 inhibitors are in line with prior observations from trials and observational studies. In a large meta-analysis of eight phase 3 randomized trials, the pooled relative risk for genital infections was reported to be 3.75 (95% confidence interval 3.00 to 4.67).62 A previous analysis of US commercial insurance claims reported about a threefold increased risk of genital infections with SGLT-2 inhibitors versus DPP-4 inhibitors.63

Fig 5

Results from the primary analysis, subgroup analyses, and robustness evaluations for the case example study evaluating the effect of sodium-glucose cotransporter-2 (SGLT-2) inhibitors on genital infections. The quantitative bias analysis (QBA) presents adjusted results at the values of bias parameters observed in ancillary data (14% uncontrolled hyperglycemia, as defined by glycated hemoglobin (HbA1c) >9%, in the reference group, and an odds ratio of 1.3 for receipt of SGLT-2 inhibitor treatment). Appendix figure 5 provides results from this quantitative bias analysis over various combinations of bias parameters

Conclusion

This report introduces a stepwise process that systematically considers key decision nodes for evaluating causal effects of treatments using healthcare data. The process outlined in this framework can facilitate transparent communications between various stakeholders and motivate critical considerations for the clinical research community.

Footnotes

  • Contributors: RJD, SVW, ST, JCN, SS, SD, RB, and GDP have leadership roles in the FDA’s Sentinel initiative, which is the national active postmarketing surveillance system for medical products in the US. All other authors are invited experts from academia or FDA with many years of combined experience in development of methods informing conduct of non-interventional studies. Coauthors from the US Food and Drug Administration (FDA) participated in the results interpretation and in the preparation and decision to submit the manuscript for publication. The authors were brought together as a workgroup supported by the FDA Sentinel Innovation Center. The workgroup held 12 teleconference calls between June 2021 and December 2022, which were attended by authors (RJD, SVW, SKS, LZ, FK-K, JCN, XS, ST, RW, EP, SD, JL, HL, RB, GDP, JBS, SS, KJR, SG, MAH, PJH, and SS) to discuss the process and reach a consensus. RJD, SKS, LZ, and FK-K conducted the data analysis for the case example study. RJD is the guarantor of the content of this article. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: This project was supported by Master Agreement 75F40119D10037 from the FDA. The FDA approved the study protocol used in the illustrative example shown in web appendix 2, including statistical analysis plan and reviewed and approved this manuscript. The FDA had no role in data collection, management, or analysis. The views expressed are those of the authors and not necessarily those of the FDA.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: support from the FDA for the submitted work. RJD reports serving as principal investigator on investigator initiated grants to the Brigham and Women’s Hospital from Novartis, Vertex, and Bayer on unrelated projects. SS is co-principal investigator of an investigator initiated grant to the Brigham and Women’s Hospital from Boehringer Ingelheim unrelated to the topic of this study, and is a consultant to Aetion, a software manufacturer of which he owns equity; his interests were declared, reviewed, and approved by the Brigham and Women’s Hospital and Mass General Brigham HealthCare System in accordance with their institutional compliance policies. RB is an author on US Patent 9 075 796 (on text mining for large medical text datasets and corresponding medical text classification using informative feature selection), which at present is not licensed and does not generate royalties. JCN reports research funding from Moderna for service on their safety monitoring committee.

  • Provenance and peer review: Not commissioned; externally peer reviewed.

References