Decision rules for identifying combination therapies in open-entry, randomized controlled platform trials
Decision rules for identifying combination therapies in open-entry, randomized controlled platform trials

Elias Laurin Meyer (1), Peter Mesenbrink (2), Cornelia Dunger-Baldauf (3), Ekkehard Glimm (3), Yuhan Li (2), and Franz König (1,*), on behalf of the EU-PEARL (EU Patient-cEntric clinicAl tRial pLatforms) Consortium

arXiv:2103.09547v1 [stat.AP] 17 Mar 2021

(1) Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Austria
(2) Novartis Pharmaceuticals Corporation, One Health Plaza, East Hanover, NJ, USA
(3) Novartis Pharma AG, Basel, Switzerland
(*) Correspondence: franz.koenig@meduniwien.ac.at; Tel.: +43-1-40400-74800

Abstract

The design and conduct of platform trials have become increasingly popular for drug development programs, attracting interest from statisticians, clinicians and regulatory agencies. Many statistical questions related to designing platform trials - such as the impact of decision rules, sharing of information across cohorts, and allocation ratios on operating characteristics and error rates - remain unanswered. In many platform trials, the definition of error rates is not straightforward, as classical error rate concepts are not applicable. In particular, the strict control of the family-wise Type I error rate often seems unreasonably rigid. For an open-entry, exploratory platform trial design comparing combination therapies to the respective monotherapies and standard-of-care, we define a set of error rates and operating characteristics and then use these to compare a set of design parameters under a range of simulation assumptions. When setting up the simulations, we aimed for realistic trial trajectories, e.g. in case one compound is found to be superior to standard-of-care, it could become the new standard-of-care in future cohorts. Our results indicate that the method of data sharing, the exact specification of decision rules and the quality of the biomarker used to make interim decisions all strongly contribute to the operating characteristics of the platform trial. Together with the potential flexibility and complexity of a platform trial, which also impact the achieved operating characteristics, this implies that utmost care needs to be given to the evaluation of different assumptions and design parameters at the design stage.

1 Introduction

The goal of testing as many investigational treatments as possible in the shortest possible time, driven both by recent advances in drug discovery and biotechnology and by the ongoing global pandemic due to the SARS-CoV-2 virus [1–3], has made master protocols and especially platform trials an increasingly popular alternative to the time-consuming sequences of classical randomized controlled trials [4–8]. These trial designs allow for the evaluation of one or more investigational treatments in the study population(s) of interest within the same clinical trial, as compared to traditional randomized controlled trials, which usually evaluate only one investigational treatment in one study population. When cohorts share key inclusion/exclusion criteria, trial data can easily be shared across such sub-studies. In practice, setting up a platform trial may require additional time due to operational and statistical challenges. However, simulations have shown that platform trials can be superior to classical trial designs with respect to various operating characteristics, including the overall study duration.
In a setting where only a few new agents are expected to be superior to standard-of-care, Saville and Berry [9] investigated the operating characteristics of adaptive Bayesian platform trials using binary endpoints compared with a sequence of "traditional" trials, i.e. trials testing only one hypothesis, and found that platform trials perform dramatically better in terms of the number
of patients and time required until the first superior experimental treatment has been identified. Using real data from the 2013-2016 Ebola virus disease epidemic in West Africa, Brueckner et al. [10] investigated the operating characteristics of various multi-arm multi-stage and two-arm single-stage designs, as well as group-sequential two-arm designs, and found that designs with frequent interim analyses outperformed single-stage designs with respect to average duration and sample size when fixing the type 1 error and power. When a pool of experimental agents is available that should all be tested against a common control, and progression-free survival is used as the efficacy endpoint, Yuan et al. [11] found that average trial duration and average sample size are drastically reduced when using a multi-arm, Bayesian adaptive platform trial design with response-adaptive randomization compared with traditional two-arm trials evaluating one agent at a time. Hobbs et al. [12] reached a similar conclusion when comparing a platform trial with a binary endpoint and futility monitoring based on Bayesian predictive probabilities with a sequence of two-arm trials. Tang et al. [13] investigated a phase II setting in which several monotherapies are combined with several backbone therapies and tested in a single-arm manner. Assuming different treatment combination effects, they found that their proposed Bayesian platform design with adaptive shrinkage has a lower average sample size in the majority of scenarios investigated (with compound-only effects as the only exception) and always a higher percentage of correct combination selections when compared with a fully Bayesian hierarchical model and a sequence of Simon's two-stage designs. Ventz et al. [14] proposed a frequentist adaptive platform design (the so-called "rolling-arms design") as an alternative to sequences of two-arm designs and Bayesian adaptive platform designs; it is much simpler than the Bayesian adaptive platform designs in that it uses equal allocation ratios and simpler, established decision rules based on group sequential analysis. The authors found that its performance under different treatment effect assumptions and a set of general assumptions was comparable to, if not slightly better than, Bayesian adaptive platform designs and much better than a sequence of traditional two-arm designs in terms of average sample size and study duration. For a comprehensive review of the evolution of master protocol clinical trials and the differentiation between basket, umbrella and platform trials, see Meyer et al. [8].

In this article we explore the impact of decision rules, assumptions about the nature of the treatment effects, and the availability of treatments on certain operating characteristics of an open-entry, cohort platform trial with some common study arms. The article is organized as follows: In Section 2 we describe the trial design under investigation, the different testing strategies as well as the investigated operating characteristics. In Section 3, we discuss the simulation setup, the treatment effect scenarios investigated and the exact decision rules used. In Section 4, we present and discuss the results of the different simulation scenarios. We conclude with a discussion in Section 5.
2 Methods

2.1 Platform Design

We investigated an open-entry, exploratory cohort platform study design with a binary endpoint evaluating the efficacy of a two-compound combination therapy compared to the respective monotherapies and the standard-of-care (SoC). After an initial inclusion of one or more cohorts, we allow new cohorts to enter the platform trial over time until a maximum number of cohorts is reached. Each cohort consists of up to four arms: combination therapy, monotherapy A, monotherapy B and SoC, whereby the SoC arm is optional. Monotherapy A is the same for all cohorts (further referred to as "backbone monotherapy"), while monotherapy B (further referred to as "add-on monotherapy X") is different in every cohort X. The combination of monotherapies A and B is called combination therapy. See Figure 1 for a schematic overview of the proposed trial design. To indicate that the backbone and SoC treatments are the same across cohorts, they are shown in the same colors (grey and green, respectively) in all cohorts. The x-axis shows calendar time. At any point in time, new cohorts can enter the platform trial. We conduct one interim analysis for every cohort (indicated by the vertical yellow line in Figure 1), on the basis of which the cohort might be stopped early for either futility or efficacy. We assume the short-term interim endpoint to be a binary surrogate of the (also binary, but different) final endpoint, whereby the two endpoints are linked via a certain sensitivity and specificity (for more details, see section 4). The platform trial ends if either any of the platform stopping rules is reached (which could be based on either a total sample size or a particular number of successful combination treatments) or if there is no active, recruiting cohort left.
2.2 Testing Strategies and Definition of Trial Success

We ultimately seek regulatory approval of the combination therapy. Following FDA and EMA regulatory guidelines [15, 16], superiority of the combination therapy over both monotherapies and superiority of both monotherapies over SoC needs to be shown. Depending on the level of prior study information available, i.e. whether or not the superiority of the monotherapies over SoC has already been shown, we distinguish between three testing strategies. In the first testing strategy, we assume superiority over SoC has been shown for both monotherapies; therefore we are only interested in testing the combination therapy against both monotherapies. In the second testing strategy, we assume superiority of the backbone monotherapy over SoC has been shown, but not of the add-on monotherapy. Therefore, compared to the first testing strategy, we additionally test the add-on monotherapy against SoC, resulting in three comparisons. In the third and last testing strategy, we do not assume any superiority has been shown yet. Therefore, compared to the second testing strategy, we additionally test the backbone monotherapy against SoC, resulting in four comparisons. The third testing strategy is therefore the most rigorous interpretation of the current guidelines. See Figure 1 for a schematic overview of the proposed testing strategies.

Related to the regulatory guidelines mentioned above, if we were running a single, independent trial investigating a combination therapy under any of the above-mentioned testing strategies, we would consider the trial a success if all of the necessary pair-wise comparisons were successful. Consequently, we would consider it a failure if at least one of the necessary pair-wise comparisons was unsuccessful. For clearer understanding, we call these two options a positive or negative outcome, respectively. Depending on the formulated hypotheses (which might be specified via a target product profile), these might be either true or false positives or negatives. As an example, if the target product profile requires the response rate of drug x to be better than the response rate of drug y by a margin of ζ, then if and only if the response rate of drug x is truly better than the response rate of drug y by a margin of at least ζ do we consider the alternative hypothesis to hold, thereby making the resulting decision either a true positive or a false negative. Conversely, if the response rate of drug x is not better than the response rate of drug y by a margin of at least ζ, we consider the null hypothesis to hold, thereby making the resulting decision either a false positive or a true negative (please note that usually ζ = 0). In appendix B, more light is shed on the relationship between ζ and the Bayesian decision rules used in section 3.3. Only if all required pair-wise comparisons of a given testing strategy are met (true or false positives) has efficacy of the combination therapy in this cohort been demonstrated, and we declare the cohort successful. If all required pair-wise comparisons are true positives, the cohort success is a true positive, otherwise a false positive. Analogously, the final outcome of a cohort can be a true or false negative. While for a single, independent trial this would yield a single outcome (true positive, false positive, true negative, false negative), for a platform trial with multiple cohorts this yields a vector of such outcomes, one for each investigated cohort.
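To make the mapping from pair-wise comparisons to a single cohort outcome concrete, the following minimal R sketch classifies a cohort given the truth and the decision for each required comparison. The function and argument names are ours for illustration and are not part of the CohortPlat package.

```r
# Classify a cohort outcome from its pair-wise comparisons (sketch).
# `truth`    : logical vector, TRUE if the alternative truly holds per comparison
# `decision` : logical vector, TRUE if superiority was declared per comparison
classify_cohort <- function(truth, decision) {
  positive <- all(decision)  # cohort is declared successful only if all
                             # required comparisons were successful
  if (positive) {
    if (all(truth)) "true positive" else "false positive"
  } else {
    if (all(truth)) "false negative" else "true negative"
  }
}

# Example: combination beats both monotherapies, but the add-on monotherapy is
# in truth not superior to SoC -- a declared success is then a false positive.
classify_cohort(truth = c(TRUE, TRUE, FALSE), decision = c(TRUE, TRUE, TRUE))
```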
[Figure 1 omitted: schematic with one panel per cohort (Cohorts 1-3), each showing combination, add-on monotherapy, backbone monotherapy and Standard of Care arms over calendar time, with the interim analysis (using the surrogate) and the final analysis marked, and the three testing strategies indicated.]

Figure 1: Schematic overview of the proposed platform trial design. New cohorts consisting of a combination therapy arm, a monotherapy arm using the same compound in every cohort (referred to as "backbone monotherapy"), an add-on monotherapy arm which is different in every cohort and an optional SoC arm are entering the platform over time. While the add-on monotherapy and therefore the combination therapy is different in every cohort (as indicated by the differently shaded colors), the backbone monotherapy and optional SoC are the same in every cohort (as indicated by the same colors). Each cohort has an interim analysis after about half of the initially planned sample size, after which the cohort can be stopped for early efficacy or futility. We allowed three different testing strategies, which differ by the number of monotherapies tested against SoC (none, add-on monotherapy only or both monotherapies).
[Figure 2 omitted: schematic contrasting the four types of data sharing ("cohort", "all", "concurrent", "dynamic") across Cohorts 1-3, with the corresponding allocation ratios 1:1:1:1 and k:k:1:1.]

Figure 2: Schematic overview of the different levels of sharing. No sharing happens if only "cohort" data are used. If sharing "all" data, whenever in any cohort an interim or final analysis is performed, all SoC and backbone monotherapy data available from all cohorts are used. If sharing only "concurrent" data, whenever in any cohort an interim or final analysis is performed, all SoC and backbone monotherapy data that were collected during the active enrollment time of the cohort under investigation are used. If sharing "dynamically", whenever in any cohort an interim or final analysis is performed, the degree of data sharing of SoC and backbone monotherapy from other cohorts increases with the homogeneity of the treatment efficacy of the respective arms. A solid fill represents using data 1-to-1, while a dashed fill represents using discounted data (for more information see appendix A). If at any given time there are k active cohorts, the allocation ratio is 1 : 1 : 1(: 1) in case of no data sharing and k : k : 1(: 1) otherwise (combination : add-on monotherapy : backbone monotherapy : SoC). This allocation ratio is updated for all active cohorts every time the number of active cohorts k changes due to dropping or adding a new cohort. The last number in brackets represents the optional SoC arm.
2.3 Decision Rules and Data Sharing

As discussed in section 2.2, for every testing strategy a certain number of comparisons is conducted. In this paper, for every one of these comparisons (e.g. combination therapy versus add-on monotherapy), we consider simple Bayesian rules based on the posterior distributions of the response rates of the respective study arms [17]. In principle, any other Bayesian (e.g. predictive probabilities, hierarchical models, etc.) or frequentist rules (based on p-values, point estimates or confidence intervals) could be used. While these decision rules are based on fundamentally different paradigms, they might translate into exactly the same stopping rules, e.g. with respect to the observed responder rate [18, 19]. The decision rules based on posterior distributions used vague, independent Beta(1/2, 1/2) priors for all simulation results presented in this paper; however, please note that independence of the vague priors is a strong assumption and its violation can lead to selection bias, as pointed out by many authors [20–22].

Generally, we allow decision rules for declaring efficacy and decision rules for declaring futility. In order to declare efficacy, all efficacy decision rules must be simultaneously fulfilled. In order to declare futility, it is enough if any of the futility decision rules is fulfilled. It is possible that neither the futility nor the efficacy thresholds are satisfied. In case this happens at the final analysis, we declare the combination therapy unsuccessful, but due to not reaching the superiority criterion at the maximum sample size, and not due to reaching the futility criterion. While this is only a technical difference, this information should be available when conducting the simulations, as it might also influence the future of a drug development program.

With the flexible decision rules, we can specify whether we want to share information on the backbone monotherapy and SoC arms across the study cohorts. Several different methods have been proposed to facilitate adequate borrowing of non-concurrent data (which can be internal or external to the trial) [23–27]. For simulation purposes we allow four options, all applying to both SoC and backbone: 1) no sharing, using only data from the current cohort (see first row of Figure 2), 2) full sharing of all available data, i.e. using all data 1-to-1 (see second row of Figure 2), 3) sharing of concurrent data only, i.e. using concurrent data 1-to-1 (see third row of Figure 2), and 4) a dynamic borrowing approach further described in the appendix, in which the degree of shared data increases with the homogeneity of the treatment efficacy, i.e. discounting the data of other cohorts less if the treatment efficacy is similar (see fourth row of Figure 2). For a detailed formulation of the exact decision rules used, see section 3.3.
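As an illustration of such a posterior-probability rule, the following R sketch approximates P(π_y > π_x + δ | Data) by Monte Carlo under the independent Beta(1/2, 1/2) priors described above. The function name and interface are ours, not those of the CohortPlat package.

```r
# Posterior probability that arm y's response rate exceeds arm x's by at
# least delta, under independent Beta(0.5, 0.5) priors (Monte Carlo sketch).
post_prob_sup <- function(resp_y, n_y, resp_x, n_x, delta = 0, n_draws = 1e5) {
  draws_y <- rbeta(n_draws, 0.5 + resp_y, 0.5 + n_y - resp_y)  # posterior of pi_y
  draws_x <- rbeta(n_draws, 0.5 + resp_x, 0.5 + n_x - resp_x)  # posterior of pi_x
  mean(draws_y > draws_x + delta)
}

# Example: 35/100 responders on the combination vs 20/100 on a monotherapy,
# requiring superiority by a margin of 10 percentage points.
post_prob_sup(35, 100, 20, 100, delta = 0.10)
```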
2.4 Allocation Ratios

Whenever a new cohort enters the platform trial, a key design element is the allocation ratio for the newly added arms (combination therapy, add-on monotherapy, backbone monotherapy and SoC) and whether the allocation ratio of the already ongoing cohorts should be changed as well, e.g. randomizing fewer patients to backbone monotherapy and SoC in case these data are shared across cohorts. The platform trial advances dynamically and as a result its structure can follow many different trajectories (in some settings we are unsure how many cohorts will enter the platform, how many of them will run concurrently, whether the generated data will stem from the same underlying distributions and should therefore be shared, etc.). As the best possible compromise under uncertainty, we aimed to achieve a balanced randomization for every comparison in case of either no data sharing or sharing only concurrent data. Depending on the type of data sharing and the number of active arms (SoC is optional for every cohort), this means either a balanced randomization ratio within each cohort in case of no data sharing (i.e. 1:1:1(:1), combination : add-on mono (monotherapy B) : backbone mono (monotherapy A) : SoC; the last number in brackets represents the optional SoC arm), or a randomization ratio that allocates more patients to the combination and add-on monotherapy arms for every additional active cohort in case of using only concurrent data. As an example, if at any point in time k cohorts are active at the same time and we share only concurrent data, the randomization ratio is k : k : 1(: 1) in all cohorts, which ensures an equal number of patients per arm for every comparison. This allocation ratio is updated for all active cohorts every time the number of active cohorts k changes due to dropping or adding a new cohort. As an example, assume 30:30:30:30 patients have been enrolled in cohort 1 before a second cohort is added. Then, until e.g. an interim analysis is performed in cohort 1, both cohorts will have an allocation ratio of 2:2:1:1. If the interim analysis is scheduled after 200 patients per cohort, this would mean another 20:20:10:10 patients need to be enrolled in cohorts 1 and 2, since we can use the 10 concurrently enrolled backbone monotherapy and 10 concurrently enrolled SoC patients in cohort 2 for the interim analysis in cohort 1, leading to balanced 50:50:50:50 patients for the comparisons. In case of either full sharing or dynamic borrowing, we use the same approach as when using concurrent data only.
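A minimal sketch of this updating rule, with function and argument names of our own choosing:

```r
# Allocation ratio (combination : add-on mono : backbone mono : SoC) for each
# active cohort, given the number of active cohorts k and the sharing option.
allocation_ratio <- function(k,
                             sharing = c("cohort", "concurrent", "all", "dynamic"),
                             has_soc = TRUE) {
  sharing <- match.arg(sharing)
  # 1:1:1 within the cohort if nothing is shared, k:k:1 otherwise, so that
  # every comparison ends up balanced once shared data are counted in.
  ratio <- if (sharing == "cohort") c(1, 1, 1) else c(k, k, 1)
  if (has_soc) ratio <- c(ratio, 1)  # optional SoC arm
  ratio
}

allocation_ratio(k = 2, sharing = "concurrent")  # 2 2 1 1, as in the example above
```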
2.5 Operating Characteristics

In order to evaluate different trial designs, operating characteristics need to be chosen that take into account the special features of the trial design, but are at the same time interpretable in the classical context of hypothesis testing. For platform trials, the choice of operating characteristics is not obvious [28–32]. Furthermore, for the particular trial design under consideration, many operating characteristics based on the pair-wise comparisons of the different monotherapies, SoC and combination therapy could be considered, which would be complicated further by the flexible options regarding the inclusion of a SoC arm in the cohorts. As the ultimate goal would be to receive approval for the combination treatment, we decided to formulate the operating characteristics on the cohort level (i.e. one true/false positive/negative decision per cohort, as explained in section 2.2). An overview of all operating characteristics used in this article and their definitions can be found in table 1. Specifically, we focus on the following error rates:

• Per-cohort type 1 error (the probability of a false positive decision for any new cohort entering the trial; PCT1ER)
• Per-cohort power (the probability of a true positive decision for any new cohort entering the trial; PCP)
• Family-wise type 1 error rate (probability of at least one false positive decision in the platform trial, i.e. across all cohorts), both corrected for platform trials without true null hypotheses (i.e. trials in which all monotherapies are superior to SoC and all combination therapies are superior to the respective monotherapies, assessed by a certain target product profile) (FWER) and uncorrected (FWER BA, "Bayesian Average")
• Disjunctive power (probability of at least one true positive decision in the platform trial), both corrected for platform trials without true alternative hypotheses (Disj Power) and uncorrected (Disj Power BA, "Bayesian Average")
• False-discovery rate (probability that a positive is a false positive; FDR)

Table 1: Operating characteristics used in this paper and their definitions

• Avg Pat: Average number of patients per platform trial
• Avg Suc Hist: Average number of responders to treatment per platform trial
• Avg Cohorts: Average number of cohorts per platform trial
• Avg Pat per Cohort: Average number of patients per cohort in a platform trial (Avg Pat divided by Avg Cohorts)
• Avg Perc Pats Sup SoC: Average percentage of patients on arms superior to SoC per platform trial
• Avg Pat SoC First Suc: Average number of patients on SoC arms until the first cohort was declared successful
• Avg Coh First Suc: Average number of cohorts until the first cohort was declared successful
• FDR: "False Discovery Rate", the ratio of the sum of false positives (i.e. for how many cohorts, which are in truth futile according to the defined target product profile, did the decision rules lead to a declaration of superiority) among the sum of all positives (i.e. for how many cohorts did the decision rules lead to a declaration of superiority) across all trial simulations
• PCP: "Per-Cohort-Power", the ratio of the sum of true positives (i.e. for how many cohorts, which are in truth superior according to the defined target product profile, did the decision rules lead to a declaration of superiority) among the sum of all cohorts which are in truth superior (i.e. the sum of true positives and false negatives) across all platform trial simulations; this is a measure of how wasteful the trial is with (in truth) superior therapies
• PCT1ER: "Per-Cohort-Type-1-Error", the ratio of the sum of false positives (i.e. for how many cohorts, which are in truth futile according to the defined target product profile, did the decision rules lead to a declaration of superiority) among the sum of all cohorts which are in truth futile (i.e. the sum of false positives and true negatives) across all platform trial simulations; this is a measure of how sensitive the trial is in detecting futile therapies
• FWER: The proportion of platform trials in which at least one false positive decision has been made, where only such trials are considered which contain at least one cohort that is in truth futile. Formal definition: (1/|I_0|) ∗ Σ_{i ∈ I_0} 1{FP_i > 0}, where I_0 = {i ∈ {1, ..., iter} : n_i^{H0} > 0}, iter is the number of platform trial simulation iterations, FP_i denotes the number of false-positive decisions in simulated platform trial i and n_i^{H0} is the number of inefficacious cohorts in platform trial i.
• FWER BA: The proportion of platform trials in which at least one false positive decision has been made, regardless of whether or not any cohorts which are in truth futile exist in these trials. Formal definition: (1/iter) ∗ Σ_{i=1}^{iter} 1{FP_i > 0}, where iter is the number of platform trial simulation iterations and FP_i denotes the number of false-positive decisions in simulated platform trial i. This will differ from FWER in scenarios where - due to a prior on the treatment effect - in some simulation runs there are by chance no inefficacious cohorts in the platform trial (see section 3.2 for more details on the different treatment efficacy scenarios).
• Disj Power: The proportion of platform trials in which at least one correct positive decision has been made, where only such trials are considered which contain at least one cohort that is in truth superior. Formal definition: (1/|I_1|) ∗ Σ_{i ∈ I_1} 1{TP_i > 0}, where I_1 = {i ∈ {1, ..., iter} : n_i^{H1} > 0}, iter is the number of platform trial simulation iterations, TP_i denotes the number of true-positive decisions in simulated platform trial i and n_i^{H1} is the number of efficacious cohorts in platform trial i.
• Disj Power BA: The proportion of platform trials in which at least one correct positive decision has been made, regardless of whether or not any cohorts which are in truth superior exist in these trials. Formal definition: (1/iter) ∗ Σ_{i=1}^{iter} 1{TP_i > 0}, where iter is the number of platform trial simulation iterations and TP_i denotes the number of true-positive decisions in simulated platform trial i. This will differ from Disj Power in scenarios where - due to a prior on the treatment effect - in some simulation runs there are by chance no efficacious cohorts in the platform trial (see section 3.2 for more details on the different treatment efficacy scenarios).
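The following R sketch shows how these quantities could be computed from per-trial simulation summaries; the data frame layout and column names are assumptions of ours, not the paper's code.

```r
# sims: one row per simulated platform trial, with the number of true and false
# positive decisions (TP, FP) and the number of truly efficacious/inefficacious
# cohorts (n_H1, n_H0) in that trial.
operating_characteristics <- function(sims) {
  I0 <- sims$n_H0 > 0  # trials containing at least one truly futile cohort
  I1 <- sims$n_H1 > 0  # trials containing at least one truly superior cohort
  list(
    PCT1ER        = sum(sims$FP) / sum(sims$n_H0),   # FP / (FP + TN)
    PCP           = sum(sims$TP) / sum(sims$n_H1),   # TP / (TP + FN)
    FDR           = sum(sims$FP) / sum(sims$TP + sims$FP),
    FWER          = mean(sims$FP[I0] > 0),           # corrected
    FWER_BA       = mean(sims$FP > 0),               # "Bayesian Average"
    Disj_Power    = mean(sims$TP[I1] > 0),
    Disj_Power_BA = mean(sims$TP > 0)
  )
}
```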
3 Simulations

3.1 Design Parameters

To reduce simulation complexity, we fixed some values for all simulations, such as the allocation ratios (see section 2.4), the target product profile (see section 2.2), the interim sample size (half of the - varying - final sample size) and the lag of new cohorts entering the platform (no lag). Please note that these are active simulation parameters that could be changed. We furthermore allow the platform trial to include additional cohorts up to a certain maximum number of cohorts even after other combination therapies have been declared successful. The platform trial therefore ends if at any time all currently enrolled cohorts have finished collecting the response on the primary endpoint, independent of how many cohorts have finished. We wanted to evaluate the impact on the operating characteristics of different types of data sharing, treatment effect assumptions, cohort inclusion probabilities, sample sizes, maximum numbers of cohorts, and the sensitivity and specificity of the biomarker used at interim in predicting the final endpoint. We furthermore allowed several scenarios regarding the optional SoC arm: 1) all cohorts include a SoC arm ("all SoC"), 2) no cohort includes a SoC arm ("no SoC"), 3) no further SoC arms are included once the backbone monotherapy has been found to be superior to SoC, in which case the backbone monotherapy can be seen as the new standard-of-care ("stop post back"), 4) no further SoC arms are included once any monotherapy has been found to be superior to SoC ("stop post mono"). Please note that not including a SoC arm results in the allocated cohort sample size being split among the remaining three arms, thereby effectively increasing the sample size on the active arms within a cohort. For a detailed overview of the general simulation assumptions, as well as all possible simulation parameters for the R software package and Shiny App, see the R package CohortPlat vignette on GitHub. The R software package can be downloaded from GitHub or CRAN.
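As an illustration of the open-entry mechanism whose inclusion probability is varied here, the following R sketch opens new cohorts at random during enrollment. The per-patient inclusion probability of 3% and the maximum of 7 cohorts are the default values used in section 4; the function itself is our own simplification, not CohortPlat code.

```r
# Open-entry sketch: after every enrolled patient, a new cohort opens with
# probability p_new, until max_cohorts cohorts have entered the platform.
simulate_cohort_entry <- function(n_patients, p_new = 0.03, max_cohorts = 7) {
  entry_times <- 0  # first cohort opens at trial start
  for (t in seq_len(n_patients)) {
    if (length(entry_times) < max_cohorts && runif(1) < p_new) {
      entry_times <- c(entry_times, t)  # a new cohort opens after patient t
    }
  }
  entry_times  # patient counts at which cohorts entered the platform
}

set.seed(1)
simulate_cohort_entry(500)
```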
3.2 Treatment Efficacy Assumptions

We investigated sixteen different settings with respect to the treatment efficacy assumptions for the combination arm, the monotherapy arms and the SoC arm. Two settings (settings 1 and 2) characterize a global null hypothesis, six settings (settings 3-8) characterize an efficacious backbone monotherapy with varying degrees of add-on mono and combination therapy efficacy, four settings (settings 9-12) characterize an efficacious backbone with varying degrees of random add-on mono and combination therapy efficacy, two settings (settings 13-14) characterize either the global null hypothesis or efficacious mono and combination therapies, but with an underlying time-trend, and two settings (settings 15-16) were run as sensitivity analyses with increased standard-of-care response rates. In the simulations conducted, we only investigated treatment effect assumptions based on risk-ratios, whereby we randomly and separately draw the risk-ratio for each of the monotherapies with respect to the SoC treatment. For the combination treatment, we randomly draw from a range of interaction effects, which could result in additive, synergistic or antagonistic effects of a specified magnitude. Some scenarios might be more realistic for a given drug development program than others; however, we felt that the broad range of scenarios allows investigating the impact and interaction of the various simulation parameters and assumptions on the operating characteristics.

Let π_x denote the probability of a patient on therapy x having a successful (binary) treatment outcome, i.e. the response rate, and let T_x denote a discrete random variable. In detail, every time a new cohort enters the platform, we first assign the SoC response rate:

π_SoC ∈ [0, 1]

Then we assign the treatment effect in terms of a risk-ratio for the backbone monotherapy (monotherapy A), which is the same across all cohorts:

π_MonoA = π_SoC ∗ γ_MonoA, γ_MonoA ∼ T_MonoA

Then we randomly draw the treatment effect in terms of a risk-ratio for the add-on monotherapy (monotherapy B):
π_MonoB = π_SoC ∗ γ_MonoB, γ_MonoB ∼ T_MonoB

Finally, after knowing the treatment effects of both monotherapies, we randomly draw an interaction effect for the combination treatment:

π_Combo = π_SoC ∗ (γ_MonoA ∗ γ_MonoB) ∗ γ_Combo, γ_Combo ∼ T_Combo

Depending on the scenario, the distribution functions can have all the probability mass on one value, i.e. the assignment of treatment effects and risk-ratios is not necessarily random. Please further note that while the treatment effects were specified in terms of risks and risk-ratios, the Bayesian decision rules were specified in terms of response rates. The different treatment efficacy settings are summarized in table 2.

Due to the way the response rates are randomly assigned for every new cohort entering the trial, depending on the chosen priors for the treatment effects, it can happen that only inefficacious treatments were selected in one simulation iteration and only efficacious treatments were selected in another simulation iteration. To make sure that the operating characteristics are still meaningful, we differentiate between counting all simulation iterations (which implicitly takes the prior on the treatment effect into account; "BA" operating characteristics, FWER BA and Disj Power BA) or only those simulation iterations where a false decision could have been made towards the type 1 error rate and power (FWER and Disj Power). For more information see the formal definitions of the operating characteristics in table 1.
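A small R sketch of this generation step, parameterized as in setting 10 of table 2; the function name and structure are ours.

```r
# Draw the response rates for one newly entering cohort, following the
# risk-ratio construction above; the discrete priors mimic setting 10.
draw_cohort_rates <- function(pi_soc = 0.10) {
  gamma_mono_a <- 2                                                  # backbone: fixed RR of 2
  gamma_mono_b <- sample(c(1, 2), 1, prob = c(0.8, 0.2))             # add-on efficacious w.p. 0.2
  gamma_combo  <- sample(c(0.5, 1, 1.5), 1, prob = c(0.1, 0.8, 0.1)) # interaction: 10:80:10
  c(SoC   = pi_soc,
    MonoA = pi_soc * gamma_mono_a,
    MonoB = pi_soc * gamma_mono_b,
    Combo = pi_soc * gamma_mono_a * gamma_mono_b * gamma_combo)
}

set.seed(42)
draw_cohort_rates()
```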
Table 2: Overview of the different treatment effect assumptions. The priors T_MonoA, T_MonoB and T_Combo for γ_MonoA, γ_MonoB and γ_Combo as described in section 3.2 are all pointwise with a support of 1, 2 or 3 different points and result in effective response rates π_SoC, π_MonoA, π_MonoB and π_Combo. Each setting is given as π_SoC / π_MonoA / π_MonoB / π_Combo, with the risk-ratios γ in parentheses.

• Setting 1: 0.10 / 0.10 (1) / 0.10 (1) / 0.10 (1). Global null hypothesis.
• Setting 2: 0.20 / 0.20 (1) / 0.20 (1) / 0.20 (1). Global null hypothesis with higher response rates.
• Setting 3: 0.10 / 0.20 (2) / 0.10 (1) / 0.20 (1). Backbone monotherapy superior to SoC, but add-on monotherapy not superior to SoC and combination therapy not better than backbone monotherapy.
• Setting 4: 0.10 / 0.20 (2) / 0.10 (1) / 0.30 (1.5). Backbone monotherapy superior to SoC and combination therapy superior to backbone monotherapy, but add-on monotherapy not superior to SoC.
• Setting 5: 0.10 / 0.20 (2) / 0.10 (1) / 0.40 (2). Backbone monotherapy superior to SoC and combination therapy superior to backbone monotherapy (increased combination treatment effect compared to setting 4), but add-on monotherapy not superior to SoC.
• Setting 6: 0.10 / 0.20 (2) / 0.20 (2) / 0.20 (0.5). Both monotherapies superior to SoC, but combination therapy not better than the monotherapies.
• Setting 7: 0.10 / 0.20 (2) / 0.20 (2) / 0.30 (0.75). Both monotherapies superior to SoC and combination therapy better than the monotherapies.
• Setting 8: 0.10 / 0.20 (2) / 0.20 (2) / 0.40 (1). Both monotherapies superior to SoC and combination therapy superior to the monotherapies (increased combination treatment effect compared to setting 7).
• Setting 9: 0.10 / 0.20 (2) / 0.10 (1) with p = 0.5 or 0.20 (2) with p = 0.5 / 0.20 (1) if γ_MonoB = 1, or 0.40 (1) if γ_MonoB = 2. Backbone monotherapy superior to SoC, add-on monotherapy has a 50:50 chance to be superior to SoC; in case the add-on monotherapy is not superior to SoC, the combination therapy is as effective as the backbone monotherapy, otherwise the combination therapy is significantly better than the monotherapies.
• Setting 10: 0.10 / 0.20 (2) / 0.10 (1) with p = 0.8 or 0.20 (2) with p = 0.2 / 0.20∗γ_MonoB∗0.5 (0.5) with p = 0.1, 0.20∗γ_MonoB∗1 (1) with p = 0.8, or 0.20∗γ_MonoB∗1.5 (1.5) with p = 0.1. Backbone monotherapy superior to SoC, add-on monotherapy superior to SoC with probability 0.2; the combination therapy interaction effect can be antagonistic/non-existent, additive or synergistic (10:80:10).
• Setting 11: 0.10 / 0.20 (2) / 0.10 (1) with p = 0.5 or 0.20 (2) with p = 0.5 / 0.20∗γ_MonoB∗0.5 (0.5) with p = 0.1, 0.20∗γ_MonoB∗1 (1) with p = 0.8, or 0.20∗γ_MonoB∗1.5 (1.5) with p = 0.1. Backbone monotherapy superior to SoC, add-on monotherapy superior to SoC with probability 0.5; the combination therapy interaction effect can be antagonistic/non-existent, additive or synergistic (10:80:10).
• Setting 12: 0.10 / 0.20 (2) / 0.10 (1) with p = 0.2 or 0.20 (2) with p = 0.8 / 0.20∗γ_MonoB∗0.5 (0.5) with p = 0.1, 0.20∗γ_MonoB∗1 (1) with p = 0.8, or 0.20∗γ_MonoB∗1.5 (1.5) with p = 0.1. Backbone monotherapy superior to SoC, add-on monotherapy superior to SoC with probability 0.8; the combination therapy interaction effect can be antagonistic/non-existent, additive or synergistic (10:80:10).
• Setting 13: 0.10 + 0.03∗(c−1) / 0.10 + 0.03∗(c−1) (1) / 0.10 + 0.03∗(c−1) (1) / 0.10 + 0.03∗(c−1) (1). Time-trend null scenario; every new cohort (first cohort c = 1, second cohort c = 2, ...) has a SoC response rate that is 3 percentage points higher than that of the previous cohort.
• Setting 14: 0.10 + 0.03∗(c−1) / 0.20 + 0.03∗(c−1) (2) / 0.20 + 0.03∗(c−1) (2) / 0.40 + 0.03∗(c−1) (1). Time-trend scenario, whereby the monotherapies are superior to SoC and the combination therapy is superior to the monotherapies; every new cohort (first cohort c = 1, second cohort c = 2, ...) has a SoC response rate that is 3 percentage points higher than that of the previous cohort.
• Setting 15: 0.20 / 0.30 (1.5) / 0.30 (1.5) / 0.40 (8/9). Analogous to setting 7, but the SoC response rate is 20%.
• Setting 16: 0.20 / 0.30 (1.5) / 0.30 (1.5) / 0.50 (10/9). Analogous to setting 8, but the SoC response rate is 20%.
3.3 Decision Rules

The exact set of decision rules and the chosen testing strategy are linked, e.g. if no SoC arm exists, the decision rules involving SoC are ignored in the sense that any decision will be based only on the comparisons of the combination therapy and the monotherapies. This means that for cohorts with four arms, we use testing strategy 3, and for cohorts with three arms, we use testing strategy 1 (see section 2.2). For all response rates, we assume a vague Beta(0.5, 0.5) prior. We included two types of decision rules, which we label strict decision rules and relaxed decision rules. Please note that these decision rules were chosen globally for all possible combinations of different simulation parameters and treatment efficacy settings. It should be obvious that this will yield dramatically different power and type 1 errors across these simulations. It was our primary goal to investigate the relative impact of the simulation parameters and treatment efficacy settings on the operating characteristics. Of course, for any given combination of simulation parameters and treatment efficacy assumptions, the decision rules could be adapted to e.g. achieve a per-cohort power of 80%. We will investigate the impact of the decision rules in more detail in section 4.6.2. In the simulation study, we differentiate three types of decisions for cohorts: "GO" (graduate combination therapy, i.e. declare the combination therapy successful; either at interim or final), "STOP" (stop evaluation of the combination therapy and do not graduate, i.e. declare the combination therapy unsuccessful; either at interim or final) and "CONTINUE" (continue evaluation of the combination therapy; at interim).

3.3.1 Strict Decision Rules

Decision rules which use δ > 0 for comparing the rates, such as y ≥ x + δ at the final analysis, were labelled strict decision rules. Here we use the following strict decision rules at the final analysis:

GO, if (P(π_Combo > π_MonoA + 0.10 | Data) > 0.8) ∧ (P(π_Combo > π_MonoB + 0.10 | Data) > 0.8) ∧ (P(π_MonoA > π_SoC + 0.05 | Data) > 0.8) ∧ (P(π_MonoB > π_SoC + 0.05 | Data) > 0.8)
STOP, otherwise

where P(π_ArmY > π_ArmX | Data) denotes the posterior probability of the comparison of arm Y with arm X in a cohort. Furthermore, for each cohort an interim analysis using more liberal decision rules is performed to decide whether to continue or to stop the cohort early for efficacy/futility:

GO, if (P(π_Combo > π_MonoA + 0.05 | Data) > 0.8) ∧ (P(π_Combo > π_MonoB + 0.05 | Data) > 0.8) ∧ (P(π_MonoA > π_SoC | Data) > 0.8) ∧ (P(π_MonoB > π_SoC | Data) > 0.8)
STOP, if (P(π_Combo > π_MonoA | Data) < 0.6) ∧ (P(π_Combo > π_MonoB | Data) < 0.6) ∧ (P(π_MonoA > π_SoC | Data) < 0.6) ∧ (P(π_MonoB > π_SoC | Data) < 0.6)
CONTINUE, otherwise
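A sketch of how the strict final-analysis rule can be evaluated, reusing the post_prob_sup() helper from the sketch in section 2.3; all names are ours, and the responder counts below are made up for illustration.

```r
# Strict GO/STOP decision at the final analysis for one cohort.
# resp / n: named vectors of responders and patients per arm.
strict_final_decision <- function(resp, n) {
  go <-
    post_prob_sup(resp["Combo"], n["Combo"], resp["MonoA"], n["MonoA"], 0.10) > 0.8 &&
    post_prob_sup(resp["Combo"], n["Combo"], resp["MonoB"], n["MonoB"], 0.10) > 0.8 &&
    post_prob_sup(resp["MonoA"], n["MonoA"], resp["SoC"],   n["SoC"],   0.05) > 0.8 &&
    post_prob_sup(resp["MonoB"], n["MonoB"], resp["SoC"],   n["SoC"],   0.05) > 0.8
  if (go) "GO" else "STOP"  # all four comparisons must succeed simultaneously
}

strict_final_decision(resp = c(Combo = 45, MonoA = 22, MonoB = 24, SoC = 10),
                      n    = c(Combo = 100, MonoA = 100, MonoB = 100, SoC = 100))
```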
3.3.2 Relaxed Decision Rules

Decision rules where all δ = 0 at the final analysis were labelled relaxed decision rules (such decision rules correspond to decision making with frequentist superiority tests). To compensate for such relaxed boundaries, one might require a larger γ as threshold for some of the posterior probabilities of the required comparisons (P(y ≥ x + δ | Data) > γ). Please note that non-inferiority decision rules with δ < 0, or - as δ can be different for the different comparisons - mixed decision rules, where e.g. the comparison between combination and monotherapies requires superiority but the comparison between monotherapies and SoC requires only non-inferiority, are also possible. For the simulations we use the following relaxed decision rules:

GO, if (P(π_Combo > π_MonoA | Data) > 0.9) ∧ (P(π_Combo > π_MonoB | Data) > 0.9) ∧ (P(π_MonoA > π_SoC | Data) > 0.8) ∧ (P(π_MonoB > π_SoC | Data) > 0.8)
STOP, otherwise

At interim, we use the following decision rules:

GO, if (P(π_Combo > π_MonoA | Data) > 0.8) ∧ (P(π_Combo > π_MonoB | Data) > 0.8) ∧ (P(π_MonoA > π_SoC | Data) > 0.7) ∧ (P(π_MonoB > π_SoC | Data) > 0.7)
STOP, if (P(π_Combo > π_MonoA | Data) < 0.5) ∧ (P(π_Combo > π_MonoB | Data) < 0.5) ∧ (P(π_MonoA > π_SoC | Data) < 0.5) ∧ (P(π_MonoB > π_SoC | Data) < 0.5)
CONTINUE, otherwise
4 Results

As described in section 3, some parameters were kept constant in the simulations and a range of other parameters was varied. For every scenario, 10,000 platform trials were simulated. In this section, we present a selection of all simulations that were conducted, illustrating the most pertinent features. Unless otherwise specified, we set the maximum number of cohorts per platform trial to 7, set the probability of starting a new cohort after every patient to 3%, always include a SoC arm in the cohorts, use the strict decision rules and set the sensitivity and specificity of the final outcome in predicting the interim outcome to 90% (i.e., if y_ij are the interim (j = 0) and final (j = 1) responses of patient i, then P(y_i0 = 1 | y_i1 = 1) = P(y_i0 = 0 | y_i1 = 0) = 0.9). Conceptually it would make more sense to specify the sensitivity and specificity of the interim outcome in predicting the final outcome (i.e. P(y_i1 = 1 | y_i0 = 1) and P(y_i1 = 0 | y_i0 = 0)); however, from a probabilistic perspective this would put constraints on the final response rate with respect to the chosen sensitivity and specificity (otherwise the probability of observing an interim outcome could be smaller than 0), making e.g. a final response rate of 0.1 together with a sensitivity of 90% and a specificity of 85% impossible. Since we believe the main quantity of interest is the final response rate, and we wanted to avoid having to double check for every combination of final response rate and sensitivity/specificity whether it is feasible, we chose to specify the sensitivity and specificity of the final outcome in predicting the interim outcome. To be specific, when the final response rate is x and the sensitivity and specificity of the final outcome in predicting the interim outcome are set to se and sp, respectively, the joint probability distribution for the interim and final event is as follows:

P(y_i1 = 1, y_i0 = 1) = se ∗ x
P(y_i1 = 1, y_i0 = 0) = (1 − se) ∗ x
P(y_i1 = 0, y_i0 = 1) = (1 − sp) ∗ (1 − x)
P(y_i1 = 0, y_i0 = 0) = sp ∗ (1 − x)
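The following R sketch draws correlated interim and final binary outcomes from exactly this joint distribution (the function name is ours):

```r
# Simulate interim (y0) and final (y1) binary outcomes for n patients, given
# the final response rate x and the sensitivity/specificity (se, sp) of the
# final outcome in predicting the interim outcome.
simulate_endpoints <- function(n, x, se = 0.9, sp = 0.9) {
  y1 <- rbinom(n, 1, x)                            # final outcome
  # interim outcome: 1 with probability se if y1 = 1, and 1 - sp if y1 = 0
  y0 <- rbinom(n, 1, ifelse(y1 == 1, se, 1 - sp))
  data.frame(interim = y0, final = y1)
}

set.seed(7)
colMeans(simulate_endpoints(1e5, x = 0.3))  # interim rate ~ se*x + (1-sp)*(1-x)
```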
For a complete overview of all simulation results, we developed a Shiny app facilitating self-exploration of all of our simulation results. The purpose of the R Shiny app is to quickly inspect and visualize all simulation results that were computed for this paper. We uploaded the R Shiny app, alongside all of the simulation results used in this paper, to our server. Visualizations are based on the looplot package [33], which implements the visualization presented by Rücker and Schwarzer [34]. Furthermore, a table of all simulation results can be found in the supplements.

4.1 Global null hypothesis

We investigated two global null hypotheses, (i) with the response rates of all arms equal to 0.10 (setting 1) and (ii) with the response rates of all arms equal to 0.20 (setting 2). Results are presented in figure 3. As in these settings no true positive or false negative decision can be made, only PCT1ER and FWER are presented in terms of error rates. We furthermore show the average sample size of the entire platform trial. As expected, the average number of patients increases with the planned final sample size per cohort and decreases when using the strict decision rules (as more ineffective cohorts are stopped at interim for futility); however, no major differences between settings or types of data sharing can be observed. For both the strict and relaxed decision rules, the PCT1ER generally decreases with increasing cohort sample size (as would be expected). For the strict decision rules, the FWER mostly decreases with increasing cohort sample size even though the average number of cohorts increases (e.g. when using "cohort" data sharing and superiority decision rules, the average number of cohorts is around 5 when the final cohort sample size is 100 and close to 7 when the final cohort sample size is 400). However, in some cases for the strict decision rules, and in most cases when using the relaxed decision rules, the PCT1ER decreases so slowly with increasing cohort sample size that the FWER nevertheless increases, since with increasing cohort sample size the probability of including an additional cohort increases (e.g. when, due to an increase in cohort sample size, the PCT1ER decreases from 0.011 to 0.01 while the average number of cohorts increases from 5 to 6, the FWER will increase from 1 − 0.989^5 = 0.0538 to 1 − 0.99^6 = 0.0585). This phenomenon occurs for both sets of decision rules, but since the PCT1ER is generally higher for the relaxed decision rules, it is more visible in that case. In a few cases, we even observed a slight increase in PCT1ER with increasing cohort sample size, but additional simulations revealed that this was just due to random variation in the simulation runs: as we are dealing with rare events (e.g. a FWER of 0.0026 translates to at least one false positive decision in 26 out of 10,000 platform trials), we believe this is well within the expected simulation error. In terms of data sharing, the effects were as expected, with the PCT1ER and consequently the FWER decreasing with increasing level of data sharing (since all cohorts shared the same underlying truth). Interestingly, in most cases when performing dynamic data sharing, the error rates were lower compared to always using all backbone monotherapy and SoC data, which could be a result of discounting extreme highs or lows in the backbone monotherapy and SoC arms, as well as a reflection of the importance of the prior on the degree of borrowing (see appendix A).

4.2 Efficacious backbone monotherapy and unclear add-on monotherapy efficacy

For settings 3-8 we assumed the backbone monotherapy to be efficacious (response rate of 20% throughout, compared to SoC with a response rate of 10%). The add-on monotherapy and the combination therapy treatment effects are varied independently of each other (add-on monotherapy either 10% or 20% response rate and combination therapy response rate varying from 20% to 40%). Figure 4 shows the results. For cases where the add-on monotherapy response rate is 0.1 and/or the combination therapy response rate is 0.2, only type 1 error related error rates are shown, as there exist no correct positive and false negative decisions under these
circumstances. Similarly, for the rest of the scenarios, only power related operating characteristics are shown. In terms of the average total number of patients in the platform trial, we observe that pooling all data leads to a consistently higher number of patients than not sharing any data, which is due to a higher percentage of early stopping (for futility or efficacy) when pooling. In settings 4 and 5, the per-cohort type 1 error and the FWER increase with increasing cohort sample size. This is most likely due to the combined decision rule, where apparently the probability to declare the combination (correctly) superior to the monotherapies increases faster than the probability to declare the add-on monotherapy (correctly) not superior to SoC. This is due to our definition of the error rates, whereby any incorrect decision in the individual comparisons leads to an overall incorrect decision, even though 3/4 decisions might be correct. In settings 7 and 8, we observe an increase in per-cohort power and disjunctive power when the cohort sample size increases. Both are maximized when pooling all data and minimized when sharing no data. In setting 3, although not visible in the figure, it becomes more apparent that the PCT1ER slightly decreases with increasing cohort sample size, although not fast enough to stop the FWER from increasing as the average number of cohorts increases. This is consistent with our conjecture that the type 1 error rates in settings 4 and 5 increase due to the combined decision rule.

4.3 Efficacious backbone monotherapy and random add-on monotherapy efficacy

In setting 9, the backbone monotherapy (response rate 20%) is superior to SoC (response rate 10%) and the add-on monotherapy efficacy is random, with 50% probability to be as efficacious as the backbone monotherapy (response rate 20%) and 50% probability to be inefficacious (response rate 10%). Building on top of the monotherapies, the combination interaction effect is additive, meaning the combination therapy is superior to both monotherapies if the add-on monotherapy is efficacious (response rate 40%) and not superior to the backbone monotherapy otherwise (response rate 20%). The results are presented separately for power and type 1 error related error rates in figures 5 and 6. In terms of total platform sample sizes, we see a slightly different pattern for a bad (sensitivity and specificity of 0.65) and a good (sensitivity and specificity of 0.90) biomarker at interim. With one exception (using no SoC arm and having a good biomarker), the total platform sample sizes increase with the level of data sharing. This difference is less pronounced when the biomarker is good. In terms of error rates, as expected, using a bad biomarker drastically increases the FDR and decreases the power. For both good and bad biomarkers, both type 1 errors and power increase as we increase the probability of dropping SoC mid-trial or do not include SoC at all. These examples also show a clear difference between FWER and FWER BA, as well as between Disj Power and Disj Power BA. Please remember from table 1 that error rates without the suffix "BA" are defined as proportions of platform trials where the hypothesis of interest was true for at least one cohort; e.g. when computing the FWER, we divide the number of platform trials with at least one false positive decision by the number of platform trials where at least one cohort was in truth inefficacious.
In the "BA" error rates, we disregard whether or not the hypothesis of interest was true for at least one cohort and always divide by the total number of platform trials simulated. This can be interpreted as taking the prior distribution on the effectiveness of the treatments into account. The non-"BA" error rates are always larger than their "BA" counterparts, as we would expect, since the denominator is smaller.

4.4 Efficacious backbone monotherapy and random add-on mono and combination therapy efficacy

For settings 10-12, the backbone monotherapy (response rate 20%) is superior to SoC (response rate 10%) and the add-on monotherapy efficacy is random, with probability D to be as efficacious as the backbone monotherapy (response rate 20%) and probability 1 − D to be inefficacious (response rate 10%), where D varies from 0.2 in setting 10 to 0.8 in setting 12. Building on top of the monotherapies, the combination interaction effect is random (see table 2). The results are presented separately for power and type 1 error related error rates in the supplements in figures 10 and 11. For these settings, the results are comparable to those of setting 9. Additionally, it becomes apparent that the confidence in the efficacy of the add-on monotherapy has a noticeable impact on the error rates: the power increases and the type 1 error decreases as the confidence in the add-on monotherapy's efficacy increases.
4.5 Time Trend

Finally, we looked at two situations where a time trend in response rates exists that affects all treatment arms (e.g. improvement of standard-of-care). Firstly, we consider a global null hypothesis where all response rates are 10% + (c−1)∗3%, whereby c denotes the cohort number in the trial, meaning that within the first cohort all treatment arms have a response rate of 10%, in the second cohort all treatments have a response rate of 13%, etc. Secondly, we assume a true alternative hypothesis where the SoC response rates are 10% + (c−1)∗3%, the monotherapy response rates are 20% + (c−1)∗3% and the combination therapy response rates are 40% + (c−1)∗3%. These two situations correspond to treatment efficacy settings 13 and 14. Results are presented in the supplements in figure 12. For setting 13, only type 1 error related operating characteristics are shown (as there are no true positive decisions), while for setting 14 only power related operating characteristics are shown (as there are no false positive decisions). In terms of the total number of patients in the platform trial, in setting 13 this number is slightly elevated when sharing more data and visibly elevated when not including a SoC arm. In setting 14, it depends on the trial structure with respect to the optional SoC arm; when no SoC arm is used, data sharing leads to slightly decreased sample sizes. When no SoC arm is used, the probability to declare the combination therapy successful is higher, leading to increased type 1 error and power compared to designs where a SoC arm is used. We observe consistently higher PCT1ER, PCP, FWER and disjunctive power when sharing more data. This shows that borrowing from other sources (e.g. from other cohorts, both concurrent and non-concurrent) negatively affects the type 1 error. It is well known that when borrowing data, simultaneous type 1 error control and increased power is not achievable [35]. For the following comparison we fix a final sample size of 400, pooling of all data and always including a SoC arm. When comparing setting 13 to setting 1 (which is setting 13's equivalent without time trend), the average number of patients (1508 vs 1473) and the per-cohort type 1 error rates (0.0015 vs 0.00016) are increased when there is a time trend. When comparing setting 14 to setting 8 (which is setting 14's equivalent without time trend), the average number of patients (1951 vs 1990) and the per-cohort power (0.428 vs 0.496) are decreased when there is a time trend.

4.6 Sensitivity Analyses

4.6.1 SoC Response Rate

Compared to setting 8, in setting 16 we increased the SoC response rate from 10% to 20% but kept the incremental increases in response rate the same (i.e. 10 percentage points for the monotherapies and 30 percentage points for the combination therapy). Results for the power are presented in the supplements in figure 13. While all the relative influences of the level of data sharing, the decision rules used, etc. appear unchanged, we observed consistently slightly lower power for setting 16 compared to setting 8, which could be due to the increased variance of the observed response rates in setting 16 compared to setting 8. We observed no major differences in the total platform sample sizes.

4.6.2 Impact of Decision Rules

We further investigated the impact of the parameters required for the Bayesian GO decision rules on the error rates. Every individual comparison includes a superiority margin δ and a required confidence level γ (e.g. P(π_Combo > π_MonoA + δ | Data) > γ).
In the decision rules in section 3.3.1, δ ∈ {0, 0.05, 0.10} and γ = 0.80. We varied γ from 0.65 to 0.95 and applied a multiplicative factor δ_mult to each δ, whereby we varied δ_mult from 0 to 2. As treatment effect scenarios, setting 1 was chosen for the null scenario and setting 8 for the alternative (see table 2). The impact on the error rates PCT1ER, PCP, FWER and disjunctive power is presented in figure 7. While the decision rules chosen in section 3.3.1 might appear overly conservative with respect to the FWER, figure 7a reveals that for final cohort sample sizes of both 100 and 400, the FWER can increase beyond 0.05 and even 0.10 when more lenient decision rules are chosen. Similarly, figure 7b reveals that when choosing more lenient decision rules and increasing the final cohort sample size, a PCP of close to 0.80 is attainable. Such contour plots help to fine-tune the Bayesian decision rules in order to achieve the desired operating characteristics. For γ, one would usually expect values equal to or larger than 80%. The choice of δ also
relies on clinical judgment. Generally, as a result of the combined decision rule, operating characteristics are rather conservative.
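To illustrate how these two tuning parameters interact, the following R sketch evaluates the four-comparison GO rule over a grid of γ and δ_mult for a single, made-up final dataset, reusing post_prob_sup() from the sketch in section 2.3. This illustrates the mechanics only; it does not reproduce the operating-characteristic grids in figure 7, which require full platform trial simulations.

```r
# Evaluate the strict GO rule over a (gamma, delta_mult) grid for fixed data.
resp <- c(Combo = 40, MonoA = 22, MonoB = 21, SoC = 10)  # made-up responders
n <- 100                                                 # patients per arm
grid <- expand.grid(gamma = seq(0.65, 0.95, by = 0.05),
                    delta_mult = seq(0, 2, by = 0.25))
grid$go <- apply(grid, 1, function(g) {
  d <- g["delta_mult"] * c(0.10, 0.10, 0.05, 0.05)       # scaled strict margins
  all(post_prob_sup(resp["Combo"], n, resp["MonoA"], n, d[1]) > g["gamma"],
      post_prob_sup(resp["Combo"], n, resp["MonoB"], n, d[2]) > g["gamma"],
      post_prob_sup(resp["MonoA"], n, resp["SoC"],   n, d[3]) > g["gamma"],
      post_prob_sup(resp["MonoB"], n, resp["SoC"],   n, d[4]) > g["gamma"])
})
head(grid)  # GO becomes rarer as gamma and delta_mult grow
```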
[Figure 3 omitted.]

Figure 3: Operating characteristics (OC) for treatment efficacy settings 1 and 2 with respect to planned final sample size (x-axis), type of data sharing (columns), null setting (rows) and set of decision rules (linetype).