Come Sit with Me for a While

Over the last two months, I’ve helped at some of KP Washington’s weekend COVID-19 vaccination clinics.  After a year of pandemic disruption and worry, giving vaccine shots is really a joy.  I’ll definitely remember the day we vaccinated over 1,100 schoolteachers.

A few weeks ago, I was assigned to the post-vaccination observation area.  Watching for allergic reactions is not as gratifying as giving shots.  But spending several hours watching over the recently vaccinated did give me time to think about that type of work.

Fortunately, we didn’t see any severe or dangerous reactions to the vaccine.  Several people did feel lightheaded or dizzy after vaccination, but those reactions were more ordinary anxiety than allergy.  My job was to respond quickly to any serious reaction and not overreact to reactions that weren’t dangerous.  Epi-pens, IVs, and stretchers were just out of sight; we’d rather not bring them out unnecessarily.  Those interventions are pretty intrusive – and alarming to everyone else watching.

Looking out over the post-vaccine observation area, I was reminded of my days as a lifeguard at the city pool during high school summers.  That lifeguard job was actually pretty similar to post-vaccine observation.  We often watched over several hundred people in an afternoon.  My job was to react quickly to any real danger while not diving in to rescue anyone who didn’t need that kind of help.  Grabbing someone to drag them to the side of the pool is also pretty intrusive – and alarming to everyone watching.

And that got me thinking about my current day job in mental health care – especially our work regarding outreach and population-based care.  There as well, we hope to respond quickly to people with urgent needs while avoiding any over-reaction that’s intrusive or alarming.  The help we offer shouldn’t be overly restrictive or overly medicalized.

But how can we know who needs more urgent – or even more intrusive – help?  I decided that my post-vaccine observation work could borrow a practice from my lifeguard days.  Back then, whenever I fished someone out of the deep end, I’d ask them to sit next to me on the lifeguard bench for a while.  And we’d talk a bit.  Sitting and talking together was both diagnosis (Is this swimmer ready to go back in the water?) and treatment (“I think you’ll be fine – now enjoy the rest of your day”).  So I tried that out in the vaccination clinic.  When one of our vaccinated felt dizzy or faint, I’d ask them to come sit next to me so I could take their hand to feel their pulse.  We would make small talk so I could see their face and hear their voice.  And after a while I could say “I think you’ll be fine – now enjoy the rest of your day.”

Greg Simon

The Cautionary Tales of Scott Atlas and Dr. Oz

Three events in the last month got me thinking about the role that clinicians and researchers (like me) play in the popular media.

A group of Stanford faculty recently published an opinion piece in JAMA criticizing Scott Atlas for spreading misinformation and recommending misguided policies regarding the COVID-19 pandemic.  They argued (and I agree) that Atlas, a non-practicing neuroradiologist, misused the credibility and authority attached to his medical training.  That criticism of Atlas cited the American Medical Association’s Code of Medical Ethics Opinion that physicians “…should ensure that the medical information they provide is…commensurate with their medical expertise and based on valid scientific evidence.”

Around the same time, I did an interview for a local TV station regarding effects of the pandemic on mental health and risk of suicide.  I talked about all the ways that the pandemic and related social and economic disruption could increase risk for mental health problems and suicidal behavior.  And, because the interviewer asked, I gave my advice about self-care: keeping a regular sleep schedule, establishing a new daily routine, finding new ways to create or maintain social connection, avoiding over-use of alcohol or cannabis.  But I was careful to be honest about the limits of current knowledge and my expertise.  We don’t yet know if or how the pandemic affected rates of suicidal behavior.  Emphasizing the limits of our knowledge probably reduces the chances that my face will appear on television, and that’s fine with me.

My advice about self-care during the pandemic was reasonably commensurate with my medical expertise.  I am a practicing mental health clinician rather than a non-practicing neuroradiologist.  So I know something about self-management of common mental health problems.  But I have no greater expertise about coping with the pandemic than most mental health clinicians.  And my expertise about coping with widespread fear and loss was mostly common sense rather than new knowledge gained through my research.

It’s not wrong to promote or reinforce common-sense messages about self-care.  Even if I have no special expertise, my status as a clinician and researcher may lend those useful messages more credibility.  But I should be careful not to claim special authority or credibility that I don’t actually own.  Media appearances are just a tool to support public health messages rather than ends in themselves.  Aiming to become an “influencer” can lead down a slippery slope away from medical expertise and valid scientific evidence.

And I must be especially careful about confusing medical or scientific expertise with my personal values.  All researchers and clinicians are certainly entitled to hold and express personal values.  Scott Atlas is entitled to his values, and I am entitled to mine.  And I suspect our two sets of values have just about zero overlap.  In my opinion, Atlas repeatedly failed to separate his values (personal “liberty” above all else) from his medical expertise or scientific evidence.  Having seen that, I try to be more careful about the distinction.  I practice starting statements about my values with “I believe that…” and statements about research (mine or others’) with “We have evidence that…”.  My values are central to my motivation, but they are not evidence.

I was thinking about these issues when I read that Dr. Oz will appear as a guest host on Jeopardy.  Before Scott Atlas grabbed the spotlight, Dr. Oz was the cautionary tale of a physician-turned-influencer dispensing medical advice not commensurate with their medical expertise.  And he has his own history of spreading COVID-19 misinformation.  In the latest move down the slippery slope, Dr. Oz has slid from a surgeon hawking questionable weight loss treatments to “I’ll take Snake Oil Before and After for $200” on the Jeopardy soundstage.  Note to self: If I’m ever invited to host a TV game show, then I’ll know that I’ve slid down much too far.

Greg Simon

Did Suicide Deaths Increase or Decrease During the Pandemic? – Another Embarrassing Mistake Avoided

I’ve previously written that healthcare research needs a Journal of Silly Mistakes We Almost Made.  Until that journal is established, I’ll have to share my examples with readers of this blog.  It’s a small audience, but I like to think it’s a discerning one.  Here’s the latest example for that hypothetical journal:

The COVID-19 pandemic has been called the “perfect storm” of increased risk for suicide.  Concerns about increasing suicide mortality have even entered the political debate about the negative effects of restrictions on businesses, public gatherings, and in-person schooling.  But that discussion has been a largely data-free zone.  Official US national statistics on suicide deaths in 2020 will not be released until the fall of 2021.

We hoped that interim mortality data from state departments of health might allow an early look at the effects of the pandemic on suicide mortality.  Our first look at Washington state data for January through June of 2020 seemed to indicate a decrease in suicide deaths during April and May.  But the number of suicide deaths recorded for June was implausibly low, so we thought we should wait for the next quarterly update before trusting data for April and May. 

New state data for deaths recorded through October told a very different story.  After including the additional suicide deaths in the new data, the appearance of decreased suicide mortality between April and June almost completely disappeared.  And the new data for deaths officially recorded during the fall added significantly to suicide deaths occurring as far back as February and March.  So we suspect that suicide death statistics for April, May, and June may still not be complete. 
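To make that completeness problem concrete, here’s a toy illustration – with invented numbers, not actual Washington data – of how comparing two data snapshots can flag months whose counts are probably still incomplete:

```python
# Invented monthly suicide-death counts from two hypothetical data pulls:
# a provisional pull in July and a revised pull in October.
july_pull = {"Feb": 80, "Mar": 84, "Apr": 70, "May": 68, "Jun": 40}
october_pull = {"Feb": 88, "Mar": 92, "Apr": 82, "May": 81, "Jun": 78}

# Fractional growth in each month's count between the two pulls; months
# that grew by more than 10% are flagged as probably still incomplete.
revision = {
    month: (october_pull[month] - july_pull[month]) / july_pull[month]
    for month in july_pull
}
still_incomplete = [m for m, pct in revision.items() if pct > 0.10]
```

In this made-up example, the February count grew only modestly between pulls, while the June count nearly doubled – a clear warning not to trust the most recent months yet.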

Using the most recent data, we can probably say that there was no dramatic decrease in suicide mortality in Washington during the first three months of the pandemic.  But we can’t yet rule out the possibility of a significant increase. 

Although the most recent state mortality data added only a few deaths overall during the first quarter of 2020, most of those delayed reports were suicide deaths.  It makes sense that suicide deaths would be classified and recorded more slowly than “medically attended” deaths occurring in a hospital.  We’re now looking at delays in reporting of suicide deaths across MHRN health systems.  It’s possible that the delays we’ve seen are unique to Washington state or unique to the chaos of 2020.  We’ll only know that if we look at trends in multiple states, checking carefully for signs that data are not complete.  I expect we’ll soon have a clearer picture about the spring and early summer.  But we may need to wait a few more months to have accurate data about the fall.

Because we’ve learned from previous mistakes we almost made, we didn’t rush to press with the early news that suicide mortality in Washington state appeared to decrease during the COVID-19 pandemic.  Our new message – that we just can’t know the answer yet – will not make any headlines.  Others have used early 2020 mortality records to report that overall suicide mortality decreased during the first months of the pandemic.  I hope they are right. And I certainly hope they were careful.

Greg Simon

Noticing My Tailwind

One of my favorite Seattle bike rides goes south along the shore of Lake Washington to Seward Park and then back north to Capitol Hill.  That route was especially pleasant during the past summer when Lake Washington Boulevard was temporarily closed to cars.

On sunny days when biking is most inviting, the prevailing winds usually blow from the north. That means I’ve got the wind behind me traveling south and against me when I’m heading home. Unless the wind is especially strong, though, I rarely notice the tailwind pushing me south down the lake. I only notice the headwind when I turn around to head back north.

During the past summer, failing to notice the wind behind me became my personal metaphor for privilege. When I’m getting help from my usual tailwind, I still feel like I’m pedaling hard. And I usually am pedaling hard.  My legs get at least some of the credit (maybe even most of the credit) for speeding south down the lake. But the wind behind me is certainly an advantage, whether I notice it or not. It’s easy not to notice the help I’ve been getting until it disappears.

Coincidentally, the prevailing wind direction mirrors the social geography of Seattle. People who would hope to move from South Seattle up to Capitol Hill will usually face more of a headwind.

I’m trying to frequently remind myself that the prevailing winds have more often blown in my direction. It’s not as if I’ve never been treated unfairly or faced any headwind. But when that happens, I usually take notice – because it’s not very common. Noticing my tailwind takes a more conscious effort since I’m much more accustomed to a gentle breeze at my back.

Greg Simon

Looking for Equipoise

Controversy regarding potential treatments and vaccines for COVID-19 has brought arguments about randomized trial methods into the public square.  Disagreements about interim analyses and early stopping of vaccine trials have moved from the methods sections of medical journals to the editorial section of the Wall Street Journal.   I never imagined a rowdy public debate about the more sensitive Pocock stopping rule versus the more patient O’Brien-Fleming rule.  A randomized trial nerd like me should be heartened that anyone even cares about the difference.  I imagine it’s how Canadians feel during Winter Olympic years when the rest of the world actually watches curling.  But the debates have grown so heated that choice of a statistical stopping rule has become a test of political allegiance.  And that drove me to some historical reading about the concept of equipoise.

As originally understood, the ethical requirement for equipoise in randomized trials applied to the individual clinician and the individual patient.  Each clinician was expected to place their duty to an individual patient over any regard for public health or scientific knowledge.  By that understanding, a clinician would only recommend or participate in a randomized trial if they had absolutely no preference.  If they preferred one of the treatments being compared, they were obligated to recommend that treatment and recommend against joining a randomized trial.    

In 1987, Benjamin Freedman suggested an alternative concept of “clinical equipoise”, applied at the community level rather than to the individual patient.  A clinician might have a general preference for one treatment over another.  Or the clinician might believe that one treatment would be preferred for a specific patient.  But that clinician might still suggest that their patient participate in a randomized trial if the overall clinical or scientific community was uncertain about the best choice. 

Freedman’s discussion of equipoise offers some useful words for navigating current controversies regarding COVID-19 clinical trials.  He suggested that a clinician could ethically participate in a randomized trial or recommend participation to their patients “if their less-favored treatment is preferred by colleagues whom they consider to be responsible and competent.”  A clinician is not expected to be ignorant or indifferent (“I have no idea what’s best for you.”).  Instead, they are expected to be honest regarding uncertainty (“I believe A is probably better, but many experts I respect believe that B is best.”).  Freedman proposed that a randomized trial evaluating some new treatment is ethically appropriate if success of that treatment “appears to be a reasonable gamble.”  That language accurately communicates both the hope and uncertainty central to randomized trials of new treatments.  It also communicates that odds of success can change with time.  A gamble that appeared reasonable when a randomized trial started may become less reasonable as new evidence – from within the trial or outside of it – accumulates.  Reasonable investigators must periodically reconsider whether the original gamble remains reasonable.  That’s where those stopping rules come in.

The concept of clinical equipoise also helps to frame our MHRN randomized trials of new mental health services or programs.  We are often studying treatments or programs that we have developed or adapted, so we are neither ignorant nor indifferent.  We certainly care about the trial outcome, and we usually hope the new treatment we are testing will be found helpful.  But we do randomized trials because our beliefs and hopes are not evidence.
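For readers curious about those dueling stopping rules, the difference is easy to sketch in a few lines of code.  The constants below are the standard tabulated values for five equally spaced interim looks at an overall two-sided alpha of 0.05 (roughly 2.41 for Pocock and 2.04 for the final O’Brien-Fleming boundary); treat this as an illustrative approximation, not a substitute for proper group-sequential software:

```python
import math

def obrien_fleming_bounds(n_looks, c=2.04):
    """Approximate two-sided O'Brien-Fleming z-score boundaries for equally
    spaced looks: very demanding early, close to conventional at the end."""
    return [c * math.sqrt(n_looks / k) for k in range(1, n_looks + 1)]

def pocock_bounds(n_looks, c=2.41):
    """Approximate Pocock boundaries: the same (stricter than 1.96)
    z-score threshold at every interim look."""
    return [c] * n_looks

obf = obrien_fleming_bounds(5)   # starts near 4.56, ends at 2.04
poc = pocock_bounds(5)           # 2.41 at every look
```

The “patient” O’Brien-Fleming rule demands overwhelming evidence to stop at the first look, while the more “sensitive” Pocock rule will stop earlier on more modest evidence – at the price of a tougher threshold at the final analysis.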

Greg Simon

Judging Which Model Performs Best

Projections of the course of the COVID-19 pandemic prompted vigorous arguments about which disease model performed the best.  As competing models delivered divergent predictions of morbidity and mortality, arguments in the epidemiology and medical twitterverse grew as rowdy as the crowd in that male model walk-off from the movie Zoolander.  Given the rowdy disagreement among experts, how can we evaluate the accuracy of competing models yielding divergent predictions?  With all respect to the memory of David Bowie (the judge of that movie modeling competition), isn’t there some objective way of judging model performance?  I think that question leads to a key distinction between types of mathematical models.

Most models predicting individual-level events (such as our MHRN models predicting suicidal behavior) follow an empirical or inductive approach.  An inductive model begins with “big data”, usually including a large number of events and a large number of potential predictors.  The data then tell us which predictors are useful and how much weight each should be given.  Theory or judgment may be involved in assembling the original data, but the data then make the key decisions.  Regardless of our opinions about what predictors might matter, the data dictate what predictors actually matter.

In contrast, models predicting population-level change (including many competing models of COVID-19 morbidity and mortality) often follow a mechanistic or deductive approach.  A deductive model assumes a mechanism of the underlying process, such as the susceptible-infected-recovered model of infectious disease epidemics.  We then attempt to estimate key rates or probabilities in that mechanistic model, such as the now-famous reproduction number or R0 for COVID-19.  “Mass action” models apply those rates or probabilities to groups, while “agent-based” models apply them to simulated individuals.  In either case, though, an underlying structure or mechanism is assumed.  And key rates or probabilities are usually estimated from multiple sources – usually involving at least some expert opinion or interpretation.
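A deductive model can be tiny and still show why those estimated rates matter so much.  Here’s a minimal mass-action susceptible-infected-recovered sketch, integrated with small Euler steps; the rate values are arbitrary illustrations, not COVID-19 estimates:

```python
def sir(beta, gamma, s0=0.99, i0=0.01, days=200, dt=0.1):
    """Simulate the fractions susceptible (s), infected (i), and recovered (r)
    under the classic mass-action model, where R0 = beta / gamma."""
    s, i, r = s0, i0, 0.0
    for _ in range(int(days / dt)):
        new_infections = beta * s * i * dt   # transmission: contacts of s and i
        new_recoveries = gamma * i * dt      # recovery at constant rate gamma
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
    return s, i, r

mild = sir(beta=0.20, gamma=0.10)    # R0 = 2.0
worse = sir(beta=0.25, gamma=0.10)   # R0 = 2.5
```

Nudging R0 from 2.0 to 2.5 – a seemingly small change in one assumed parameter – substantially shrinks the fraction of the population left uninfected, which is exactly why modest disagreements among experts produce dramatically divergent projections.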

Judging the performance of empirical or inductive prediction models follows a standard path.  At a minimum, we randomly divide our original data into one portion for developing a model and a separate portion for testing or validating it.  Before using a prediction model to inform practice or policy, we would often test how well it travels – by testing or validating it in data from a later time or a different place.  So far, our MHRN models predicting individual-level suicidal behavior have held up well in all of those tests.
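The first step in that standard path is simple enough to sketch directly.  This toy split (plain Python, with invented record IDs) just shows the principle – real validation of models like ours involves much more, including testing in later years and different health systems:

```python
import random

def split_rows(rows, dev_fraction=0.7, seed=29):
    """Randomly divide records into a development set (for fitting the model)
    and a held-out validation set (for an honest check of performance)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split reproducible
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(1000))                 # stand-ins for patient-level records
develop, validate = split_rows(records)
```

Because the validation records never touch the model during development, performance measured there isn’t flattered by overfitting.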

That empirical validation process is usually neither feasible nor reasonable with mechanistic or deductive models, especially in the case of an emerging pandemic.  If our observations are countries rather than people, we lack the sample size to divide the world into a model development sample and a validation sample.  And it makes no sense to validate COVID-19 predictions based on how well they travel across time or place.  We already know that key factors driving the pandemic vary widely over time and place.  We could wait until the fall to see which COVID-19 model made the best predictions for the summer, but that answer would arrive too late to be useful.  Because we lack the data to judge model performance, the competition can resemble a beauty contest.

I am usually skeptical of mechanistic or deductive models.  Assumed mechanisms are often too simple.  In April, some reassuring models used Farr’s Law to predict that COVID-19 would disappear as quickly as it erupted.  Unfortunately, COVID-19 didn’t follow that law.  Even when presumed mechanisms are correct, estimates of key rates or probabilities in mechanistic models often depend on expert opinion rather than data.  Small differences in those estimates can lead to marked differences in final results.  In predicting the future of the COVID-19 pandemic, small differences in expectations regarding reproduction number or case fatality rate lead to dramatic differences in expected morbidity and mortality.  When we have the necessary data, I’d rather remove mechanistic assumptions and expert opinion estimates from the equation.

But we sometimes lack the data necessary to develop empirical or inductive models – especially when predicting the future of an evolving epidemic.  So we will have to live with uncertainty – and with vigorous arguments about the performance of competing models.  Rather than trying to judge which COVID-19 model performs best, I’ll stick to what I know I need to do:  avoid crowds (especially indoors), wash my hands, and wear my mask!

Greg Simon

Incredible research is sometimes not credible

The controversy around COVID-19 research from the Surgisphere database certainly got my attention.  Not because I have any expertise regarding treatments for COVID-19.  But because I wondered how that controversy would affect trust in research like ours using “big data” from health records.

In case you didn’t follow the story, here’s my brief summary: Surgisphere is a data analytics company that claims to gather health records data from 169 hospitals across the world.  In early May, Surgisphere researchers and academic collaborators published a paper in NEJM showing that drugs used to treat cardiovascular disease did not increase severity of COVID-19.  That drew some attention.  In late May, a second paper in Lancet reported that hydroxychloroquine treatment of COVID-19 had no benefit and could increase risk of abnormal heart rhythms and death.  Given the scientific/political controversy around hydroxychloroquine, that finding drew lots of attention.  Intense scrutiny across the academic internet and twitterverse revealed that many of the numbers in both the Lancet paper and the earlier NEJM paper didn’t add up.  After several days of controversy and “expressions of concern”, Surgisphere’s academic collaborators retracted the article, reporting that they were not able to access the original data to examine discrepancies and replicate the main results.

Subsequent discussions questioned why journal reviewers and editors didn’t recognize the signs that Surgisphere data were not credible.  Spotting the discrepancies, however, would have required detailed knowledge regarding the original health system records and the processes for translating those records data into research-ready datasets.  I suspect most reviewers and editors rely more on the reputation of the research team than on detailed inspection of the technical processes.

Our MHRN research also gathers records data from hundreds – or even thousands – of facilities across 15 health systems.  We double-check the quality of those data constantly, looking for oddities that might indicate some technical problem in health system databases or some error in our processes.  We can’t expect that journal reviewers and editors will triple-check all of our double-checking. 

I hope MHRN has established a reputation for trustworthiness, but I’m ambivalent about the scientific community relying too much on individual or institutional reputation.  I do believe that the quality of MHRN’s prior work is relevant when evaluating both the integrity of MHRN data and the capability of MHRN researchers.  Past performance does give some indication of future performance.  But relying on reputation to assess new research will lead to the same voices being amplified.  And those loudest voices will tend to be older white men (insert my picture here).

I’d prefer to rely on transparency rather than reputation, but there are limits to what mental health researchers can share.  Our MHRN projects certainly share detailed information about the health systems that contribute data to our research.  The methods we use to process original health system records into research-ready databases are well-documented and widely imitated.  Our individual research projects publish the detailed computer code that connects those research databases to our published results.  But we often cannot share patient-level data for others to replicate our findings.  Our research nearly always considers sensitive topics like mental health diagnoses, substance use, and suicidal behavior.  The more we learn about risk of re-identifying individual patients, the more careful we get about sharing sensitive data.  At some point, people who rely on our research findings will need to trust the veracity of our data and the integrity of our analyses.

We can, however, be transparent regarding shortcomings of our data and mistakes in our interpretation.  I can remember some of our own “incredible” research that could have turned into Surgisphere-style embarrassments.  For example: We found an alarmingly high rate of death by suicide in the first few weeks after people lost health insurance coverage.  But then we discovered we were using a database that registered months of complete insurance coverage – so people were counted as not insured during the month they died.  I’m grateful we didn’t publish that dramatic news.  Another example: We found a disturbingly high number of people seen in emergency department visits for a suicide attempt had another visit with a diagnosis of self-harm during the next few days.  That was alarming and disturbing.  But then we discovered that the bad news was actually good news.  Most of those visits that looked like repeat suicide attempts turned out to be timely follow-up visits after a suicide attempt.  Another embarrassing error avoided.

While we try to publicize those lessons, the audience for those stories is narrow.  There is no Journal of Silly Mistakes We Almost Made, but there ought to be.  If that journal existed, research groups could establish their credibility by publishing detailed accounts of their mistakes and near-mistakes.  As I’ve often said in our research team meetings:  If we’re trying anything new, we are bound to get some things wrong.  Let’s try to find our mistakes before other people point them out to us.

Greg Simon

Delivering on the Real Promise of Virtual Mental Healthcare

Many of our MHRN investigators were early boosters of telehealth and virtual mental health care.  Beginning over 20 years ago, research in our health systems demonstrated the effectiveness and value of telehealth follow-up care for depression and bipolar disorder, telephone cognitive-behavioral or behavioral activation psychotherapy, depression follow-up by online messaging, online peer support for people with bipolar disorder, and telephone or email support for online mindfulness-based psychotherapy.

Even in our MHRN health systems, however, actual uptake of telehealth and virtual care lagged far behind the research.  Reimbursement policies were partly to blame; telephone visits and online messaging were usually not “billable” services.  But economic barriers were only part of the problem.  Even after video visits were permitted as “billable” substitutes for in-person visits, they accounted for only 5 to 10 percent of mental health visits for either psychotherapy or medication management.  Only a few of our members seemed to prefer video visits, and our clinicians didn’t seem to be promoting them.

Then the COVID-19 pandemic changed everything almost overnight.  At Kaiser Permanente Washington, virtual visits accounted for fewer than 10% of all mental health visits during the week of March 9th.  That increased to nearly 60% the following week and over 95% the week after that.  I’ve certainly never seen anything change that quickly in mental health care.  Discussing this with our colleague Rinad Beidas from U Penn, I asked if the implementation science literature recognizes an implementation strategy called “My Hair is on Fire!”  Whatever we call it, it certainly worked in this case.

My anecdotal experience was that the benefits of telehealth or virtual care included the expected and the unexpected.  As expected, even my patients who had been reluctant to schedule video visits enjoyed the convenience of avoiding a trip to our clinic through Seattle traffic (or the way Seattle traffic used to be).  Also as expected, there were no barriers to serving patients in Eastern Washington where psychiatric care is scarce to nonexistent.  It was as if the Cascade mountains had disappeared.  One unexpected bonus was meeting some of the beloved pets I’d only heard about during in-person visits.  And I could sometimes see the signs of those positive activities we mental health clinicians try to encourage, like guitars and artwork hanging on the walls.  It’s good to ask, “Have you been making any art?”.  But it’s even better to ask, “Could you show me some of that art you’ve been making?”

As is often the case in health care, the possible negative consequences of this transformation showed up more in data than in anecdotes.  My clinic hours still seemed busy, but the data showed that our overall number of mental health visits had definitely decreased.  Virtual visits had not replaced all of the in-person visits.  That’s concerning, since we certainly don’t expect that the COVID-19 pandemic has decreased the need for mental health care.  So we’re now digging deeper into those data to understand who might be left behind in the sudden transition to virtual care.  Some of our questions:  Is the decrease in overall visits greater in some racial or ethnic groups?  Are we now seeing fewer people with less severe symptoms or problems – or are the people with more severe problems more likely to be left behind?

As our research group was starting to investigate these questions, I got a message from one of our clinical leaders asking about a new use for our suicide risk prediction models.  They were also thinking about people with the greatest need being left behind.  And they were thinking about remedies.  Before the COVID-19 outbreak, we were testing use of suicide risk prediction scores in our clinics, prompting our clinicians to assess and address risk of self-harm during visits.  But some people at high risk might not be making visits, even telephone or video visits, during these pandemic times.  So our clinical leaders proposed reaching out to our members at highest risk rather than waiting for them to appear (virtually) in our clinics. 

Collaborating with our clinical leaders on this new outreach program prompted me to look again at our series of studies on telehealth and virtual care.  Those studies weren’t about simply offering telehealth as an option.  Persistent outreach was a central element of each of those programs. Telehealth is about more than avoiding Seattle traffic or Coronavirus infection.  Done right, telehealth and virtual care enable a fundamental shift from passive response to active outreach.

Outreach is especially important in these chaotic times.  During March and April, video and telephone visits were an urgent work-around for doing the same work we’ve always done.  Now in May, we’re designing and implementing the new work that these times call for.

Greg Simon

Pragmatic Trials for Common Clinical Questions: Now More than Ever

Adrian Hernandez, Rich Platt, and I recently published a Perspective in the New England Journal of Medicine about the pressing need for pragmatic clinical trials to answer common clinical questions.  We started writing that piece last summer, long before any hint of the COVID-19 pandemic.  But the need for high-quality evidence to address common clinical decisions is now more urgent than we could have imagined.

Leaving aside heated debates regarding the effectiveness of hydroxychloroquine or azithromycin, we can point to other practical questions regarding use of common treatments by people at risk for COVID-19.  Laboratory studies suggest that ibuprofen could increase virus binding sites.  Should we avoid ibuprofen and recommend acetaminophen for anyone with fever and respiratory symptoms?   Acetaminophen toxicity is not benign.  Laboratory studies suggest that ACE inhibitors, among the most common medications for hypertension, could also increase virus binding sites.  Should we recommend against ACE inhibitors for the 20% of older Americans now using them daily?  Stopping or changing medication for hypertension certainly has risks.  Laboratory studies can raise those questions, but we need clinical trials to answer them with any certainty.

Pragmatic or real-world clinical trials are usually the best method for answering those practical questions.  Pragmatic trials are embedded in everyday practice, involve typical patients and typical clinicians, and study the way treatments work under real-world conditions.  Compared to traditional clinical trials, done in specialized research centers under highly controlled conditions, pragmatic trials are both more efficient (we can get answers faster and cheaper) and more generalizable (the answers apply to real-world practice). 

Pragmatic trials are especially helpful when alternative interventions could have different balances of benefits and risks for different people – like ibuprofen and acetaminophen for reducing fever.  Laboratory studies – or even highly controlled traditional clinical trials – can’t sort out how that balance plays out in the real world.

In our perspective piece, we pointed out financial, regulatory, and logistical barriers to faster and more efficient pragmatic clinical trials.  But the most important barrier is cultural.  It’s unsettling to acknowledge our lack of evidence to guide common and consequential clinical decisions.  Clinicians want to inspire hope and confidence.  Patients and families making everyday decisions about healthcare might be dismayed to learn that we lack clear answers to important clinical questions.  We all must do the best we can with whatever evidence we have, but we should certainly not be satisfied with current knowledge.  If we hope to activate our entire health care system to generate better evidence, we’ll probably need to provoke more discomfort with the quality of evidence we have now.

Inadequate evidence can also lead to endless conflict.  My colleague Michael Von Korff used to use the term “German Argument” to describe people preferring to argue about a question when the answer is readily available to those willing to look.  Michael was fully entitled to use that expression, since his last name starts with “Von.”  Germany, however, now stands out for success in mitigating the impact of the COVID-19 pandemic.  Even if Michael’s ethnic joke no longer applies, the practice of arguing rather than examining evidence is widespread.  The best way to end those arguments is to say “I really don’t know the answer.  How could we find out as quickly as possible?” 

Greg Simon

“H1-H0, H1-H0” is a song we seldom hear in the real world

Arne Beck and I were recently revising the description of one of our Mental Health Research Network projects. We really tried to use the traditional scientific format, specifying H1 (our hypothesis) and H0 (the null hypothesis). But our research just didn’t fit into that mold. We eventually gave up and just used a plain-language description of our question: For women at risk for relapse of depression in pregnancy, how do the benefits and costs of a peer coaching program compare to those of coaching from traditional clinicians? 

Our real-world research often doesn’t fit into that H1-H0 format. We aim to answer practical questions of interest to patients, clinicians, and health systems. Those practical questions typically involve estimation (How much?), classification (For whom?), comparison (How much more or less?) or interaction (How much more or less for whom?). None of those are yes/no questions. While we certainly care whether any patterns or differences we observe might be due to chance, a p-value for rejecting a null hypothesis does not answer the practical questions we hope to address.

When we talk about our research with patients, clinicians, and health system leaders, we never hear questions expressed in terms of H1 or H0. I imagine starting a presentation to health system leaders with a slide showing H1 and H0—and then hearing them all break into that song from the Walt Disney classic Snow White. Out in the real world of patients, clinicians, and health system leaders, “H1-H0” most likely stands for that catchy tune sung by the Seven Dwarfs.

As someone who tries to do practical research, my dissatisfaction with the H1-H0 format is about more than just language or appearance. There are concrete reasons why that orientation doesn’t fit with the research our network does.

Pragmatic or real-world research often involves multiple outcomes and competing priorities.  For example, we hope our Suicide Prevention Outreach Trial will show that vigorous outreach reduces risk of suicidal behavior.  But we already know some people will be upset or offended by our outreach.  Our task is to accurately estimate and report both the beneficial effects on risk of suicidal behavior and the proportion of people who object or feel harmed.  We may have opinions about the relative importance of those competing effects, but different people will value those outcomes differently.  It’s not a yes/no question with a single answer.  Our research aims to answer questions about “How much?”  Each user of our research has to consider “How important to me or the people I serve?”

The H1-H0 approach becomes even less helpful as sample sizes grow very large.  For example, our work on prediction of suicide risk used records data for millions of patients. With a sample that large, we can resoundingly reject hundreds of null hypotheses while learning nothing that’s useful to patients or clinicians. In fact, our analytic methods are designed to “shrink” or suppress most of those “statistically significant” findings. We hope to create a useful tool that can reliably and accurately identify people at highest risk of self-harm. For that task, too many “significant” p-values are really just a distraction.
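That point about very large samples is easy to demonstrate with a small simulation.  This is a toy Python sketch with made-up numbers, not data from any of our studies: a difference far too small to matter clinically still produces an extreme “statistically significant” p-value once the sample reaches a million people per group.

```python
# Toy illustration: with n in the millions, a clinically negligible
# difference still yields a tiny p-value.  All numbers are simulated.
import math
import random

random.seed(0)
n = 1_000_000
true_diff = 0.01  # a trivially small effect on a standard-normal scale

group_a = [random.gauss(0.0, 1.0) for _ in range(n)]
group_b = [random.gauss(true_diff, 1.0) for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs, m):
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

ma, mb = mean(group_a), mean(group_b)
se = math.sqrt(variance(group_a, ma) / n + variance(group_b, mb) / n)
z = (mb - ma) / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation

print(f"observed difference: {mb - ma:.4f}")
print(f"z = {z:.2f}, two-sided p = {p:.2e}")
```

The null hypothesis is resoundingly rejected, yet the estimated difference itself (about a hundredth of a standard deviation) is what a patient or clinician would actually need to see to judge whether the finding matters.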

I’m very fond of how the Patient-Centered Outcomes Research Institute describes the central questions of patient-centered research: What are my options? What are the potential benefits and harms of those options? Those patient-centered questions focus on choices that are actually available and the range of outcomes (both positive and negative) that people care about. I think those are the questions our research should aim to answer, and they can’t be reduced to H1-H0 terms.

I am certainly not arguing that real-world research should be less rigorous than traditional hypothesis-testing or “explanatory” research. Pragmatic or real-world researchers are absolutely obligated to clearly specify research questions, outcome measures, and analytic methods—and declare or register those things before the research starts. Presentations of results should stick to what was registered, and include clear presentations of confidence limits or precision. In my book, that’s actually more rigorous than just reporting a p-value for rejecting H0. There’s no conflict between being rigorous and being practical or relevant.

To be fair, I must admit that the Seven Dwarfs song is more real-world than I gave it credit for. We often misremember the lyrics “Hi-Ho, Hi-Ho, it’s off to work we go.” But the Seven Dwarfs were actually singing “Hi-Ho, Hi-Ho, it’s home from work we go.” That’s the sort of person-centered song we might actually hear in the real world.

Greg Simon