If you’re my age, “The Kids Are Alright” names a song by The Who from their 1966 debut album, My Generation. If you’re a generation younger, it names a movie from 2010. If you’re even younger, it names a TV series that debuted just last year. What goes around comes around – especially our worry that the kids are not alright.
That worry often centers on how the kids’ latest entertainment is damaging their mental health. In the 1940s, the villain was comic books. Prominent psychiatrist Fredric Wertham testified to the US Congress that “I think Hitler was a beginner compared to the comic book industry.” In the 1960s and 1970s, it was the kids’ music – especially songs like “My Generation” with lyrics like “I hope I die before I get old.” In the 2010s, our concern is about screen time – especially time kids spend on social media and gaming. Last year, the World Health Organization added Gaming Disorder to the addictions chapter of the International Classification of Diseases.
Evidence regarding screen time and kids’ mental health is mixed – both in quality and conclusions. Some have pointed to population increases in screen time alongside population increases in suicidal behavior, but that population-level correlation may just be an example of the ecological fallacy. Some large community surveys have found an individual-level correlation between self-reported screen time and poorer mental health. But that correlation could mean that increased screen time is a cause of poorer mental health, a consequence of it, or just a coincidence. A team of Oxford researchers used data from actual logs of screen time and used more sophisticated analytic methods to account for other differences between heavier and lighter device users. Those stronger methods found minimal or no correlation between screen time and teens’ mental health.
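The ecological fallacy in that first argument is easy to demonstrate with a toy calculation. The numbers below are invented purely for illustration: within each of two hypothetical cohorts, screen time and a distress score are completely uncorrelated, yet pooling the cohorts produces a strong population-level correlation driven only by differences between the groups.

```python
# Toy illustration of the ecological fallacy (all numbers hypothetical).

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Cohort A: low screen time, low distress; no within-cohort relationship.
a_screen, a_distress = [1, 2, 3, 4], [3, 1, 4, 2]
# Cohort B: high screen time, high distress; again no within-cohort relationship.
b_screen, b_distress = [7, 8, 9, 10], [9, 7, 10, 8]

print(round(pearson(a_screen, a_distress), 2))  # 0.0 within cohort A
print(round(pearson(b_screen, b_distress), 2))  # 0.0 within cohort B
# Pooled across cohorts, the correlation looks strong:
print(round(pearson(a_screen + b_screen, a_distress + b_distress), 2))  # 0.88
```

The pooled correlation of 0.88 tells us only that the two groups differ on both variables; it says nothing about whether screen time and distress are related within any individual's experience.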
A definitive answer would require a true experiment – randomly assigning thousands of kids to higher or lower levels of screen time. That will certainly never happen. If we are left with non-experimental or observational research, then we should ask a series of questions: Did the researchers declare or register their questions and analyses in advance? How representative were those included in the research? How accurately were screen time and mental health outcomes measured? How well could the researchers measure and account for confounders (other differences between frequent and less frequent device users that might account for mental health differences)? We should also recall a general bias in mental health or social science research: Findings that raise alarm or confirm widely-held beliefs are more likely to be published and publicized.
Nevertheless, increasing rates of youth suicide are a well-established fact. And community surveys do consistently show rising levels of distress and suicidal ideation among adolescents. And I don’t think we’ve seen any evidence that time on social media or gaming has a positive effect on mental health or suicide risk.
The evidence to date raises questions but doesn’t give clear answers. For now, I’ll remain cautious – both in what I’d recommend about screen time and what I’d say about the data behind any recommendation. It’s certainly prudent to recommend limits on screen time. But it’s not prudent to claim much certainty about cause and effect. Given the evidence we have, headline-grabbing alarms would not be warranted. I’d prefer not to be remembered as the Dr. Fredric Wertham of my generation.
I cheered about a talented and successful woman announcing that she is not ashamed to live with bipolar disorder. But I initially paused when reading the words “I’m bipolar.” That’s the sort of language that many mental health advocates discourage. I’ve been trying to unlearn language like that.
In the past, our language about mental health conditions often seemed to reduce people to diagnoses. We used to say (without thinking) that a person “is bipolar” or “is schizophrenic.” Or, even more reductionistically, we would refer to someone as “a bipolar” or “a schizophrenic.”
The disability rights community has sought to overcome such stereotyping by encouraging “people-first language.” The point is that people with a health condition or disability are – first and foremost – people. At the same time that someone is a person with bipolar disorder, they’re also a person with an interesting job. Or a person with an adorable cat. Or a person with a fantastic singing and song-writing talent. Or a person with a neighbor who uses the leaf blower too early on Sunday morning. People who have bipolar disorder have lots of other ordinary and extraordinary things in their lives, just like other people do.
But it’s really none of my business how Bebe Rexha wants to name or define herself. If she’s trying to break down stigma, then it may be helpful to confront it as directly as possible. Many marginalized groups have made a point of re-claiming or re-purposing labels that were historically pejorative or stigmatizing. The “Mad Pride” movement of mental health consumers is one example, re-claiming words like “mad” and “nuts” to disrupt stigma and cast off shame.
As for me, though, I’ll stick with the people-first language. I’ll be careful to say that a person has bipolar disorder or lives with bipolar disorder. And if one of my patients says that they are bipolar, I might even explain why I don’t say it that way. But I certainly wouldn’t tell them – or Bebe Rexha – what words to use. After all, Kaiser Permanente’s anti-stigma campaign is called “Find Your Words”, rather than “Use Our Words”. Bebe Rexha and anyone else can use their own words to talk about their own mental health. The right words are the ones that make it easiest to speak up.
It’s the time of year when backyard gardeners start to think about transplanting tomato seedlings from that tray in the sunny part of the kitchen to the real garden outside. Moving outdoors too soon is risky. Those little seedlings could get beaten down by the cold rain, nibbled by the escaped Easter bunnies running wild in my neighborhood, or decimated by cutworms. Gardeners in the Northeast and Midwest even need to worry about late-season snow. But you don’t want to wait too long, or you’ll end up with root-bound, leggy plants and a disappointing tomato crop.
Those of us who develop mental health interventions often keep them indoors too long. When I look back on the history of Collaborative Care for depression, I think we waited too long before moving that intervention to the outdoor garden. We completed four trials of Collaborative Care within Group Health Cooperative (now Kaiser Permanente Washington) before the large, multi-site IMPACT trial tested Collaborative Care across the great outdoors of eight health systems. Perhaps IMPACT could have happened five years (and two randomized trials) earlier.
For those who develop new treatments or programs, it’s only natural to want to keep those delicate young plants indoors. Spring weather is unpredictable, and a promising new treatment or program might fail for random reasons. We might relax our over-protectiveness if we could learn more from those “failures”. I’ll again cite the IMPACT trial as an example to follow. Collaborative Care for depression proved effective across a variety of health systems, patient populations, and enrollment strategies. But that consistency was not a presumption or even a goal. If Collaborative Care had “failed” in some settings, we could have learned valuable lessons for future implementation. If we intentionally plant our tomato seedlings in all sorts of conditions, some might not flourish. But we might learn things we can implement next spring.
This post will be a nerdy one. I want to split some hairs about use of the word “predict”. But I think they are hairs worth splitting.
By the dictionary, prediction is always about the future. But we researchers sometimes use the word more loosely. We might say that “A is a significant predictor of B” even when A and B occur at the same time. For example, an evaluation of screening to identify suicidal ideation might report how well screening questions “predict” risk assessed by an expert clinician during the same visit. That’s really an example of detection or diagnosis rather than prediction. But I don’t expect that my hair-splitting will end use of the word “predict” to describe what’s really a cross-sectional association or correlation. It’s a longstanding practice, and saying that “A predicts B” just sounds more important than “A is associated with B.” Still, we should remember Yogi Berra’s point that predicting the future is a very different task from predicting the present.
Yogi’s wisdom is especially relevant to our interpretation of Positive Predictive Value or PPV. Ironically, the concept of PPV was originally developed for a scenario that is not, by the dictionary definition, prediction. Instead, PPV was intended to measure the performance of a screening or diagnostic test. Traditionally, PPV is defined as the proportion with a positive test result who actually have the condition of interest – not the proportion who will develop that condition at some point in the future. The PPV metric, however, can also be used to measure the performance of a prediction tool. Those are two very different tasks, and our thresholds for acceptable PPV would be quite different. Consider two examples from cardiovascular disease. If we are evaluating the accuracy of a rapid troponin test to identify myocardial infarction (MI) in emergency department patients with chest pain, we’d expect most patients with a positive troponin test to have an MI by a confirmatory or “gold standard” test. But if we are using a Framingham score to predict future MI, we would not require nearly so high a PPV before recommending treatment to reduce risk. In fact, most guidelines recommend cholesterol-lowering medications when predicted 10-year risk of MI is over 10%.
The utility of any PPV depends in part on overall prevalence. If 25% of emergency department patients with chest pain are having an MI, then a PPV of 30% is hardly better than chance. But if the average risk of MI is only 1%, then a PPV of 10% could be quite useful. And our threshold for PPV also depends on the intervention we might recommend. We’d likely require a higher PPV before recommending angioplasty than we would for recommending statins. Predictions about the future more often involve rarer events and less intensive interventions.
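Bayes’ rule makes the prevalence point concrete. The sketch below uses hypothetical test characteristics (85% sensitivity, 90% specificity – chosen only for illustration) to show how the same test yields very different PPVs in a high-prevalence diagnostic setting versus a low-prevalence prediction setting.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule:
    P(condition | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Diagnostic setting: ~25% of ED chest-pain patients are having an MI.
print(round(ppv(0.85, 0.90, 0.25), 3))  # 0.739
# Prediction setting: ~1% average risk of a future event.
print(round(ppv(0.85, 0.90, 0.01), 3))  # 0.079
```

With identical sensitivity and specificity, PPV falls from about 74% to about 8% as prevalence drops from 25% to 1%. But that 8% is still an eight-fold enrichment over the 1% base rate – potentially very useful for deciding who should receive a low-intensity preventive intervention.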
I think that confusion about these different uses of PPV has led to exaggerated pessimism regarding tools to predict suicidal behavior. In that case, our goal is actual prediction – predicting a future event rather than detecting or diagnosing something already present. Current prediction models can accurately identify people with a 5% or 10% risk of suicide attempt within 90 days. That would be poor performance for a diagnostic test, but it’s certainly good enough for actual prediction. A 5% risk over 90 days is a much higher risk threshold than we use to prescribe statins to prevent MI or even anticoagulants to prevent stroke in atrial fibrillation.
Translating Yogi Berra’s wisdom into technical terms, we might say: “We would accept a much lower PPV for a prediction model than we would for a diagnostic test.”
What about that other famous Yogi statistical quote: “Ninety percent of the game is half mental.” Was he trying to explain the difference between mean and median in a skewed distribution? You never know…
Seattle had one of its rare snowy days last week. Seeing cars slide sideways down our hills reminded me that friction is sometimes our friend.
The modern information economy is all about eliminating data friction. Extracting, combining, and analyzing all types of data get faster and cheaper by the day. As data friction disappears, it seems that nearly any information about nearly anyone is available for purchase somewhere. Since Google controls my phone, Google can sell the information that I go backcountry skiing before sunrise and follow Basset Hounds on Instagram. That doesn’t trouble me. I’m not sure if backcountry-and-Bassets places me in any marketing sub-demographic. But I’m sure my phone knows things about me that I would not want Google to sell. And some people might consider the backcountry-and-Bassets combination to be sensitive information.
Our research group has had some interesting discussions about whether our models to predict suicidal behavior would be even more accurate if we included information from credit agencies. There is no technical or regulatory friction stopping us – or even slowing us down. Credit agency data are readily available for purchase. I suspect many large health systems already use credit or financial data for business purposes. Linking those financial data to health system records is a minor technical task. Machine learning tools don’t care where the data came from; they just find the best predictors. I would expect that information regarding debt, bankruptcy, job loss, and housing instability would help to identify people at higher risk for suicide attempt or suicide death. Whether credit data would improve prediction accuracy is an empirical question, and empirical questions are answered by data. Whether we should use credit data at all is not an empirical question. Even if linking medical records to credit agency data could significantly improve our prediction of suicidal behavior, we may decide that we shouldn’t find out. As data friction disappears, our ethical choices become more consequential.
I think that recent discussions in the medical literature and lay press about the perils of artificial intelligence in health care have mixed two different concerns. One concern is that faulty data or inappropriate methods might lead to inaccurate or biased prediction or classification. The other is that prediction or classification is intrusive and undermines a fundamental right to privacy. We would address the former concern by trying to improve the quality of our data and the sophistication of our prediction models – by further reducing friction. But we would address the latter by slowing down or stopping altogether.
Driving on streets free of friction demands close attention. If I can’t count on friction to save me from risky driving decisions, I’ll need to look far ahead and anticipate any need to slow down, stop, or change direction.
“Helicopter Parents” is the derisive term for those over-protective parents who won’t let their children experience any failure – or even any actual challenge. The helicopter critique is supported by anecdotes of parents overly involved in children’s homework or parents intervening too vigorously when children have disappointing experiences in school or sports. “Participation Trophies” given to every member of a sports team are held up for special ridicule.
I’m actually ambivalent about the helicopter parent critique. Self-reliance is certainly a good thing. But I doubt it’s possible for a child to be loved or encouraged too much.
I think, however, that the helicopter critique does apply to the way we researchers often over-protect our theories. Loving our theories too much, we protect them too vigorously from empirical challenge. If we start with the aim of supporting theory rather than testing it, then our research design and methods emphasize orthodoxy over rigor. The result is comfortable confirmation without any real learning.
Intervention research seems especially prone to placing the theory cart before the evidence horse. Theory is often cited as preliminary evidence that an intervention will prove effective, thereby justifying research designed to confirm theory rather than test it. A subsequent confirmatory finding is welcomed as “positive”. A “negative” finding more often prompts minor adjustments in methods than any alteration or abandonment of the motivating theory – because the research was designed to coddle the theory rather than reject it. That’s the research equivalent of a Participation Trophy – all theories must be winners just for having tried.
Theory was made to be broken – or at least rigorously tested. Some of our theories will stand up to rigorous challenge. Some will reveal their limitations and need for modification. Some may just collapse. Let’s really try to find out.
Last month I moved to a temporary office to make way for painting and installing new carpet. Being forced to pack up every book and file folder was actually a good thing. I recycled stacks and stacks of paper journals. Saving those was a habit left over from the last century. I cleaned out file drawers full of old paper grant proposals and IRB applications, relics from the days before electronic submission. I purged stacks and stacks of paper reprints (remember those?) from the 1990s. I recycled books I hadn’t opened in 15 years. Tidying up did feel good, but my purging frenzy paused when I came to this stack of books from my residency days.
Those old books were a greatest hits collection from some of the legends of psychoanalysis and psychoanalytic psychotherapy. Some of the authors were my own teachers (Jerry Adler, Terry Maltsberger). Some were the heroes of my teachers (Heinz Kohut, Michael Balint, Otto Kernberg, Elvin Semrad). Thirty years ago, those books were central to my thinking, or even my identity. But I hadn’t opened any of them since my last office move in 1999.
I paged through a few of them, trying to remember what had been so valuable back then. From my current perspective, I can’t say that psychoanalysis, or even psychoanalytic psychotherapy as practiced in 1980s Boston, can really succeed as a public mental health strategy. As a care delivery model, psychoanalysis scores poorly on all three of my “three As” tests. It’s not really affordable. It’s unlikely to ever be widely available. And it’s not always acceptable to the diverse range of people who live with mental health conditions.
But there is still some valuable wisdom in those old books. Studying Kohut certainly helped me to understand how a fragile sense of self disrupts emotion regulation and how persistent empathy can help to rebuild self-regulating capacity. And studying Kernberg helped me to understand the nonlinear relationship between the internal mental world and external reality. The persistent outreach and care management interventions that we now develop and evaluate do owe a significant debt to those psychoanalytic legends – and most especially to Kohut and Semrad.
So my psychoanalytic books survived this office move. To borrow some psychoanalytic terms, they continue to serve as healthy introjects. I’ll keep holding on to them as transitional objects.
Hans, the Calculating Horse, was a sensation of the early 1900s. He appeared to be able to count, spell, and solve math problems – including fractions! Only after careful investigation did everyone learn that Hans was just responding to unconscious nonverbal cues from his trainer. Hans couldn’t actually calculate, but he could sense the answer his trainer was hoping for.
But even if Hans could actually do arithmetic, a true calculating horse would still be no more than a novelty act. Even in the early 1900s, we had much more accurate tools for calculating. And we had (and still have) much more impressive and useful work for horses.
Some applications of machine learning in mental health seem (at least to me) similar to a performance by Clever Hans. They may involve “predicting” something we already believe and want to have confirmed. Or they may develop and apply a complex tool when existing simple tools are more than adequate. We can use big data and neural networks to mine social media posts or search engine queries and discover that depression increases in the winter. But that’s not news – at least for those of us living in Seattle in November.
As with calculating horses, we also have more impressive and useful work for machine learning tools to accomplish. I think that our work on prediction of suicidal behavior following outpatient mental health visits is a good example. Traditional clinical assessments are hardly better than chance. Simple questionnaires are certainly helpful, but still lack the sensitivity and ability to identify very high risk. There’s clearly an unmet need. Prediction models built from hundreds of candidate predictors can accurately identify patients who are at high or very high near-term risk for suicidal behavior. Our health systems have decided that those predictions are accurate enough to prompt specific clinical actions.
I think we could list some general characteristics of problems suitable for machine learning tools. In practical terms, we are interested in problems that are both difficult (i.e., we can’t predict well now) and important (i.e., we can do something useful with the prediction). In technical terms, prediction models will probably be most useful when the outcome is relatively rare, the true predictors are diverse, and the available data include many potential indicators of those true predictors. In those situations, complex models will usually outperform simple decision rules fit for human (or equine) calculation.
Horses are capable of many marvelous things. I’d pay to see performances by racing horses or jumping horses. And especially mini-horses. But counting horses or spelling horses….not so much.
Most of our measures and measurement tools were created in conference rooms or conference calls dominated by older white men – like me, I guess. Over time, those “expert opinion” measures acquire a patina of authority. As time passes, we can start to equate familiarity or habit with accuracy or validity. Like the biblical story about worshipping a false idol that we created ourselves, we start to see a gold standard rather than just a statue of a golden calf.
When we have questioned some specifics of the Antidepressant Medication Management measures, we sometimes hear that these measures are well-validated and time-tested. I’m tempted to respond, “But I was one of the people who made them up in the first place! And now I know better!”
We could tell a similar story about nearly any of our measures or metrics in mental health: diagnostic criteria, assessment questionnaires, structured interviews, quality metrics. We created them using everything we knew at the time. And (good news!) now we know more.
We shouldn’t forget the value of continuity in measuring the quality or outcomes of mental health care. Tracking improvement in care processes over time is a defining characteristic of learning health systems, and changing measures can interfere with our ability to accurately measure improvement. But rather than hold on to measures based on what we knew back then, our improvement efforts might be better served by using improved new measures. When possible, we should apply new standards to old data when comparing our past and current performance.
While I try to maintain a healthy skepticism about any specific measure, I am not at all skeptical about the importance of measurement. Challenging the specifics of any measure should be intended to improve accountability rather than avoiding it. I firmly believe that we can only improve what we choose to measure. And our measures should be one of the first things we aim to improve.
My wife handed me a recent issue of The New Yorker and recommended the Shouts and Murmurs column. It parodied a whistle-blowing data scientist testifying before Parliament about modern analytic methods. He grows increasingly frustrated as legislators can’t follow his explanations of eigenvectors and dimensionality reduction.
At first read, I didn’t think it was very funny. Then I realized: If you don’t think Shouts and Murmurs is very funny, then it’s probably about you.
Much of our work is dimensionality reduction, even if we don’t call it that. Models to predict suicidal behavior or outcomes of depression treatment are all about reducing tens or hundreds of characteristics to a single probability. Old-fashioned regression models are also a tool for dimensionality reduction; they just typically consider a smaller number of dimensions. Moving from the statistical to the clinical, diagnoses are also dimensionality reducers. For example, DSM criteria for the diagnosis of a depressive disorder take nine diverse characteristics and summarize them as a single classification. Going farther back in our psychological history, Sigmund Freud’s The Interpretation of Dreams was all about reducing dimensionality – explaining the wild diversity of human mental life in terms of a few basic instincts and countervailing defenses.
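The statistical version of that reduction can be sketched in a few lines. The weights below are invented purely for illustration (they are not from any real depression model); the structure is what matters: nine input dimensions collapse into a single predicted probability.

```python
import math

# Hypothetical weights for nine binary indicators (loosely evoking the
# nine DSM depression criteria); invented for illustration only.
weights = [0.8, 0.6, 0.5, 0.4, 0.4, 0.3, 0.3, 0.7, 0.9]
intercept = -4.0

def predicted_probability(features):
    """Reduce a nine-dimensional feature vector to one probability
    via a logistic (sigmoid) link."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-score))

print(round(predicted_probability([1] * 9), 2))  # all nine present -> 0.71
print(round(predicted_probability([0] * 9), 2))  # none present -> 0.02
```

Whatever richness those nine characteristics carry, the model’s output is a single number – which is exactly the point, for better and for worse.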
So it should not surprise us that objections to modern statistical dimensionality reduction echo older objections to diagnostic or psychoanalytic dimensionality reduction. Human beings are not one-dimensional. Or even two-dimensional. Our wide experience of joy, pain, hope, loneliness, passion, fear, and tenderness just cannot be contained in a single mathematical model or even a few instinctual drives.
Still, I put more stock in statistical dimensionality reduction than in diagnostic or psychoanalytic dimensionality reduction. To generalize, I’m skeptical about any reductionism determined by human “experts”. When we humans try to simplify complex reality, we too often over-reach. Statistical dimensionality reduction tends to be more realistic, or even humble. Our mathematical models only claim to explain or predict a single dependent variable. A statistical model to predict likelihood of psychiatric hospitalization makes no claims to predict success in relationships or finding meaning in life. Predicting risk of hospitalization is practically useful – in a one-dimensional way. Let’s leave it at that.
You may be about to ask, “Then what is an eigenvector?” I’ll pass on that one.