Modeling Avoidance in Mood and Anxiety Disorders Using Reinforcement Learning

Background: Serious and debilitating symptoms of anxiety are the most common mental health problem worldwide, accounting for around 5% of all adult years lived with disability in the developed world. Avoidance behavior (avoiding social situations for fear of embarrassment, for instance) is a core feature of such anxiety. However, as for many other psychiatric symptoms, the biological mechanisms underlying avoidance remain unclear.
Methods: Reinforcement learning models provide formal and testable characterizations of the mechanisms of decision making; here, we examine avoidance in these terms. A total of 101 healthy participants and individuals with mood and anxiety disorders completed an approach-avoidance go/no-go task under stress induced by threat of unpredictable shock.
Results: We show an increased reliance in the mood and anxiety group on a parameter of our reinforcement learning model that characterizes a prepotent (pavlovian) bias to withhold responding in the face of negative outcomes. This was particularly the case when the mood and anxiety group was under stress.
Conclusions: This formal description of avoidance within the reinforcement learning framework provides a new means of linking clinical symptoms with biophysically plausible models of neural circuitry and, as such, takes us closer to a mechanistic understanding of mood and anxiety disorders.

Self-report symptoms split by sub-diagnosis.

Additional Task Details
The fractal cue, target detection task and the outcome were each presented for 1000 ms and separated by a 250 ms inter-trial interval (ITI). Each fractal cue signified one of the four experimental conditions, but this was not made explicit at the start of the experiment. Thus, subjects had to learn that each fractal image indicated both 1) which action (go = make a response; no-go = withhold a response) to perform during the target detection task and 2) the valence of the associated outcome (reward/no reward; punishment/no punishment). The meaning of the fractal cues was randomized across participants. In the target detection task, a circle was presented randomly on one side of the screen (50% of trials on the left). In the go experimental conditions (GW/GA), participants had to match the position of the circle by pressing the corresponding key (i.e., press the left key when the circle was on the left and vice versa). In the no-go experimental conditions (NGW/NGA), participants had to withhold any response (i.e., any response was recorded as incorrect). The circle was presented for 1000 ms regardless of response.
In the rewarded conditions (GW/NGW), correct responses were rewarded 80% of the time, but resulted in no win 20% of the time. Incorrect responses led to no win 80% of the time, but were rewarded 20% of the time. In the punishment conditions (GA/NGA), correct responses avoided punishment 80% of the time but led to a loss 20% of the time (and vice versa for incorrect responses). Wins were indicated by a happy face and a gain of 10 points. Losses were indicated by a fearful face and a 10 point deduction. These were purely hypothetical within the structure of the task (i.e. they did not translate into a financial bonus). A horizontal yellow bar indicated when participants neither won nor lost points. Faces were selected from the Ekman facial set and the genders of the faces were counterbalanced across participants.
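The outcome contingencies described above can be illustrated with a minimal Python sketch. The function and constant names are hypothetical and are not taken from the original task code; only the 80%/20% probabilities and the 10-point magnitudes come from the description above.

```python
import random

# Condition labels follow the text: GW/NGW are rewarded, GA/NGA are punished.
REWARDED = {"GW", "NGW"}
PUNISHED = {"GA", "NGA"}


def sample_outcome(condition, correct, rng):
    """Return the points change (+10, 0 or -10) for a single trial.

    A correct response yields the better outcome with probability 0.8,
    an incorrect response with probability 0.2.
    """
    p_better = 0.8 if correct else 0.2
    better = rng.random() < p_better
    if condition in REWARDED:
        return 10 if better else 0      # win vs. no win
    if condition in PUNISHED:
        return 0 if better else -10     # avoided loss vs. loss
    raise ValueError(f"unknown condition: {condition}")


rng = random.Random(0)
print(sample_outcome("GA", correct=True, rng=rng))
```

Because feedback is probabilistic, a single trial is not diagnostic of the correct action, which is why participants have to integrate outcomes over repeated presentations of each cue.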
Participants were informed about the probabilistic nature of the task but they were not told the action-outcome contingencies for each fractal cue. Instead, they were told that they had to learn the correct response for each fractal cue, which could be either a go response or a no-go response, by trial and error. The task was divided into 24 alternating safe and threat blocks (12 blocks of each) with the order of the safe and threat conditions counterbalanced across participants. A different set of fractal cues was used under threat and safe in order to avoid possible confounding effects from learning under the different conditions. The eight fractal cues for threat and safe (four in each condition) were counterbalanced across participants.
Each block had five trials per experimental condition (GW, GA, NGW, NGA), with a total of 20 trials per block. The trials were randomly presented within each block. There were thus a total of 240 trials (60 trials for each fractal cue) per safe or threat condition. The task lasted around 35 min with a single shock delivered in the third, seventh, tenth and twelfth threat blocks. These shocks were always presented in the ITI between trials (the 4th trial of the 3rd threat block, the 18th trial of the 5th threat block, the 10th trial of the 10th threat block and the 2nd trial of the 12th threat block). Critically, these shocks were presented to maximise manipulation efficacy (6) (see analysis of effect on choice behavior below). Prior to the start of the task, participants completed nine practice trials without the threat manipulation. Each outcome appeared three times and identical black images were used instead of fractal cues in order to familiarise participants with the task without confounding learning of the action-outcome contingencies.
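The trial counts above follow from the block structure, as the following Python sketch shows. It is an illustration only: `make_schedule` and its arguments are hypothetical names, and shock placement and cue assignment are deliberately left out.

```python
import random

CONDITIONS = ("GW", "GA", "NGW", "NGA")
TRIALS_PER_CONDITION = 5   # per block, from the text
BLOCKS_PER_CONTEXT = 12    # 12 safe and 12 threat blocks


def make_schedule(rng, first_context="safe"):
    """Build a list of (block_index, context, condition) tuples.

    Blocks alternate between safe and threat; trial order is shuffled
    independently within each block.
    """
    other = "threat" if first_context == "safe" else "safe"
    schedule = []
    for block in range(2 * BLOCKS_PER_CONTEXT):
        context = first_context if block % 2 == 0 else other
        trials = list(CONDITIONS) * TRIALS_PER_CONDITION  # 20 trials
        rng.shuffle(trials)
        schedule.extend((block, context, cond) for cond in trials)
    return schedule


schedule = make_schedule(random.Random(1))
n_threat = sum(1 for _, ctx, _ in schedule if ctx == "threat")
n_threat_ga = sum(1 for _, ctx, c in schedule if ctx == "threat" and c == "GA")
print(len(schedule), n_threat, n_threat_ga)
```

Passing `first_context="threat"` gives the other block order, corresponding to the counterbalancing across participants.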

Effect of Shocks
Comparing pooled performance on the five trials before and after the four shocks (i.e., 20 pre- and 20 post-shock trials) revealed no impact of the shock stimulation on accuracy (pre- vs post-

Post-hoc Correlational Analysis
If we test the hypothesis that increased mood and anxiety symptoms are associated with increased avoidance we see a weak one-tailed trend towards a relationship with trait anxiety

Model Inspired Basic Analysis
One of the main purposes of the model is to distill the behavioral data across all conditions into psychologically meaningful parameters. However, it is often possible to use the resulting understanding to discern echoes of the same effects in more direct analyses. In our case, the key parametric difference concerned the avoidance bias, which is expected to have a particularly strong negative effect on performance in the GA condition once learning has progressed far enough to produce a sufficiently negative value for the state (and thus a sufficiently strong no-go influence). Indeed, focussing on the final two quartiles of trials, there was a time*condition*group interaction on GA accuracy (F(1,99) = 4.5, ηp² = 0.04, p = 0.036).
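The reasoning above can be made concrete with a small Python sketch of how a Pavlovian avoidance bias might enter the choice rule. It uses a generic form common in go/no-go reinforcement learning models and is an assumption rather than the winning model's exact specification, which this section does not give. All function and parameter names (`go_probability`, `avoid_bias`, `approach_bias`, `go_bias`, `lapse`, `sensitivity`) are hypothetical.

```python
import math


def go_probability(q_go, q_nogo, state_value, go_bias=0.0,
                   avoid_bias=0.0, approach_bias=0.0, lapse=0.0):
    """Probability of a go response under an assumed Pavlovian-bias model.

    The state value is weighted by approach_bias when positive and by
    avoid_bias when negative, so a negative state value pushes towards
    no-go in proportion to avoid_bias.
    """
    pav = approach_bias if state_value > 0 else avoid_bias
    w_go = q_go + go_bias + pav * state_value
    w_nogo = q_nogo
    p_go = 1.0 / (1.0 + math.exp(-(w_go - w_nogo)))
    return (1.0 - lapse) * p_go + lapse / 2.0


def update(value, outcome, learning_rate, sensitivity=1.0):
    """Rescorla-Wagner style update, usable for action or state values."""
    return value + learning_rate * (sensitivity * outcome - value)


# GA cue after some learning: go is the better action, but the state
# value has become negative because losses occur in this condition.
q_go, q_nogo, v = -0.2, -0.6, -0.5
print(go_probability(q_go, q_nogo, v, avoid_bias=0.0))
print(go_probability(q_go, q_nogo, v, avoid_bias=2.0))
```

Under these assumptions, the second call returns a lower go probability than the first even though the action values favor go. This is the pattern that would reduce GA accuracy late in learning for participants with a larger avoidance bias.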

Model Fitting
Fitting our winning model using a hierarchical Bayesian approach implemented using the hBayesDM (hierarchical Bayesian modeling of Decision-Making tasks) toolbox (7)  We note that our model does a better job of fitting the trials that contribute to the avoidance bias parameter fitting (i.e. the avoid trials; Figure 4a) than the rewarded trials (especially NGW). This means that inference is based on the trials that are best captured by the model. Future work might seek to refine model components that improve the fit of the rewarded trials.