Q-Learning Social Cue Model

RL Perspectives on the Conspecific Cue Model · Gildea et al. (2025) · R. Soc. Open Sci.
Q(s,a) · s = count of others · limit = max distinguishable
Cue Reliability

Agent parameters
α (learning rate): 0.005
τ (choice stochasticity): 0.20
N agents: 10

Trial structure
Acquisition trials: 600
Reversal trials: 600

Dummy cues (CCM Fig. 4)

Social parameters
Sweep independent variable:
ρ (social reward, fixed): 0.00

Simulation
Ensemble runs: 400
Smooth window: 12
Configure parameters, then press ▶ Run Simulation to generate curves for each counting limit (limit) value.
Cue Reliability

Agent parameters
α (learning rate): 0.005
τ (choice stochasticity): 0.20
N agents: 10

Trial structure
Acquisition trials: 600
Reversal trials: 600

Dummy cues (CCM Fig. 4)

Social parameters
ρ (max social reward): 0.00
Counting limit

Animation
Speed (trials/sec): 5
Configure parameters, then press ▶ Play or ▸ Step to run a single trial sequence.
From Rescorla–Wagner to Q-learning

Behavioral scientists and engineers arrived at the same learning rule from different directions. In Rescorla–Wagner, associative strength updates in proportion to prediction error — the gap between reward received and reward predicted. In Q-learning, an agent maintains Q(s, a): the expected reward for choosing action a in situation s. The update rule is identical: Q(s,a) ← Q(s,a) + α·δ, where δ = r − Q(s,a). The formal connection is exact — Rescorla–Wagner is Q-learning without temporal discounting.
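The equivalence can be checked numerically. The sketch below is a minimal illustration rather than the widget's own code: `rw_update` and `q_update` are hypothetical helper names, and the reward sequence is arbitrary. The general Q-learning target is written out in full so that setting γ = 0 visibly collapses it to the Rescorla–Wagner form.

```python
def rw_update(v, lam, beta):
    """Rescorla-Wagner: move associative strength v toward outcome lam."""
    return v + beta * (lam - v)


def q_update(q, r, alpha, gamma=0.0, max_next_q=0.0):
    """One-step Q-learning update for a single (state, action) entry."""
    delta = r + gamma * max_next_q - q  # temporal-difference error
    return q + alpha * delta


# With gamma = 0 the two rules trace out the same values step by step.
v = q = 0.0
for reward in [1, 1, 0, 1, 0, 0, 1]:
    v = rw_update(v, lam=reward, beta=0.005)
    q = q_update(q, r=reward, alpha=0.005, gamma=0.0)
    assert abs(v - q) < 1e-12
```

Here β plays the role of α and λ the role of r; with any nonzero γ and a nonzero next-state value, the two updates would diverge.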

What Q-learning makes explicit is the state s: everything the agent observes before choosing. Standard associative learning models populate s with environmental cues alone. The Conspecific Cue Model (CCM; Gildea et al. 2025) proposes that co-presence of conspecifics at a food site is just another cue in s. This widget explores the same proposal using tabular Q-learning: each option's state is the observed social co-presence, capped at the agent's counting limit. Each (option, perceived-count) pair gets its own Q-value, updated by the standard TD rule Q(s,a) ← Q(s,a) + α·δ. When limit=0, all options share state 0 and the agent is a pure individual learner with one Q-value per option (no weights, no gradient, no λ).

The model equations
  • count_a: integer number of other agents (and dummies) at option a on the previous trial (decisions are simultaneous, so the previous trial is used as a proxy)
  • s_a: perceived count at option a, s_a = min(count_a, limit); the raw headcount capped at the animal's counting ceiling
  • limit: counting limit, the maximum number of conspecifics the agent can individually distinguish. limit=1: only presence/absence; limit=∞: full resolution up to group size
  • Q(s, a): expected reward for choosing option a when s others are perceived there; one table entry per (option, perceived-count) pair, all initialized to 0
  • ρ ∈ [0,1]: social reward per perceived conspecific. Each conspecific the agent can distinguish at its chosen option adds ρ to the effective reward, up to the counting limit. ρ=0: no social attraction; ρ=1: each perceived rat is worth as much as food. With large groups and high counting limits, social reward can dominate task reward.
  • α: learning rate
  • τ: choice stochasticity (softmax temperature)

s_a = min(count_a, limit)
    perceived count: raw headcount capped at the counting limit
δ = (r + ρ·s_chosen) − Q(s_chosen, chosen)
    prediction error; r ∈ {0,1} is the task reward and ρ·s_chosen is the social reward (ρ per perceived conspecific)
Q(s_chosen, chosen) ← Q(s_chosen, chosen) + α·δ
    standard tabular Q update, identical to R–W at γ=0
full form: r + γ·max_a′ Q(s′, a′) − Q(s, a), with γ=0 here
    episodic trials: one choice, one reward, no future states to discount
P(a|s) = exp(Q(s_a, a)/τ) / Σ_a′ exp(Q(s_a′, a′)/τ)
    softmax choice rule; replaces CCM's horse-race timing

Each agent maintains its own Q-table and receives its own reward — no communication, shared reward, or centralized training. Social information enters only through the state bucket.
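The equations above can be sketched as a small class. This is an illustrative reading of the model, not the widget's implementation: the class name `SocialQAgent`, its method names, and the default `limit=1` are assumptions, and the caller is assumed to pass in the previous trial's headcounts.

```python
import math
import random
from collections import defaultdict


class SocialQAgent:
    """Tabular learner whose state is the capped headcount at each option."""

    def __init__(self, alpha=0.005, tau=0.20, rho=0.0, limit=1):
        self.alpha, self.tau, self.rho, self.limit = alpha, tau, rho, limit
        self.q = defaultdict(float)  # (perceived_count, option) -> Q, starts at 0

    def perceive(self, count):
        """s_a = min(count_a, limit)."""
        return min(count, self.limit)

    def choose(self, counts, rng=random):
        """Softmax over Q(s_a, a); counts[a] is last trial's headcount at a."""
        states = [self.perceive(c) for c in counts]
        prefs = [self.q[(s, a)] / self.tau for a, s in enumerate(states)]
        top = max(prefs)  # subtract the max for numerical stability
        weights = [math.exp(p - top) for p in prefs]
        action = rng.choices(range(len(counts)), weights=weights)[0]
        return action, states[action]

    def update(self, state, action, reward):
        """delta = (r + rho * s_chosen) - Q(s_chosen, chosen)."""
        delta = (reward + self.rho * state) - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * delta
        return delta


agent = SocialQAgent(limit=2)
action, state = agent.choose(counts=[7, 2])  # 7 others at option 0, 2 at option 1
agent.update(state, action, reward=1)
```

Passing `limit=math.inf` gives the full-resolution case, and `limit=0` maps every headcount to state 0, which recovers the pure individual learner.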

The dummy conditions (CCM Figure 4)

Two control conditions ask whether any group-training advantage requires specifically social information, or arises from having more cues of any kind.

Fixed dummies: k non-learning agents permanently assigned to fixed positions. By default they are split roughly equally between options, adding consistent but uninformative social cues. If fixed dummies produce curves similar to social agents, the group advantage is not unique to social information; any additional consistent state signal would do.

Simulator extension beyond CCM: dummy placement can be set explicitly — Equal (default, half at each option), Correct (all dummies always at the currently rewarded option, simulating knowledgeable conspecifics), or Wrong (all dummies at the unrewarded option, simulating misinformation). The Correct condition tests whether genuinely informative social cues — rather than merely consistent ones — produce a different learning advantage.

Random dummies: k non-learning agents that independently pick a random option each trial, uncorrelated with reward. Their shuffling corrupts the perceived count s_a, diluting its reliability. Degraded performance in this condition would indicate that the value of social cues depends on their consistency, not merely their presence.
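The dummy conditions described above could be generated along the following lines. This is a sketch under stated assumptions: the function name `dummy_counts`, the mode strings, and the round-robin rule for the "equal" split are hypothetical, and the "wrong" mode assumes a two-option task.

```python
import random


def dummy_counts(k, mode, rewarded, n_options=2, rng=random):
    """Return how many of k dummies sit at each option on one trial.

    mode: "equal", "correct", "wrong" (fixed dummies) or "random".
    rewarded: index of the currently rewarded option.
    """
    counts = [0] * n_options
    if mode == "equal":
        for j in range(k):  # round-robin gives a near-even split
            counts[j % n_options] += 1
    elif mode == "correct":
        counts[rewarded] = k
    elif mode == "wrong":
        # Assumes two options, so "the unrewarded option" is unambiguous.
        counts[1 - rewarded] = k
    elif mode == "random":
        for _ in range(k):  # each dummy picks independently every trial
            counts[rng.randrange(n_options)] += 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return counts
```

Calling the function once per trial with the current `rewarded` index means the "correct" and "wrong" placements follow the reward across the reversal, while "equal" returns the same split every time.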

Correspondence to CCM (Gildea et al. 2025, R. Soc. Open Sci.)
| Q-learning (this widget) | CCM | Notes |
| --- | --- | --- |
| Q(s_a, a) | ΣV at option a | Expected reward / total associative strength |
| Q(0, a) (s=0 bucket) | V(Ni) | Value at zero social context ≈ env-only strength |
| Q(s, a) for s>0 | V(Ni)+V(Si) | Value at non-zero social context; interaction effects captured automatically |
| count_a | count of conspecifics | CCM notionally assigns a separate weight per individual; in practice all are interchangeable |
| limit | βsoc/βnon (approx.) | Both govern social influence; here a perceptual ceiling, not a learning rate ratio |
| α·δ (tabular update) | β[λ−ΣV] (R–W) | Identical prediction-error logic at γ=0; no separate weights or gradient |
| Softmax (τ) | Horse-race timing | Both convert learned values to probabilistic choice |

Key difference from CCM: CCM uses separate learning rates βsoc and βnon, conflating perceptual salience with learning rate. Here the counting limit is a purely perceptual parameter — it determines how finely the animal can distinguish social density, with no direct effect on the learning rate itself. The Q-update (α·δ) is the same regardless of which state is visited. There are no separate social weights and no decomposition. Importantly, different (option, perceived-count) pairs get fully independent Q-values, so interaction effects between social context and option identity are captured automatically without any special machinery.
CCM is a standard RL model in disguise

The Conspecific Cue Model (CCM; Gildea et al. 2025) was developed within the associative learning tradition, building on Rescorla–Wagner. But its mathematical structure is identical to a well-known class of reinforcement learning model — one that places it in direct dialogue with the tabular Q-learning model simulated in the other tabs of this widget.

Both approaches are variants of Q-learning. What distinguishes them is how the Q-function is represented — the prediction-error update rule and Bellman target are the same throughout. This gives a spectrum:

  • Tabular (other tabs) — one free parameter per (state, action) pair; maximally flexible, no structural assumptions, no generalization across states.
  • Linear Function Approximation / Linear FA (this tab; equivalent to CCM) — Q is a weighted sum of observable cues; generalizes across states, but assumes cue contributions are additive and independent.
  • Nonlinear FA (e.g., deep Q-networks) — a neural network approximates Q; can represent interactions and complex structure, at the cost of additional assumptions and training instability.

The tabular model explored in this widget encodes social context directly in the state, giving each (option, perceived-count) pair its own independent Q-value. CCM instead represents value as a learned weighted sum of environmental and social cues — exactly the Linear FA structure. The additivity assumption this entails (that V(N) and V(S) contribute independently) is precisely what CCM makes when it sums the two associative strengths. Recognizing this makes that assumption explicit and testable, and connects CCM to a large literature on when linear approximations succeed or fail.

In CCM, each option i is described by two observable quantities: Ni, the environmental cue (shape, light, etc.), and Si, the social co-presence signal (how many conspecifics are there). The animal learns a separate associative strength for each — V(N) and V(S) — and the total predicted value of an option is their sum:

Ni = 1 (env cue present)
    environmental cue at option i (always present), using CCM's own notation
Si = count of conspecifics at option i
    social co-presence signal at option i
Q(i) = wenv·Ni + wsoc·Si
    predicted value of option i; wenv and wsoc are single shared weights learned from experience
δ = r − Q(chosen)
    prediction error, identical to Rescorla–Wagner's λ − ΣV
wenv ← wenv + βnon·δ·Nchosen
    env weight update; Nchosen=1 always, so this is just βnon·δ
wsoc ← wsoc + βsoc·δ·Schosen
    social weight update; scales with how many conspecifics were at the chosen option

Both weights start at zero and are updated every trial by the same prediction-error rule. This is standard Linear FA Q-learning — the weights are the animal's learned beliefs about how predictive each type of cue is, and those beliefs are revised whenever reality deviates from expectation.
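The update rules above translate almost line for line into code. The following is a minimal sketch, not CCM's published implementation: the class name `LinearFAAgent` and the learning-rate values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
class LinearFAAgent:
    """CCM-style value: Q(i) = w_env * N_i + w_soc * S_i, with shared weights."""

    def __init__(self, beta_non=0.005, beta_soc=0.005):
        self.beta_non, self.beta_soc = beta_non, beta_soc
        self.w_env = 0.0  # both weights start at zero
        self.w_soc = 0.0

    def value(self, social_count, env_cue=1):
        return self.w_env * env_cue + self.w_soc * social_count

    def update(self, social_count, reward, env_cue=1):
        """Apply one prediction-error step for the chosen option."""
        delta = reward - self.value(social_count, env_cue)
        self.w_env += self.beta_non * delta * env_cue
        self.w_soc += self.beta_soc * delta * social_count
        return delta


agent = LinearFAAgent(beta_non=0.1, beta_soc=0.1)
delta = agent.update(social_count=3, reward=1)  # chosen option had 3 others
```

Because the social update is multiplied by the headcount, a single rewarded visit to a crowded option moves `w_soc` further than `w_env` even when the two learning rates are equal.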

Term-by-term correspondence
| CCM (Gildea et al. 2025) | Linear FA / RL | Notes |
| --- | --- | --- |
| Ni | env cue at option i | Always 1 in both models (cue is present or absent) |
| Si | social co-presence at option i | Count of conspecifics present; same quantity, same role |
| V(Ni) = wenv·Ni | wenv·Ni | Associative strength of the env cue; identical expressions |
| V(Si) = wsoc·Si | wsoc·Si | Associative strength of social co-presence; identical expressions |
| ΣV = V(N)+V(S) | Q(i) = wenv·Ni + wsoc·Si | Total predicted value; identical |
| λ − ΣV | δ = r − Q(chosen) | Prediction error; identical; λ in R–W = reward magnitude r in RL |
| βnon·δ·Ni | βnon·δ·Ni | Env weight update; identical; Ni=1 so reduces to βnon·δ |
| βsoc·δ·Si | βsoc·δ·Si | Social weight update; identical; asymmetric rates are standard RL practice |
| Horse-race timing rule | Softmax action selection | Both convert learned values into probabilistic choice |

The correspondence is exact. CCM's βsoc/βnon asymmetry is not a structural departure from RL: separate learning rates per cue type are a standard implementation choice in Linear FA, often used when some cues are believed to be more volatile or salient than others.

Additional flexibility in the RL framing: Linear FA also supports option-specific weights — wenv[i] and wsoc[i] that can differ across options, allowing the animal to learn that social cues at option A are more predictive than at option B. CCM uses shared weights (one wenv, one wsoc for all options), which is the simpler and more parsimonious choice. The per-option extension is straightforward and may be worth exploring empirically.
What the RL framing adds

Recognizing CCM as Linear FA immediately connects it to a large literature on when this class of model works well and when it doesn't — and clarifies how it differs from the tabular Q model simulated in the other tabs:

1. The additivity assumption and interaction effects. The CCM-analogous Linear FA model represents value as a weighted sum of independent cues. Environmental and social cues contribute independently, with no way to represent interactions. A single shared wsoc summarizes how reward-predictive social co-presence has been on average — it cannot learn that the same social cue means different things in different circumstances. The tabular Q model simulated in the other tabs does not have this constraint: each (option, social-count) pair gets its own independent Q-value, so interaction effects are representable in principle.

Consider the reversal task. During acquisition, many conspecifics at option A reliably signals reward. During reversal, the crowd is still at A for the first several trials — now as a misleading signal. A CCM/Linear FA agent must reverse the sign of wsoc to accommodate the flip, which is slow because the same weight is fighting against a well-learned history.

The tabular Q model in this widget handles the same situation differently. As the crowd shifts during reversal, the focal agent increasingly encounters social counts it rarely saw during acquisition — states with little learned history and therefore fast-updating Q-values. The redistribution of conspecifics effectively moves the agent into a novel region of experience, allowing the new contingency to be learned quickly. This is the tabular model's key structural advantage: each (option, social-count) pair is a genuinely independent context, not a single shared weight.
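A rough calculation illustrates one ingredient of this argument. With a constant α, every table entry moves by the same fraction of its prediction error per update, so what differs between a familiar and an unfamiliar state is how much misleading value has to be unlearned. The sketch below is an assumption-laden illustration: the helper `trials_until_below`, the starting value 0.95, and the threshold 0.5 are all hypothetical, and it counts updates to a single entry rather than simulating the full task.

```python
def trials_until_below(q_start, target, threshold, alpha):
    """Count updates Q <- Q + alpha * (target - Q) until Q < threshold.

    Assumes target < threshold, otherwise the loop would never finish.
    """
    q, n = q_start, 0
    while q >= threshold:
        q += alpha * (target - q)
        n += 1
    return n


alpha = 0.005  # the widget's default learning rate
# An entry that learned "reward here" during acquisition and now earns nothing.
stale = trials_until_below(q_start=0.95, target=0.0, threshold=0.5, alpha=alpha)
# A rarely visited entry still sits at its initial value of 0.
fresh = trials_until_below(q_start=0.0, target=0.0, threshold=0.5, alpha=alpha)
print(stale, fresh)
```

The stale entry needs on the order of a hundred unrewarded visits before its value drops below the threshold, whereas the fresh entry never held a misleading value in the first place.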

True interaction effects — where the value of a social cue depends on which environmental cue is present — are actually representable in the tabular Q model on the other tabs, because each (option, social-count) pair gets its own independent Q-value with no additivity constraint. CCM/Linear FA cannot represent such interactions because wsoc is a single number that applies regardless of context. In the current task design this distinction does not arise — the environmental cue is always present at each option, so there is no env-cue variation to interact with. But in a richer task where environmental cues varied across trials, the tabular model could learn that "social cues matter when the env cue is also present, but not otherwise" — a conjunction CCM cannot express. Whether animals represent such conjunctions is an empirical question the reversal paradigm could probe with appropriate modifications.

2. What "salience" means mechanically. In CCM, βsoc controls both how fast social associations are learned and how strongly they influence behavior — these are conflated in a single parameter. In the Linear FA framing they separate naturally: the learning rate governs update speed, while the weight magnitude governs influence on Q(i). The tabular Q model sidesteps this question entirely: there is no salience parameter, only a counting limit that determines what the animal can perceive. What gets learned follows from what gets observed.

3. The state augmentation question. Linear FA and CCM ask: given that social cues are present, how do their learning dynamics compare to environmental cues? The tabular Q model in this widget asks a more fundamental prior question: does having access to social context at all — at whatever granularity — change what the agent can learn? The counting limit manipulation probes this directly, asking how coarsely social context can be perceived before its benefits disappear. These are complementary questions, not competing ones.

Do they predict the same things?

For the core CCM prediction (a reversal advantage for collectively trained animals), the two models agree qualitatively but for different reasons.

In CCM / Linear FA, the reversal advantage emerges from wsoc acting as an early-warning signal: socially informed animals begin updating when they observe their conspecifics shifting toward the new correct option, before environmental prediction errors alone would drive learning. The asymmetric βsoc rate amplifies this effect.

In the tabular Q model, the reversal advantage depends on state-space structure. Social context provides discriminating states that shift as conspecifics redistribute after reversal, moving agents into less-familiar territory where Q-values update quickly. Coarser counting limits reduce this effect — predicting that animals with limited social numerosity show weaker reversal advantages, a prediction CCM does not make.

Both models predict that individually trained animals (no social context) show slower reversal. But the tabular model additionally predicts a non-monotonic effect of social perception granularity: very fine-grained social perception can hurt reversal by over-specifying social states and building up rich history that must be overwritten (visible in the Learning Curves tab at limit=∞). CCM has no analog to this prediction.
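One simple way to see what over-specifying social states means is to count how many table entries an agent could ever visit at each counting limit. The sketch below assumes a two-option task and the default group of 10 agents with no dummies; the function name `n_table_entries` is hypothetical.

```python
import math


def n_table_entries(limit, n_agents=10, n_dummies=0, n_options=2):
    """Number of (option, perceived-count) entries one agent can ever visit."""
    max_others = (n_agents - 1) + n_dummies  # everyone else at one option
    n_states = min(limit, max_others) + 1    # perceived counts 0..cap
    return n_options * n_states


for limit in [0, 1, 2, 4, math.inf]:
    print(limit, n_table_entries(limit))
```

The same number of trials is spread over more entries as the limit grows, so each entry is visited less often. This count is only one ingredient of the effect; it does not capture which states are actually reached during acquisition and reversal.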

© 2026 Theodore P. Pavlic · MIT License