Nat Neurosci. 2019 Feb;22(2):297-306. doi: 10.1038/s41593-018-0310-2. Epub 2019 Jan 14.

Task representations in neural networks trained to perform many cognitive tasks


Guangyu Robert Yang et al. Nat Neurosci. 2019 Feb.

Abstract

The brain has the ability to flexibly perform many tasks, but the underlying mechanism cannot be elucidated in traditional experimental and modeling studies designed for one task at a time. Here, we trained single network models to perform 20 cognitive tasks that depend on working memory, decision making, categorization, and inhibitory control. We found that after training, recurrent units can develop into clusters that are functionally specialized for different cognitive processes, and we introduce a simple yet effective measure to quantify relationships between single-unit neural representations of tasks. Learning often gives rise to compositionality of task representations, a critical feature for cognitive flexibility, whereby one task can be performed by recombining instructions for other tasks. Finally, networks developed mixed task selectivity similar to recorded prefrontal neurons after learning multiple tasks sequentially with a continual-learning technique. This work provides a computational platform to investigate neural representations of many cognitive tasks.


Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

Fig. 1 | A recurrent neural network model is trained to perform a large number of cognitive tasks.
a, Schematic showing how the same network can potentially solve two tasks with or without clustering and compositionality. b, An example fully connected recurrent neural network (RNN) of rate units (middle, 1% of connections shown) receives inputs (left) encoding a fixation cue, stimuli from two modalities, and a rule signal that instructs the network which task to perform in a given trial. The network has 256 recurrent units (top right) and projects to a fixation output unit (which should be active whenever a motor response is unwarranted) and to a population of units selective for response directions (right). All units in the reference recurrent network have non-negative firing rates. All connection weights and biases are modifiable by training with a supervised learning protocol. c, The network successfully learned to perform all 20 tasks. d,e, Psychometric curves in two decision-making (DM) tasks. d, Perceptual DM relies on temporal integration of information: network performance improves when the noisy stimulus is presented for a longer time. a.u., arbitrary units. e, In a multi-sensory integration task, the trained network combines information from two modalities to improve performance relative to when information is provided by a single modality alone. Ctx, context-dependent; Dly, delayed; DMC, delayed match-to-category; DMS, delayed match-to-sample; DNMC, delayed non-match-to-category; DNMS, delayed non-match-to-sample.
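As a concrete reading of the architecture in Fig. 1b, the sketch below implements a discrete-time rate RNN with non-negative (Softplus) units, fixation/stimulus/rule input channels, and fixation/direction outputs. Apart from the 256 recurrent units and 20 rule inputs stated in the paper, all sizes and constants (two 32-unit stimulus rings, leak and noise values) are illustrative assumptions, and the training loop is omitted; this is a minimal sketch, not the authors' implementation.

import numpy as np

# Minimal sketch of the network in Fig. 1b (not the authors' code).
# Assumed sizes: two 32-unit stimulus rings, 1 fixation input, 20 rule inputs;
# 256 recurrent units; 1 fixation output + 32 response-direction outputs.
N_RING, N_RULE, N_REC = 32, 20, 256
N_IN = 1 + 2 * N_RING + N_RULE          # fixation + two modalities + rules
N_OUT = 1 + N_RING                      # fixation output + response directions

rng = np.random.default_rng(0)
W_in = rng.normal(0, 1 / np.sqrt(N_IN), (N_REC, N_IN))
W_rec = rng.normal(0, 0.5 / np.sqrt(N_REC), (N_REC, N_REC))
W_out = rng.normal(0, 1 / np.sqrt(N_REC), (N_OUT, N_REC))
b_rec = np.zeros(N_REC)

def softplus(x):
    return np.log1p(np.exp(x))          # keeps firing rates non-negative

def run_trial(inputs, alpha=0.2, sigma_rec=0.05):
    """Discrete-time leaky rate dynamics; inputs has shape (T, N_IN)."""
    r = np.zeros(N_REC)
    rates, outputs = [], []
    for u in inputs:
        noise = sigma_rec * rng.standard_normal(N_REC)
        r = (1 - alpha) * r + alpha * softplus(W_rec @ r + W_in @ u + b_rec + noise)
        rates.append(r)
        outputs.append(W_out @ r)       # compared with target traces by the (omitted) supervised loss
    return np.array(rates), np.array(outputs)

# Example trial: only the fixation cue and one (hypothetical) rule unit are on.
T = 100
trial = np.zeros((T, N_IN))
trial[:, 0] = 1.0                       # fixation input
trial[:, 1 + 2 * N_RING + 3] = 1.0      # rule input for task 4 (hypothetical index)
rates, outputs = run_trial(trial)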
Fig. 2 | The emergence of functionally specialized clusters for task representation.
a, Neural activity of a single unit during an example task. Different traces correspond to different stimulus conditions. b, Task variances across all tasks for the same unit. For each unit, task variance measures the variance of activities across all stimulus conditions. c, Task variances across all tasks and active units, normalized by the peak value across tasks for each unit. Units form distinct clusters identified using the k-means clustering method based on normalized task variances. Each cluster is specialized for a subset of tasks. A task can involve units from several clusters. Units are sorted by their cluster membership, indicated by colored lines at the bottom. d, Visualization of the task variance map. For each unit, task variances across tasks form a vector that is embedded in the two-dimensional space using t-distributed stochastic neighbor embedding (tSNE). Units are colored according to their cluster membership. e, Change in performance across all tasks when each cluster of units is lesioned.
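The task-variance and clustering analysis of Fig. 2b-d can be paraphrased as the short sketch below: for every task, the per-unit variance of activity across stimulus conditions is computed, normalized by each unit's peak variance across tasks, and the resulting vectors are clustered with k-means. The array layout, the activity threshold defining "active" units, and the fixed number of clusters are assumptions (in the paper the number of clusters is chosen from the data); this is a sketch of the procedure, not the authors' analysis pipeline.

import numpy as np
from sklearn.cluster import KMeans

def task_variance(rates):
    """rates: (n_conditions, n_time, n_units) activity for one task.
    Returns per-unit variance across stimulus conditions, averaged over time."""
    var_across_conditions = rates.var(axis=0)        # (n_time, n_units)
    return var_across_conditions.mean(axis=0)        # (n_units,)

def cluster_units(rates_by_task, n_clusters=10, active_thresh=1e-3):
    """rates_by_task: dict mapping task name -> (n_cond, n_time, n_units) array."""
    tv = np.stack([task_variance(r) for r in rates_by_task.values()], axis=1)  # (units, tasks)
    active = tv.max(axis=1) > active_thresh          # drop units with negligible task variance
    tv_norm = tv[active] / tv[active].max(axis=1, keepdims=True)  # normalize by each unit's peak
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(tv_norm)
    return tv_norm, labels, active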
Fig. 3 | The activation function dictates whether clusters emerge in a network.
a, A total of 256 networks were trained, each with a different set of hyperparameters. b, Top, the networks are sorted by their numbers of clusters. Bottom, the hyperparameters used for each network are indicated by colors, as defined in a. Inset, the distribution of cluster numbers across networks. Only networks that reached the minimum performance of 90% for every task are shown. c–g, Breakdown of the number of clusters according to the activation function (c), network architecture (d), weight initialization (e), L1 weight regularization strength (f), and L1 rate regularization strength (g). In e–g, all networks from a that learned all tasks are included. In d, only networks with Softplus and ReLU activation functions are shown, as no RNN with Tanh or ReTanh activation successfully learned all tasks.
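One way to picture the sweep in Fig. 3a is as a Cartesian product over the five hyperparameters named in the caption. The specific values below are illustrative assumptions (chosen so that the grid happens to contain 256 configurations, matching the number of networks trained); only the five dimensions themselves come from the caption.

from itertools import product

# Hypothetical hyperparameter grid for the sweep in Fig. 3a.
grid = {
    "activation": ["softplus", "relu", "tanh", "retanh"],
    "architecture": ["leaky_rnn", "leaky_gru"],
    "init": ["diagonal", "random_orthogonal"],
    "l1_weight": [0.0, 1e-5, 1e-4, 1e-3],
    "l1_rate": [0.0, 1e-5, 1e-4, 1e-3],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
# Train one network per config, keep those reaching >= 90% on every task,
# then count clusters (e.g. with the k-means procedure sketched for Fig. 2).
assert len(configs) == 256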
Fig. 4 | A diversity of neural relationships between pairs of tasks.
For a pair of tasks, we characterized their neural relationship by the distribution of fractional task variance (FTV) over all units. a–e, In networks with the Softplus activation function, we observed five typical relationships: disjoint (a), inclusive (b), mixed (c), disjoint-equal (d), and disjoint-mixed (e). Blue: distribution for one example network. Black: averaged distribution over 20 networks. f–j, In networks with the Tanh activation function and the leaky GRU architecture (blue shaded background), the FTV distributions were largely mixed or equal for the same pairs of tasks. The pairs of tasks analyzed were DM 1 and Anti (a,f), Dly DM 1 and DM 1 (b,g), DM 1 and Ctx DM 1 (c,h), Ctx DM 1 and Ctx DM 2 (d,i), and DMC and DNMC (e,j). Results from networks with the leaky RNN architecture and Tanh activation function are not shown because none of them learned all tasks.
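A minimal sketch of how an FTV distribution such as those in a–e could be computed is given below, assuming the per-unit task variances from the Fig. 2 sketch and the usual normalization (TV_A − TV_B)/(TV_A + TV_B); the exact normalization used in the paper should be checked against its Methods.

import numpy as np

def fractional_task_variance(tv_a, tv_b, eps=1e-12):
    """Per-unit FTV between tasks A and B, given per-unit task variances.
    Near +1: unit engaged mainly in task A; near -1: mainly in task B;
    near 0: similarly engaged in both. Assumed form (TVa - TVb)/(TVa + TVb)."""
    tv_a, tv_b = np.asarray(tv_a), np.asarray(tv_b)
    return (tv_a - tv_b) / (tv_a + tv_b + eps)

# Example: FTV distribution over units for one task pair, as in Fig. 4a-e.
# tv_dm1 = task_variance(rates_by_task["DM 1"]); tv_anti = task_variance(rates_by_task["Anti"])
# ftv = fractional_task_variance(tv_dm1, tv_anti)
# hist, edges = np.histogram(ftv, bins=20, range=(-1, 1))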
Fig. 5 | Dissecting a reference network for the context-dependent DM tasks.
a, The FTV distribution for the Ctx DM 1 and 2 tasks in an example network. Most units are segregated into three groups on the basis of their FTV values. b, After lesioning all group 1 units together (green), the network could no longer perform the Ctx DM 1 task, whereas performance on the other tasks remained intact. In contrast, lesioning all group 12 units disrupted performance on all DM tasks. c,d, Average connections from modality 1 input units to recurrent units (c) and from recurrent units to output units (d). Modality 1 input units made strongly tuned projections to group 1 units. Input and output connections are sorted by each unit's preferred input and output direction, respectively, defined as the direction represented by the strongest weight. e, Network wiring architecture that emerged from training, in which group 1 and group 2 units excited themselves and strongly inhibited each other. Both group 1 and group 2 units excited group 12 units. Rec, recurrent. f, Group 1 (2) units received strong negative connections from the rule units representing the Ctx DM 2 (1) task. The boxplot shows the median (horizontal line), the confidence interval of the median obtained by bootstrapping (notches), the lower and upper quartiles (box), and the range of values (whiskers). g, Cluster-based circuit diagram summarizing the neural mechanism of the Ctx DM tasks in the reference network.
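The group lesions in b can be approximated by silencing the selected recurrent units. The sketch below does this by zeroing their input, recurrent, and output connections in copies of the weight matrices from the Fig. 1 sketch; the FTV threshold used to assign units to a group is hypothetical.

import numpy as np

def lesion_units(W_in, W_rec, W_out, unit_idx):
    """Return copies of the weight matrices with the given recurrent units silenced.
    Silencing is approximated by zeroing all connections into and out of the units,
    one simple way to implement the group lesions described in Fig. 5b."""
    W_in_l, W_rec_l, W_out_l = W_in.copy(), W_rec.copy(), W_out.copy()
    W_in_l[unit_idx, :] = 0.0            # no external drive to lesioned units
    W_rec_l[unit_idx, :] = 0.0           # no recurrent input to them
    W_rec_l[:, unit_idx] = 0.0           # no influence on other units
    W_out_l[:, unit_idx] = 0.0           # no contribution to the readout
    return W_in_l, W_rec_l, W_out_l

# Example: lesion all "group 1" units (index set chosen by a hypothetical FTV cutoff),
# then re-run every task and compare performance with the intact network.
# group1 = np.where(ftv > 0.8)[0]
# W_in_l, W_rec_l, W_out_l = lesion_units(W_in, W_rec, W_out, group1)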
Fig. 6 | Compositional representation of tasks in state space.
a, The representation of each task is the population activity of the recurrent network at the end of the stimulus presentation, averaged across stimulus conditions (black). Gray curves indicate the neural activities in individual task conditions. b, Representations of the Go, Dly Go, Anti, and Dly Anti tasks in the space spanned by the top two principal components (PCs) for a sample network. For better comparison across networks, the top two PCs are rotated and reflected (rPCs) to form the two axes (see Methods). c, The analysis described in b was performed for 20 networks, and the results are overlaid. d, Representations of the Ctx DM 1, Ctx DM 2, Ctx Dly DM 1, and Ctx Dly DM 2 tasks in the top two PCs for a sample network. e, The analysis described in d was performed for n = 40 independent networks.
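The state-space analysis in a,b reduces the task representations (condition-averaged activity at the end of stimulus presentation) to the top principal components. A minimal sketch, assuming the same (conditions x time x units) array layout as in the earlier sketches and using plain PCA rather than the rotated/reflected rPCs described in the Methods:

import numpy as np
from sklearn.decomposition import PCA

def task_representations(rates_by_task):
    """For each task, take population activity at the end of stimulus presentation,
    averaged over stimulus conditions (the task representations in Fig. 6a)."""
    return {task: rates[:, -1, :].mean(axis=0)        # (n_units,) per task
            for task, rates in rates_by_task.items()}

def project_to_top_pcs(reps, n_components=2):
    names = list(reps)
    X = np.stack([reps[t] for t in names])            # (n_tasks, n_units)
    coords = PCA(n_components=n_components).fit_transform(X)
    return dict(zip(names, coords))

# e.g. coords = project_to_top_pcs(task_representations(rates_by_task))
# A compositional layout would show, approximately,
# coords["Dly Anti"] close to coords["Anti"] + (coords["Dly Go"] - coords["Go"]).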
Fig. 7 | Performing tasks with algebraically composite rule inputs.
a, During training, a task is always instructed by activation of the corresponding rule input unit (left). After training, the network can potentially perform a task by activation or deactivation of a set of rule input units meant for other tasks (right). b, The network can perform the Dly Anti task well if given the Dly Anti rule input or the Anti + (Dly Go − Go) rule input. The network fails to perform the Dly Anti task when provided other combinations of rule inputs. c, Similarly, the network can perform the Ctx Dly DM 1 task well when provided with the Ctx Dly DM 2 + (Ctx DM 1 − Ctx DM 2) rule input. Circles represent the results of individual networks and bars represent median performances of 40 networks. The boxplot convention in b,c is the same as the one in Fig. 5f. d, Left, network performance during training of the Dly Anti task when the network is pre-trained on Go, Dly Go, and Anti tasks (red), or the Ctx DM 1, Ctx DM 2, and Ctx Dly DM 2 tasks (blue). Right, network performance during training of the Ctx Dly DM 1 task under the same pre-training conditions. Individual networks (light); mean across 40 networks (bold). All connections are adjusted during training. e, Similar to d, but only training the rule input connections in the second training phase.
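Composite rule inputs such as Anti + (Dly Go − Go) are signed sums of the one-hot rule vectors used during training. A minimal sketch of building such a composite input follows; the rule indices are hypothetical.

import numpy as np

# Hypothetical rule-unit indices; in the trained network each task has one rule input unit.
RULES = {"Go": 0, "Dly Go": 1, "Anti": 2, "Dly Anti": 3}
N_RULE = 20

def rule_vector(*weighted_rules):
    """Build a composite rule input, e.g. Anti + (Dly Go - Go) as in Fig. 7b."""
    v = np.zeros(N_RULE)
    for name, weight in weighted_rules:
        v[RULES[name]] += weight
    return v

standard = rule_vector(("Dly Anti", +1.0))                        # the rule used in training
composite = rule_vector(("Anti", +1.0), ("Dly Go", +1.0), ("Go", -1.0))
# Feeding `composite` to the rule input channels (in place of `standard`) tests whether
# the network performs Dly Anti from a rule combination it never saw during training.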
Fig. 8 | Sequential training of cognitive tasks.
a, Schematics of continual learning compared with traditional learning. Network parameters (such as connection weights) that are optimal for a new task can be destructive for old tasks. Arrows show changes of an example parameter θ when task 2 is trained after task 1 has already been learned. b, Final performance across all trained tasks with traditional (gray) or continual (red) learning techniques. Lines represent the results of individual networks. Only networks that achieved more than 80% accuracy on Ctx DM 1 and 2 are shown. c, Performance on all tasks during sequential training for example networks using the traditional (gray) or continual (red) learning technique. For each task, the black box indicates the period in which that task was trained. The DM 1 and 2 tasks were trained in the same block to prevent bias, as were the Ctx DM 1 and 2 tasks. d, FTV distributions for networks trained with the traditional (gray) or continual (red) learning technique. Solid lines are the median over 20 networks. Shaded areas indicate the 95% confidence interval of the median estimated by bootstrapping. e, FTV computed from single-unit recordings in the prefrontal cortex of a monkey performing Ctx DM 1 and 2 (ref. 11).
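The caption does not restate the continual-learning technique itself. As a generic stand-in, the sketch below shows the common quadratic parameter-anchoring penalty (EWC / synaptic-intelligence style), in which parameters important for previously learned tasks are discouraged from moving; whether this matches the exact penalty used in the paper should be checked against its Methods.

import numpy as np

def continual_learning_loss(task_loss, params, anchors, importances, c=1.0):
    """Quadratic parameter-anchoring penalty (EWC / synaptic-intelligence style),
    used here as a generic illustration of the continual-learning idea in Fig. 8a.
    params, anchors, importances: dicts of arrays with matching keys; anchors hold
    parameter values after previous tasks, importances their estimated importance."""
    penalty = 0.0
    for k in params:
        # Penalize moving important parameters away from their post-old-task values.
        penalty += np.sum(importances[k] * (params[k] - anchors[k]) ** 2)
    return task_loss + c * penalty

In such schemes the importance weights are accumulated during training on earlier tasks (for example from gradients or parameter-path integrals), so that learning task 2 reuses directions in parameter space that matter little for task 1.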

References

    1. Fuster, J. The Prefrontal Cortex (Academic Press, Cambridge, 2015).
    2. Miller, E. K. & Cohen, J. D. An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci. 24, 167–202 (2001). - PubMed
    3. Wang, X.-J. in Principles of Frontal Lobe Function (eds Stuss, D. T. & Knight, R. T.) (Cambridge Univ. Press, New York, 2013).
    4. Wallis, J. D., Anderson, K. C. & Miller, E. K. Single neurons in prefrontal cortex encode abstract rules. Nature 411, 953–956 (2001). - PubMed
    5. Sakai, K. Task set and prefrontal cortex. Annu. Rev. Neurosci. 31, 219–245 (2008). - PubMed
