I am Zihao Wang, a student researcher in artificial intelligence.
- B.Eng. in Artificial Intelligence, Wuhan University (2022.09 - 2026.06)
- M.Phil. in Artificial Intelligence (Expected), The Chinese University of Hong Kong, Shenzhen (2026.09 - 2028.06)
My current interests center on Speech Language Models (SLMs), Representation Learning, Real-Time Interaction, and Agent Memory.
I first approached foundation models through LLM safety and multimodal safety, but safety has never been the edge of my curiosity. It is more like a probe: a way to see how data enters a model, becomes representation, and begins to flow through hidden space.
What draws me is this flow. Representations can be fragile; they reveal cracks under attack, alignment, or cross-modal transfer. Yet precisely because of that fragility, they can also be read, steered, and gently intervened on. In language models, I saw safety as a trace left by internal structure. In vision-language models, I began to ask how different modalities meet within the same hidden current.
Interaction is the other current I keep following. Today, even the strongest agents often meet us through text: a prompt, a response, another prompt, another response. This runtime loop is powerful, but thin. Speech makes the loop denser: interruption, overlap, pause, repair, rhythm, and presence. It turns interaction from exchanging messages with a fixed machine into sharing time with a system that must listen, wait, adjust, and respond.
But interaction is not only between humans and models. It can also happen between modalities, between agents, between a model and its external memory, and perhaps one day within the model itself. Most models are frozen after training; what changes during use is usually only the context around them. I am curious whether interaction can become more than context: whether it can reshape memory, update behavior, or create new internal conditions for reasoning.
Across these paths, I keep watching the same undercurrent: how representations flow, how modalities meet, and whether a fixed model can still learn to remember, respond, and change through interaction.
| Representation | Modality | Real-Time Interaction | External Memory |
|---|---|---|---|
| Hidden-space flow, interpretability, and steering. | Language, vision-language, speech, and whatever comes next. | Interruption, latency, turn-taking, and agents participating in time. | Context, tools, archives, and Memento-like traces for fixed models. |
| Memento Skill |
|---|
| Controlled external memory for agents: separate facts from beliefs, keep only high-gradient evidence hot, and recall archives only with a trigger. |
npx skills add waterdrop26651/Memento-skill
Do not let every note become a tattoo.
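
The three rules in the Memento Skill description (separate facts from beliefs, keep only high-gradient evidence hot, recall archives only with a trigger) can be pictured with a small sketch. This is not the skill's actual implementation: `Note`, `MementoStore`, the `salience` score standing in for "gradient", and the `hot_threshold` value are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    text: str
    kind: str        # "fact" or "belief" (hypothetical labels)
    salience: float  # assumed stand-in for "high-gradient" evidence


@dataclass
class MementoStore:
    hot_threshold: float = 0.5  # assumed cutoff, not taken from the skill
    hot: list = field(default_factory=list)
    archive: list = field(default_factory=list)

    def write(self, note: Note) -> None:
        """Keep salient notes hot; send everything else to the archive."""
        if note.kind not in ("fact", "belief"):
            raise ValueError(f"unknown kind: {note.kind!r}")
        target = self.hot if note.salience >= self.hot_threshold else self.archive
        target.append(note)

    def facts(self) -> list:
        """Hot notes recorded as facts."""
        return [n for n in self.hot if n.kind == "fact"]

    def beliefs(self) -> list:
        """Hot notes recorded as beliefs, kept apart from facts."""
        return [n for n in self.hot if n.kind == "belief"]

    def recall(self, trigger: str = "") -> list:
        """Return archived notes only when an explicit trigger matches."""
        if not trigger:
            return []  # no trigger, no archive access
        needle = trigger.lower()
        return [n for n in self.archive if needle in n.text.lower()]
```

In this sketch the archive stays silent unless `recall` is given a trigger string, which is one simple way to keep old notes from leaking into every context.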
| Spider Memory | MMSteer | Ouroboros Transformer |
|---|---|---|
| Graph-based associative memory for agent conversations, with weighted recall, reflection, and cold-layer archival. | Multimodal model safety research for Qwen2.5-VL-class systems, centered on safety representations and preventative steering. | A cyclic Transformer architecture experiment where independent blocks form a ring and information flows through repeated loops. |
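
As a rough illustration of the Spider Memory row above, the sketch below models associations as a weighted undirected graph, ranks recall by edge weight, and moves weak edges to a cold layer after decay. It is a minimal sketch under assumptions, not the project's code: the class name `AssociativeMemory`, the `cold_threshold` and `factor` parameters, and the decay rule are hypothetical, and the reflection step is left out.

```python
from collections import defaultdict


class AssociativeMemory:
    def __init__(self, cold_threshold: float = 0.2):
        # node -> {neighbor: weight}; edges are stored in both directions
        self.edges = defaultdict(dict)
        # archived edges, keyed by a sorted (a, b) pair
        self.cold = {}
        self.cold_threshold = cold_threshold  # assumed cutoff

    def associate(self, a: str, b: str, weight: float = 1.0) -> None:
        """Strengthen the link between two concepts."""
        new_weight = self.edges[a].get(b, 0.0) + weight
        self.edges[a][b] = new_weight
        self.edges[b][a] = new_weight

    def recall(self, cue: str, top_k: int = 3) -> list:
        """Return the strongest neighbors of a cue, heaviest first."""
        neighbors = self.edges.get(cue, {})
        ranked = sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))
        return [name for name, _ in ranked[:top_k]]

    def decay_and_archive(self, factor: float = 0.5) -> None:
        """Scale every weight by `factor`; archive edges that fall too low."""
        for a in list(self.edges):
            for b in list(self.edges[a]):
                weight = self.edges[a][b] * factor
                if weight < self.cold_threshold:
                    self.cold[tuple(sorted((a, b)))] = weight
                    del self.edges[a][b]
                else:
                    self.edges[a][b] = weight
```

Repeated `associate` calls make a link heavier, so frequently co-occurring concepts surface first in `recall`, while links that are never reinforced drift into `cold` after enough decay passes.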