Aren't the central examples of founders in AI Safety the people who founded Anthropic, OpenAI, and arguably Deepmind? Right after that, Mechanize comes to mind.
I am not fully sure what you mean by founders, but it seems to me that the best organizations were founded by people who also wrote a lot, and generally developed a good model of the problems in parallel to running an organization. Even this isn't a great predictor. I don't really know what is. It seems like generally working in the space is just super high variance.
To be clear, overall I do think many more people should found organizations, but the arguments in this post seem really quite weak. The issue is really not that otherwise we "can't scale the AI Safety field". If anything it goes the other way around! If you just want to scale the AI safety field, go work at one of the existing big organizations like Anthropic, or Deepmind, or Far Labs or whatever. They can consume tons of talent, and you can probably work with them on capturing more talent (of course, I think the consequences of doing so for many of those orgs would be quite bad, but you don't seem to think so).
Also, to expand some more on your coverage of counterarguments:
> If outreach funnels attract a large number of low-caliber talent to AI safety, we can enforce high standards for research grants and second-stage programs like ARENA and MATS.
No, you can't, because the large set of people you are trying to "filter out" will now take an adversarial stance towards you as they are not getting the resources they think they deserve from the field. This reduces the signal-to-noise ratio of almost all channels of talent evaluation, and in the worst case produces quite agentic groups of people actively trying to worsen the judgement of the field in order to gain entry.
I happen to have written a lot about this just this week: Paranoia: A Beginner's Guide, for example, has an explanation of lemons markets that applies straightforwardly to grant evaluations and program applications.
This is a thing that has happened all over the place; see, for example, the pressure on elite universities to drop admission standards and continue grade inflation, coming from the many people who are now part of the university system but wouldn't have been in previous decades.
Summoning adversaries, especially ones who have built an identity around membership in your group, should be done very carefully. See also Tell people as early as possible it's not going to work out, which I also happen to have published this week.
> subsequently, frontier AI companies grew 2-3x/year, apparently unconcerned by dilution.
Yes, and this was, of course, quite bad for the world? I don't know, maybe you are trying to model AI safety as some kind of race between AI Safety and the labs, but I think this largely fails to capture the state of the field.
Like, again, man, do you really think the world would be at all different in terms of our progress on safety if everyone who works on whatever applied safety work is supposedly so scalable had just never done that work? Kimi K2 is basically as aligned, and as likely to be safe when scaled to superintelligence, as whatever Anthropic is cooking up today. The most you can say is that safety researchers have been succeeding at producing evidence about the difficulty of alignment. But even that progress has been enormously set back by all the safety researchers working at the frontier labs that this "scaling of the field" keeps shoveling talent into, which has pressured huge numbers of people to drastically understate the difficulty of and risks from AI.
> Many successful AI safety founders work in research-heavy roles (e.g., Buck Shlegeris, Beth Barnes, Adam Gleave, Dan Hendrycks, Marius Hobbhahn, Owain Evans, Ben Garfinkel, Eliezer Yudkowsky, Nate Soares) and the status ladder seems to reward technical prestige over building infrastructure.
I mean, and many of them don't! CEA has not been led by people with research experience for many years, and man, I would give so much to have ended up in a world that went differently. IMO Open Phil's community building has deeply suffered from a lack of situational awareness and strategic understanding of AI, and so massively dropped the ball. I think MATS's biggest problem is roughly that approximately no one on the staff is a great researcher themselves, or even attempts to do the kind of work you try to cultivate, which makes it much harder for you to steer the program.
Like, I am again all in favor of people starting more organizations, but man, we just need to understand that we don't have the forces of the market on our side. This means the premium on having organizations steered by people who have their own internal feedback loops and their own strategic map of the situation, which requires actively engaging with the core problems of the field, is much greater than it is in YC and the open market. The default outcome if you encourage young people to start an org in "AI Safety" is to just end up with someone making a bunch of vaguely safety-adjacent RL environments that get sold to big labs, which my guess is makes things largely worse (I am not confident in this, but I am pretty confident it doesn't make things much better).
And so what I am most excited about is people who do have good strategic takes starting organizations. To demonstrate that they have those takes, and to develop the necessary skills, they need to write and publish publicly (or at least receive mentorship for a substantial period of time from someone who does).