Data science, teaching, and other stuff.

Barbed wire

Taylor Sheridan, 1883:

Come across any barbed wire yet? It's twisted steel wire with little barbs woven into it. Sharp as a knife's tip. It is the one fence cattle will not push through. They're going to carve this country into little rectangles. Then fence them off. And just like that, two of our great pleasures are gone.

Reviel Netz, Barbed Wire: An Ecology of Modernity:

Define, on the two-dimensional surface of the earth, lines across which motion is to be prevented, and you have one of the key themes of history. With a closed line (i.e., a curve enclosing a figure), and the prevention of motion from outside the line to its inside, you derive the idea of property. With the same line, and the prevention of motion from inside to outside, you derive the idea of prison. With an open line (i.e., a curve that does not enclose a figure), and the prevention of motion in either direction, you derive the idea of border. Properties, prisons, borders: it is through the prevention of motion that space enters history.

Parallel confounding in identification of place effects

Suppose I want to know whether some characteristic of a place $X_p$ affects an individual outcome $Y_i$. There are two challenges to inference.

First, people sort into places. If the kind of person who moves to a high-$X$ place would have had different outcomes regardless, then comparing across places confounds $X$ with individual selection. This is the standard worry in observational causal inference: place assignment is endogenous.

Second, places are bundles. A neighborhood with high $X$ also has particular schools, crime rates, pollution levels, social networks, and dozens of other attributes that co-vary with $X$ across places. Even if I solve the selection problem perfectly, I still cannot tell whether it is $X$ or something bundled with $X$ that drives the outcome.
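
To fix ideas, here is a minimal linear data-generating process that exhibits both problems at once. The linearity and the sorting rule are illustrative assumptions, not part of the argument:

$$Y_i = \beta X_{p(i)} + \gamma^\top \mathbf{Z}_{p(i)} + \delta^\top \mathbf{W}_i + \varepsilon_i, \qquad p(i) = g(\mathbf{W}_i)$$

Sorting makes $X_{p(i)}$ correlated with $\mathbf{W}_i$, and bundling makes it correlated with $\mathbf{Z}_{p(i)}$, so a naive regression of $Y_i$ on $X_{p(i)}$ recovers $\beta$ plus two bias terms.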

The two paths

Call the first problem Path 1 (individual selection into place) and the second Path 2 (overlapping place characteristics). In DAG terms:

Path 1: $X_p \leftarrow P_i \leftarrow \mathbf{W}_i \rightarrow Y_i$

Individuals with pre-treatment characteristics $\mathbf{W}_i$ sort into places non-randomly, and $X_p$ inherits a spurious correlation with $Y_i$ through the sorting process.

Path 2: $X_p \leftarrow P_i \rightarrow \mathbf{Z}_p \rightarrow Y_i$

Place jointly determines $X_p$ and every other place characteristic $\mathbf{Z}_p$. Variation in $X$ across places is bundled with variation in $\mathbf{Z}$.

Two-stage confounding DAG

Identifying the overall effect of place only requires closing Path 1. Identifying the effect of a single characteristic of place without closing Path 2 is technically possible, but only if the characteristic of interest is the only place attribute that can affect the outcome, or if it is uncorrelated with every other overlapping characteristic that influences the outcome. More commonly, identifying a characteristic effect requires closing both paths.

The typical case is that both paths are open, which makes causal claims about place characteristics doubly hard. Here is how four common identification strategies fare.
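
A quick simulation of the DGP above makes the double confounding concrete. Everything here, the sorting rule, the coefficients, the noise scales, is an illustrative assumption; the point is only that the naive estimate is contaminated by both paths, and that controlling for $\mathbf{W}_i$ closes Path 1 while leaving Path 2 open:

```python
import numpy as np

rng = np.random.default_rng(0)
n_places, n_people = 200, 20_000

# Place level: X is bundled with Z (Path 2).
z = rng.normal(size=n_places)
x = 0.8 * z + rng.normal(scale=0.6, size=n_places)

# Individual level: W drives both sorting and outcomes (Path 1).
w = rng.normal(size=n_people)
noisy_pref = w + rng.normal(size=n_people)

# Illustrative sorting rule: preference ranks map onto place-X ranks.
place_order = np.argsort(x)
pref_rank = np.argsort(np.argsort(noisy_pref))
p = place_order[pref_rank * n_places // n_people]

beta, gamma, delta = 1.0, 1.0, 1.0
y = beta * x[p] + gamma * z[p] + delta * w + rng.normal(size=n_people)

# Naive regression of Y on X: both paths open.
X1 = np.column_stack([np.ones(n_people), x[p]])
print(np.linalg.lstsq(X1, y, rcond=None)[0][1])  # well above beta = 1

# Controlling for W closes Path 1 but not Path 2.
X2 = np.column_stack([np.ones(n_people), x[p], w])
print(np.linalg.lstsq(X2, y, rcond=None)[0][1])  # ~1.8: beta plus bundled-Z bias
```

With these numbers the naive coefficient lands far above $\beta = 1$, and the $W$-adjusted coefficient settles near $1.8$: the residual $0.8$ is pure Path 2 bias from the bundled $Z$.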

Randomized relocation

The Moving to Opportunity experiment randomly assigned housing vouchers, allowing some families to move from high-poverty to lower-poverty neighborhoods. Randomization closes Path 1 completely — no functional form assumptions, no conditional independence, no parallel trends.

But the experiment randomizes access to a bundle of neighborhood characteristics, not to $X$ specifically. Families who used their voucher moved to places that differed from their origin neighborhoods in poverty, school quality, crime, pollution, and much else, all at once. The estimated treatment effect captures the composite:

$$\tau^{\text{LATE}} = f\big(\Delta X, \Delta \mathbf{Z}\big)$$

Path 2 remains open. The experiment tells us whether relocation influenced outcomes, but not which characteristics of place drove that influence.

Randomized relocation DAG
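
A sketch of why randomization does not unbundle, under the same illustrative linear assumptions. Full voucher take-up here is a simplification; actual take-up was partial, which is why the quantity above is a LATE:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
beta, gamma = 1.0, 1.0
dx, dz = 1.0, 1.0   # how much X and the bundled Z change upon moving

w = rng.normal(size=n)                # individual traits
voucher = rng.integers(0, 2, size=n)  # randomized, so independent of w
moved = voucher                       # full take-up, for simplicity

y = (beta * dx + gamma * dz) * moved + 0.5 * w + rng.normal(size=n)

# Randomization closes Path 1: the difference in means is unbiased...
print(y[moved == 1].mean() - y[moved == 0].mean())  # ~2.0

# ...but it estimates beta*dx + gamma*dz as a single number. Nothing in
# the data separates the X component from the bundled Z component.
```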

Childhood mover design

The childhood mover design compares children who move between the same origin and destination at different ages. Those who move younger spend more time in the destination relative to the origin, and thus may be more influenced by the destination than children who move later in childhood. The identifying assumption is that age at move is as-if randomly assigned. Under that assumption, the design identifies $\mu_p$, the causal exposure effect of place $p$ per year of childhood.

Path 1 is closed in this design, but $\hat{\mu}_p$ captures only the overall effect of place: it is estimated by regressing mover outcomes on the outcomes of their never-mover peers and checking whether convergence is stronger for those who move earlier. Researchers could regress $\hat{\mu}_p$ on $X_p$ to try to isolate the role of a specific characteristic, but that regression is subject to Path 2 confounding: $\text{Cov}(X_p, \mathbf{Z}_p) \neq 0$ across places.

Childhood mover DAG
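
A stripped-down sketch of the exposure regression, with the convergence machinery collapsed into a single per-year effect. This simplifies the actual design, which benchmarks movers against permanent residents; the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
mu = 0.05   # true per-year exposure effect of the destination

# Identifying assumption: age at move is as-if random.
age_at_move = rng.integers(2, 19, size=n)
years_exposed = 18 - age_at_move

y = mu * years_exposed + rng.normal(scale=0.5, size=n)

# Regress outcomes on years of exposure to recover mu.
X = np.column_stack([np.ones(n), years_exposed])
print(np.linalg.lstsq(X, y, rcond=None)[0][1])  # ~0.05
```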

Difference-in-differences

Difference-in-differences exploits within-place changes in $X$ over time, using individual or place fixed effects to absorb time-invariant confounders. Fixed effects account for the time-invariant characteristics that led people to live where they do (Path 1) and for the overlapping characteristics of place that do not change over time (Path 2). But confounding remains from co-trending characteristics.

Thus DiD addresses both paths, but only partially. The parallel trends assumption has to do double duty, covering both selection on trends in potential outcomes and the possibility that other place characteristics co-move with $X$.

Difference-in-differences DAG
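
A sketch of the double duty failing, assuming a bundled characteristic that co-moves with $X$ in treated places. All magnitudes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n_places = 200
beta, gamma = 1.0, 1.0

alpha = rng.normal(size=n_places)            # place fixed effects
treated = rng.integers(0, 2, size=n_places)  # places where X rises in period 1
tau = np.array([0.0, 0.3])                   # common period effect

x = np.outer(treated, [0.0, 1.0])            # X changes only in treated places
z = np.outer(treated, [0.0, 0.5])            # Z co-trends with X: the violation

y = alpha[:, None] + tau[None, :] + beta * x + gamma * z \
    + rng.normal(scale=0.1, size=(n_places, 2))

# DiD: change in treated places minus change in untreated places.
dy = y[:, 1] - y[:, 0]
print(dy[treated == 1].mean() - dy[treated == 0].mean())  # ~1.5, not beta = 1
```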

Spatial regression discontinuity

A spatial RD exploits a boundary where $X$ changes discontinuously. Individuals just on either side of the boundary are comparable so long as the continuity assumption holds; if it does, Path 1 is closed (though we should perhaps be skeptical of meaningful boundaries having only meaningless sorting around them). And if $X$ jumps at the boundary while other place characteristics $\mathbf{Z}_p$ do not, the local variation in $X$ is also free of Path 2 confounding.

Spatial regression discontinuity DAG
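
A sketch of the boundary comparison, assuming $X$ jumps at the border while $\mathbf{Z}_p$ varies smoothly through it, which is exactly the assumption one has to defend:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
beta = 1.0

dist = rng.uniform(-1, 1, size=n)  # signed distance to the boundary
x = (dist > 0).astype(float)       # X jumps discontinuously at the border
z = 0.5 * dist                     # Z varies smoothly: no jump (key assumption)
y = beta * x + z + rng.normal(scale=0.3, size=n)

# Compare mean outcomes within a narrow bandwidth of the boundary.
h = 0.05
left = y[(dist > -h) & (dist < 0)].mean()
right = y[(dist > 0) & (dist < h)].mean()
print(right - left)  # ~beta: the smooth Z nets out across the border
```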

Path 2 as a construct validity problem

In many settings $X_p$ does not vary independently of $\mathbf{Z}_p$ even in principle, so the effect of a specific place characteristic may not be point-identified without functional form assumptions. In practice, Path 2 is a construct validity problem. A design gives you credible variation in something, and whether that something corresponds to the theoretical construct you care about is a separate claim — like a survey experiment that aims to prime one emotion but may prime others alongside it. When clean identification is not available, researchers fall back to inference to the best explanation, arguing for the characteristic-specific effect through abductive reasoning.

How the Claude/Codex diary works

A year ago, the main limitation of these models seemed to be memory. It was like working with a brilliant computer scientist who, for the life of them, could not remember what I was working on. This has gotten much better by default in the most expensive models, to the point where ChatGPT Pro will bring up past conversations from other projects when discussing something entirely different, drawing connections and helping me place a question in the broader context of my research agenda. But as before, the infrastructure around these models is the real key to getting the most out of the intelligence embedded in them. There is a great deal that can be done with scaffolding, particularly having Claude or Codex keep detailed notes on what you are doing and catalogue decisions and preferences, to make memory better and help the models stay on track.

So we do not have to rely on conversational recall to make Codex productive, and we can preserve memory across sessions and different computing environments by keeping track of what we do in durable files. In practice, that means storing context in repository files that Codex is told to read at the start of work, especially AGENTS.md, KNOWLEDGE_BASE.md, and MEMORY.md. KNOWLEDGE_BASE.md is for domain truth that should remain true across sessions: project definitions, architecture, naming rules, and recurring domain facts. In my repos, MEMORY.md is usually read near the start of any non-trivial task because the local AGENTS.md files explicitly tell Codex to read it before proceeding. So the file is not magical application memory; it works because the repo instructions repeatedly promote it into the model context at the beginning of work. Claude's design anticipates all this, and its memory is much better and more automatic than Codex's.
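
Concretely, the promotion is nothing fancier than a standing instruction in the repo. A hypothetical AGENTS.md excerpt, not the actual wording from my repos, but the general shape:

```markdown
## Session startup (hypothetical AGENTS.md excerpt)

Before starting any non-trivial task:

1. Read KNOWLEDGE_BASE.md for project definitions, architecture, and naming rules.
2. Read MEMORY.md for decisions and preferences recorded in earlier sessions.
3. When finished, append a dated entry to the run log describing what changed and why.
```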

As I was thinking about this, I decided that Claude and Codex should keep even more detailed notes on what I ask them to do, so I can refer back to them, since my own memory is also not perfect. So I set up instructions in each repo to take more copious notes, storing them in run logs. I then asked Codex to help me set up functionality where it does a regular sweep of each repo's logs, drafts a bulleted list of a sample of things it helped me do over the last few days, and turns that list into a Codex diary blog post.
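
A minimal sketch of the sweep step, assuming each repo keeps dated markdown run logs under logs/. The paths, filenames, and prompt here are hypothetical stand-ins for whatever Codex and I actually settled on:

```python
from pathlib import Path
from datetime import date, timedelta

# Hypothetical layout: each repo keeps dated run logs like logs/2026-04-10.md.
REPOS = [Path.home() / "projects" / name for name in ("repo-a", "repo-b")]
CUTOFF = date.today() - timedelta(days=3)

entries = []
for repo in REPOS:
    for log in sorted((repo / "logs").glob("*.md")):
        if date.fromisoformat(log.stem) >= CUTOFF:
            entries.append(f"## {repo.name} / {log.stem}\n\n" + log.read_text())

# Collect everything into one prompt for the model to draft the diary post.
prompt = "Summarize these run logs as a bulleted diary blog post:\n\n" + "\n\n".join(entries)
Path("diary_draft_input.md").write_text(prompt)
```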

It strikes me as an interesting example of both the high-level possibilities here, using better memory retention to increase operational capacity, and the low-level ones, like using the same system to write a blog post that explains what these tools can do. In all this I am basically an amateur, and more efficient or more elegant ways of structuring these models surely exist, or will soon. Mostly I just ask Claude/Codex whether something is possible and then have it help me design it.

Claude/Codex diary - April 10, 2026

AI summary of today's Claude and Codex work.

Claude

  • Ported a structured AI workflow to thirteen research repos at once.
  • Each project now has specialist review agents, quality gates, and session management for Claude Code.
  • Added Claude activity logging alongside the existing Codex diary system.
  • The nightly blog digest now groups work by which AI tool did it.

Codex

  • Added a first deterministic repair for disconnected observed plans in a political simulation project.
  • Focused that repair on the main Utah failure case rather than widening the automatic policy.
  • Mapped the next hardening pass so the repair path is more reproducible and easier to validate.
  • Consolidated a replication project and verified reproducibility.
  • Published a teaching post about AI on the blog.

Some thoughts on teaching AI

"My words make me happy. Theirs are just words."

  • Extraction of intelligence is the key thing: how to develop the instincts, infrastructure, and persistence to get really great things out of the model.
  • Understanding how computers work, and having some familiarity with Git, makes it much easier to set up that infrastructure. It is less about comprehensive know-how, because you can ask the model how to do something, and more about having enough foundation to think creatively about what is possible.
  • Teaching people what is possible is easier than teaching people how to do specific things.
  • The extraction part is really fun.
  • This is an incredible gift of intelligence and time. What will students, and professors, do with it? Can lectures become a lot more exciting? Can classes become more hands-on and more innovative?
  • If a student uses a chatbot to one-shot an essay, or Claude Code or Codex to one-shot a problem set, that is tough, and most will do this. But if a student iterates and iterates until the final product is pristine, that is learning.
  • The social elements of academia should be invested in even more. We are the primary consumers of our work and ideas, so we need more forums to share the new awesome things the models are helping us create.
  • The bar has to get a lot higher. Who can make something great with the model? This clashes with Q-score incentives and grade inflation.
  • Honors theses need to become amazing.
  • Right now, someone with a lot of experience using the top models to build things can get dropped into almost any company and become a huge value add. This will not always be the case as labor skills converge and as models become more autonomous.
  • Force students to write in class. Use handwritten exams. Require oral presentations. Make them say out loud what the model does and how it does it. Have them explain it to you and to their peers, even if the model taught them what to say. Out-of-class assignments are for the experience, not the assessment.
  • Find hobbies and creative outputs where you write.
  • But a lot of this will end up being on them. The ones who want to learn will learn a lot. Absolute Superman knows what's up.1

Superman page about choosing your own path

  • Pretty clear at this point that we do not have the answers yet.
  1. This will not scale.