Data science, teaching, and other stuff.

Reflections on ~2 months with Claude Code/Codex

Come on in, the water is warm and we have plenty of tokens.

  • Integrated essentially my entire research workflow, except for actual writing, with Claude Code and Codex: chatbots as sounding boards for developing ideas, methodological questions, data sources, and background literature; Claude Code/Codex for data collection, coding, package design, code review, formatting, replication verification, and so forth.
  • Truly autonomous research still feels far off, but not due to lack of intelligence, just lack of memory/sufficient context window for complex projects. But I have been optimizing for human-with-agents workflow, with autonomous stuff being more experimental.
  • Working with agents is quite immersive and fun.
  • Moving back and forth with both models is pretty seamless, scaffolding helps a lot, especially with Codex.
  • Claude Code's autonomy is compelling, at its best when reorganizing an entire codebase or working across multiple sessions on long data collection projects.
  • Rate limits on Claude are real, I hit the 5 hour one regularly. Good to have Codex as well. Been using Pro/Max 5x for both.
  • Some of that rate limit hitting is due to being greedy and always wanting to use the best Claude model.
  • General workflow is work on 1 important thing at a time with Claude to preserve tokens, fan out with Codex.
  • Git was trending up in use on projects, but now is essential.
  • Codex has essentially replaced RStudio, GitHub Desktop, PyCharm.
  • ChatGPT 5.5 is noticeably better at autonomous coding than 5.4 or 5.3 Codex and -- it admitted this to me when I asked -- more Claude-like by design.
  • Adding tables and figures to (over-long) appendices used to be quite the chore, now either agent can directly throw these into .qmd or .tex files, even edit Overleaf.
  • Both models work well with batch job cluster systems, with some friction for Codex since its shell default constantly fails there before it jumps local.
  • This may be habit, but I am much happier with the answers I get from the ChatGPT Pro chatbot than from Claude. But both are good overall. Claude gets over-exuberant.
  • Using the models to write prompts (sometimes for each other) is a major efficiency gain. So the input is my detailed but more scattered request, then a detailed prompt from the LLM, input the prompt, get a plan, modify plan, approve plan, execute. This is most important for more complex database maintenance or data collection.
  • Everything is cleaner, faster, easier to work with. Smarter and better.
  • The design of the interface is well done. You feel like a conductor orchestrating the agents. Ironic but agentic automation feels involving.
  • The productivity force multiplier when not just you but also your co-authors are embracing agentic capabilities is intense.
  • It's interesting how much of the competitive advantage of different models, Claude specifically, is the design of how they deliver their intelligence, not just the actual intelligence they deliver.
  • I had Claude write a prompt to ask ChatGPT to give it a rundown on who I am, what I do, and what I typically use ChatGPT for. That was a weird experience.
  • Claude revised the Codex documents in all of my repos and made Codex run better in terms of memory, adversarial review, multiple agents, etc.
  • Codex is a great pre-doc, Claude is a great post-doc.
  • Some projects require restricted environments or locations where AI is not allowed. Makes for a nice respite from the rapid takeoff world. Slow research soothes the soul.
  • This has been quite energizing. Ideas propagate faster. I can try things that would otherwise have taken a very long time to learn.

New realignments, new political geography?

Prepared a piece for the forthcoming Handbook of American Political Geography. Some tidbits deserve emphasis:

Researchers and policymakers must hold in their minds two truths: that geographic polarization, across urban-rural divisions, is close to as high as it has ever been in the country’s history, and that Americans are not completely isolated geographically from people who disagree with them politically.

In a 50-50 country, segregation would have to be quite extreme to truly bring cross-partisan proximity to zero. So despite areas of high isolation, many voters do live in places with large numbers of out-partisans, and as we move beyond the neighborhood level to larger geographies we see more cases of mixed partisan composition.
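The arithmetic behind that claim can be sketched with the standard exposure index: the average out-partisan share in the precinct of a typical Democrat. The precinct counts below are hypothetical, chosen only so that each electorate is split 50-50 overall.

```python
def out_partisan_exposure(precincts):
    """Average Republican share in the precinct of a typical Democrat.

    `precincts` is a list of (democrats, republicans) counts.
    """
    total_dem = sum(d for d, _ in precincts)
    return sum((d / total_dem) * (r / (d + r)) for d, r in precincts if d + r > 0)


# Three hypothetical 50-50 electorates, each with two equally sized precincts.
evenly_mixed = [(500, 500), (500, 500)]
heavily_sorted = [(800, 200), (200, 800)]
fully_segregated = [(1000, 0), (0, 1000)]

for name, electorate in [
    ("evenly mixed", evenly_mixed),
    ("heavily sorted", heavily_sorted),
    ("fully segregated", fully_segregated),
]:
    print(f"{name}: {out_partisan_exposure(electorate):.2f}")
```

Even the 80-20 precincts leave the typical Democrat in a place that is about one-third Republican. Exposure only reaches zero when every precinct is completely one-party.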

Still, the urban-rural divide is strong, and the data suggest that partisan segregation increased rapidly from around the 1970s through the 2000s and continued to increase more modestly up through 2020.

But as the Handbook piece demonstrates, recent electoral data suggest a plateauing or perhaps a reversal of these long-term trends. Whether this persists or is a blip in a steadier time series will have to wait for more time to pass and more data to accumulate. But the evidence on sources of political segregation, both recent and more long-term, points to the importance of political realignments changing the American map without people sorting residentially. So what political realignments of the Trump era could even out the American map? From the piece, on college education and racial realignments:

This demographic realignment challenges urban-suburban polarization, previously falling along the income gradient, effectively liberalizing the suburbs relative to cities, reducing polarization along this dimension. Republican gains with minority voters can erode Democratic strong points and upend urban-suburban divides.

The most interesting demographic here to me is one where voters are realigning but also where the mass of the distribution is imbalanced:

However, the American electoral map is headed towards an age cliff. The distribution of age in the electorate is such that the largest voting bloc is at or approaching senior status. The Baby Boomer generation dwarfs younger generations. Thus, a large aging group of voters anchors the political geography of the electorate, in that they are unlikely to move because of their age and are either set in their ways or trending Republican in a manner that increases geographic polarization. But as this generation dies off, they are replaced by a younger, more mobile electorate. This distributional shift could produce a shock to levels and trends of geographic polarization.

So the current American political map is held down, to an extent, by an immobile aging electorate, declining residential mobility more generally, and -- until maybe recently -- polarized politics with lower rates of party switching than in past eras.

Barbed wire

Taylor Sheridan, 1883:

Come across any barbed wire yet? It's twisted steel wire with little barbs woven into it. Sharp as a knife's tip. It is the one fence cattle will not push through. They're going to carve this country into little rectangles. Then fence them off. And just like that, two of our great pleasures are gone.

Reviel Netz, Barbed Wire - An Ecology of Modernity:

Define, on the two-dimensional surface of the earth, lines across which motion is to be prevented, and you have one of the key themes of history. With a closed line (i.e., a curve enclosing a figure), and the prevention of motion from outside the line to its inside, you derive the idea of property. With the same line, and the prevention of motion from inside to outside, you derive the idea of prison. With an open line (i.e., a curve that does not enclose a figure), and the prevention of motion in either direction, you derive the idea of border. Properties, prisons, borders: it is through the prevention of motion that space enters history.

Parallel confounding in identification of place effects

Suppose I want to know whether some characteristic of a place $X_p$ affects an individual outcome $Y_i$. There are two challenges to inference.

First, people sort into places. If the kind of person who moves to a high-$X$ place would have had different outcomes regardless, then comparing across places confounds $X$ with individual selection. This is the standard worry in observational causal inference: place assignment is endogenous.

Second, places are bundles. A neighborhood with high $X$ also has particular schools, crime rates, pollution levels, social networks, and dozens of other attributes that co-vary with $X$ across places. Even if I solve the selection problem perfectly, I still cannot tell whether it is $X$ or something bundled with $X$ that drives the outcome.

The two paths

Call the first problem Path 1 (individual selection into place) and the second Path 2 (overlapping place characteristics). In DAG terms:

Path 1: $X_p \leftarrow P_i \leftarrow \mathbf{W}_i \rightarrow Y_i$

Individuals with pre-treatment characteristics $\mathbf{W}_i$ sort into places non-randomly, and $X_p$ inherits a spurious correlation with $Y_i$ through the sorting process.

Path 2: $X_p \leftarrow P_i \rightarrow \mathbf{Z}_p \rightarrow Y_i$

Place jointly determines $X_p$ and every other place characteristic $\mathbf{Z}_p$. Variation in $X$ across places is bundled with variation in $\mathbf{Z}$.

Two-stage confounding DAG

Identifying the overall effect of place only requires closing Path 1. Identifying the effect of a single characteristic of place is technically possible with only Path 1 closed, but only if either the characteristic of interest is the only thing about place that can affect the outcome or it is uncorrelated with all other overlapping characteristics that influence the outcome. More commonly, identifying the characteristic effect requires closing both paths.
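A small simulation can make the two paths concrete. Everything here is an illustrative assumption: the effect sizes, the 0.8 link between $X$ and a single bundled $Z$, and the toy sorting rule in which high-$W$ people pick the higher-$X$ of two candidate places.

```python
import random

rng = random.Random(0)
N_PLACES, N_PEOPLE = 200, 5000
BETA_X, GAMMA_Z = 1.0, 2.0  # hypothetical true effects of X and of the bundled Z

# Places are bundles: Z co-varies with X across places (Path 2).
X = [rng.gauss(0, 1) for _ in range(N_PLACES)]
Z = [0.8 * x + rng.gauss(0, 0.6) for x in X]


def ols_slope(x, y):
    """Slope from a bivariate least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def naive_slope(sorting):
    """Regress Y_i on X_p, with or without individual sorting (Path 1)."""
    xs, ys = [], []
    for _ in range(N_PEOPLE):
        w = rng.gauss(0, 1)  # pre-treatment trait W_i that also raises Y_i
        a, b = rng.randrange(N_PLACES), rng.randrange(N_PLACES)
        if sorting:
            # High-W people pick the higher-X of two candidate places.
            hi, lo = (a, b) if X[a] >= X[b] else (b, a)
            p = hi if w > 0 else lo
        else:
            p = a  # random assignment closes Path 1
        xs.append(X[p])
        ys.append(BETA_X * X[p] + GAMMA_Z * Z[p] + w + rng.gauss(0, 1))
    return ols_slope(xs, ys)


slope_sorted = naive_slope(sorting=True)
slope_random = naive_slope(sorting=False)
print(f"true effect of X:                {BETA_X:.2f}")
print(f"naive slope, sorting + bundling: {slope_sorted:.2f}")
print(f"naive slope, random assignment:  {slope_random:.2f}")
```

Random assignment removes the sorting bias, but the slope should still land near $1 + 2 \times 0.8 = 2.6$ rather than 1, because $Z$ rides along with $X$.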

The typical case is that both paths are open, which makes causal claims about place characteristics doubly hard. Here is how four common identification strategies fare.

Randomized relocation

The Moving to Opportunity experiment randomly assigned housing vouchers, allowing some families to move from high-poverty to lower-poverty neighborhoods. Randomization closes Path 1 completely — no functional form assumptions, no conditional independence, no parallel trends.

But the experiment randomizes access to a bundle of neighborhood characteristics, not to $X$ specifically. Families who used their voucher moved to places that differed from their origin neighborhoods in poverty, school quality, crime, pollution, and much else, all at once. The estimated treatment effect captures the composite:

$$\tau^{\text{LATE}} = f\big(\Delta X, \Delta \mathbf{Z}\big)$$

Path 2 remains open. The experiment tells us whether relocation influenced outcomes, but not which characteristics of place produced that influence.
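Here is a minimal sketch of why the experimental estimate is a composite. It assumes a linear form for $f$, a homogeneous effect of moving, and made-up values for the effects and for the changes in $X$ and $Z$. None of it comes from the actual Moving to Opportunity data.

```python
import random

rng = random.Random(1)
N = 20000
BETA_X, GAMMA_Z = 1.0, 2.0   # hypothetical effects of X and of the bundled Z
DELTA_X, DELTA_Z = 1.0, 0.5  # hypothetical changes in X and Z when a family moves

y_by_arm = {True: [], False: []}
moved_by_arm = {True: [], False: []}
for _ in range(N):
    w = rng.gauss(0, 1)                                # family trait affecting outcomes
    offered = rng.random() < 0.5                       # randomized voucher offer
    takes_up = rng.random() < (0.7 if w > 0 else 0.4)  # take-up is selective
    moved = offered and takes_up
    y = w + rng.gauss(0, 1)
    if moved:
        y += BETA_X * DELTA_X + GAMMA_Z * DELTA_Z      # the whole bundle changes at once
    y_by_arm[offered].append(y)
    moved_by_arm[offered].append(1.0 if moved else 0.0)


def mean(values):
    return sum(values) / len(values)


# Wald estimator: intent-to-treat effect scaled by the difference in move rates.
itt = mean(y_by_arm[True]) - mean(y_by_arm[False])
take_up = mean(moved_by_arm[True]) - mean(moved_by_arm[False])
late = itt / take_up

print(f"effect of X alone:   {BETA_X * DELTA_X:.2f}")
print(f"composite effect:    {BETA_X * DELTA_X + GAMMA_Z * DELTA_Z:.2f}")
print(f"experimental LATE:   {late:.2f}")
```

Randomizing the offer handles the selective take-up, so the estimate should recover the composite. Nothing in the design separates the $X$ part from the $Z$ part.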

Randomized relocation DAG

Childhood mover design

The childhood mover design compares children who move between the same origin and destination at different ages. Those who move younger spend more time in the destination relative to the origin and thus may be more influenced by the destination than children who moved later in childhood. The identifying assumption here is that age at move is as-if randomly assigned. This assumption identifies $\mu_p$, the causal exposure effect of place $p$ per year of childhood.

Path 1 is closed in this scenario, but $\hat{\mu}_p$ captures just the overall effect of place, as it is estimated by regressing mover outcomes on the outcomes of their never-mover peers and seeing whether convergence is higher for those who move earlier. Researchers could regress $\hat{\mu}_p$ on $X_p$ to try to isolate the role of a specific characteristic, but that regression is subject to Path 2 confounding: $\text{Cov}(X_p, \mathbf{Z}_p) \neq 0$ across places.
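The problem with that second-stage regression is ordinary omitted variable bias. As a sketch, suppose the exposure effect is linear in the place characteristics. The linear form, the coefficients $\beta$ and $\boldsymbol{\gamma}$, and the error $\varepsilon_p$ are illustrative assumptions, not something the mover design delivers:

$$\mu_p = \alpha + \beta X_p + \boldsymbol{\gamma}'\mathbf{Z}_p + \varepsilon_p, \qquad \text{Cov}(X_p, \varepsilon_p) = 0$$

Regressing $\mu_p$ on $X_p$ alone then recovers

$$\text{plim}\; \hat{\beta} = \beta + \boldsymbol{\gamma}'\,\frac{\text{Cov}(\mathbf{Z}_p, X_p)}{\text{Var}(X_p)}$$

The second term vanishes only if $\boldsymbol{\gamma} = \mathbf{0}$ (nothing else about place matters) or $\text{Cov}(\mathbf{Z}_p, X_p) = \mathbf{0}$ (nothing else is bundled with $X$).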

Childhood mover DAG

Difference-in-differences

Difference-in-differences exploits within-place changes in $X$ over time, using individual or place fixed effects to absorb time-invariant confounders. Fixed effects account for time-invariant characteristics that led people to live where they do (Path 1) and for overlapping characteristics of place that do not change over time (Path 2). But confounding remains from co-trending characteristics.

Thus, DiD addresses both paths, but only partially. The parallel trends assumption has to do double duty, covering both selection on trends in potential outcomes and the possibility that other place characteristics co-move with $X$.
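A two-period toy example shows what the fixed effects do and do not buy. The setup is hypothetical: treated places differ in time-invariant ways that difference out, but in one scenario a bundled characteristic $Z$ rises alongside $X$.

```python
import random

rng = random.Random(2)
N_PLACES = 2000
BETA_X, GAMMA_Z = 1.0, 2.0  # hypothetical effects of X and of the bundled Z
DELTA_X = 1.0               # change in X in treated places between periods
COMMON_TREND = 0.5          # shock shared by all places in the second period


def did_estimate(delta_z):
    """Two-group, two-period DiD when Z also changes by delta_z in treated places."""
    changes = {True: [], False: []}
    for _ in range(N_PLACES):
        treated = rng.random() < 0.5
        # Time-invariant place effect, correlated with treatment status.
        place_effect = rng.gauss(0, 3) + (2.0 if treated else 0.0)
        y_pre = place_effect + rng.gauss(0, 1)
        y_post = place_effect + COMMON_TREND + rng.gauss(0, 1)
        if treated:
            y_post += BETA_X * DELTA_X + GAMMA_Z * delta_z
        changes[treated].append(y_post - y_pre)  # differencing removes place_effect
    mean = lambda v: sum(v) / len(v)
    return mean(changes[True]) - mean(changes[False])


did_clean = did_estimate(delta_z=0.0)       # only X changes in treated places
did_cotrending = did_estimate(delta_z=0.5)  # Z co-moves with X

print(f"true effect of the change in X: {BETA_X * DELTA_X:.2f}")
print(f"DiD, Z constant:                {did_clean:.2f}")
print(f"DiD, Z co-trending:             {did_cotrending:.2f}")
```

The large fixed differences between places wash out in both scenarios. The co-trending $Z$ does not, and it loads onto the DiD estimate in full.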

Difference-in-differences DAG

Spatial regression discontinuity

A spatial RD exploits a boundary where $X$ changes discontinuously. Individuals just on either side of the boundary are comparable so long as the continuity assumption holds, in which case Path 1 is closed, though we should perhaps be skeptical that meaningful boundaries induce only meaningless sorting around them. And if $X$ jumps at the boundary but other place characteristics $\mathbf{Z}_p$ do not, then the local variation in $X$ is free of Path 2 confounding.
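The "if $\mathbf{Z}_p$ does not jump" condition carries the weight here, and a rough simulation shows why. It assumes a one-dimensional signed distance to the boundary, a single $Z$, and hypothetical effect sizes, with a local linear fit on each side inside a fixed bandwidth.

```python
import random

rng = random.Random(3)
BETA_X, GAMMA_Z = 1.0, 2.0  # hypothetical effects of X and of the bundled Z
BANDWIDTH = 0.25            # only use units this close to the boundary


def value_at_boundary(d, y):
    """Fit y = a + b*d by least squares and return a, the fitted value at d = 0."""
    md, my = sum(d) / len(d), sum(y) / len(y)
    b = sum((u - md) * (v - my) for u, v in zip(d, y)) / sum((u - md) ** 2 for u in d)
    return my - b * md


def rd_estimate(z_jump, n=20000):
    """Local linear RD estimate when Z jumps by z_jump at the boundary."""
    left_d, left_y, right_d, right_y = [], [], [], []
    for _ in range(n):
        d = rng.uniform(-1, 1)  # signed distance to the boundary
        if abs(d) > BANDWIDTH:
            continue
        inside = d >= 0
        x = 1.0 if inside else 0.0                   # X jumps at the boundary
        z = 0.5 * d + (z_jump if inside else 0.0)    # Z is smooth in d, plus an optional jump
        y = BETA_X * x + GAMMA_Z * z + 0.3 * d + rng.gauss(0, 1)
        if inside:
            right_d.append(d)
            right_y.append(y)
        else:
            left_d.append(d)
            left_y.append(y)
    return value_at_boundary(right_d, right_y) - value_at_boundary(left_d, left_y)


rd_z_smooth = rd_estimate(z_jump=0.0)
rd_z_jumps = rd_estimate(z_jump=0.5)

print(f"true effect of X:       {BETA_X:.2f}")
print(f"RD, Z smooth at border: {rd_z_smooth:.2f}")
print(f"RD, Z jumps at border:  {rd_z_jumps:.2f}")
```

When $Z$ varies smoothly across the boundary the local comparison isolates $X$. When $Z$ jumps too, the boundary is just another bundle.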

Spatial regression discontinuity DAG

Path 2 as a construct validity problem

In many settings $X_p$ does not vary independently of $\mathbf{Z}_p$ even in principle, so the effect of a specific place characteristic may not be point-identified without functional form assumptions. In practice, Path 2 is a construct validity problem. A design gives you credible variation in something, and whether that something corresponds to the theoretical construct you care about is a separate claim, much like a survey experiment that aims to prime one emotion but may prime others alongside it. When clean identification is not available, researchers fall back on inference to the best explanation, arguing for the characteristic-specific effect through abductive reasoning.

How the Claude/Codex diary works

A year ago, the main limitation of these models seemed to be memory. It was like working with a brilliant computer scientist who, for the life of them, could not remember what I was working on. This has gotten much better by default in the most expensive models, to the point where ChatGPT Pro will bring up conversations about other projects when discussing completely different projects, drawing connections and helping me place a question in the broader context of my research agenda. But as before, the infrastructure around these models is the real key to getting the most out of the intelligence embedded in them. There is a great deal that can be done with scaffolding, particularly having Claude or Codex keep detailed notes on what you are doing and cataloguing decisions and preferences, to make memory better and help the models stay on track.

So we do not have to rely on conversational recall to make Codex productive, and we can preserve memory across sessions and different computing environments by keeping track of what we do in durable files. In practice, that means storing context in repository files that Codex is told to read at the start of work, especially AGENTS.md, KNOWLEDGE_BASE.md, and MEMORY.md. KNOWLEDGE_BASE.md is for domain truth that should remain true across sessions: project definitions, architecture, naming rules, and recurring domain facts. In my repos, MEMORY.md is usually read near the start of any non-trivial task because the local AGENTS.md files explicitly tell Codex to read it before proceeding. So the file is not magical application memory; it works because the repo instructions repeatedly promote it into the model context at the beginning of work. Claude's design anticipates all this, and its memory is much better and more automatic than Codex's.
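To make the "not magical" point concrete, here is a rough sketch of what promoting files into context amounts to. In practice the agent reads these files itself because AGENTS.md tells it to; the `build_session_context` helper below is hypothetical and only illustrates the idea.

```python
from pathlib import Path

# File names taken from the repo convention described above.
CONTEXT_FILES = ["AGENTS.md", "KNOWLEDGE_BASE.md", "MEMORY.md"]


def build_session_context(repo_root):
    """Concatenate whichever durable context files exist in the repo.

    Hypothetical helper: memory here is nothing more than text files
    that get placed in front of the model at the start of a session.
    """
    sections = []
    for name in CONTEXT_FILES:
        path = Path(repo_root) / name
        if path.is_file():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)
```

Calling `build_session_context(".")` from a repo root returns one block of text with a heading per file, skipping any file that does not exist.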

As I was thinking about this, I decided that Claude/Codex should keep even more detailed notes on what I ask it to do, so I can refer back to them, since my own memory is also not perfect. So I set up instructions in each repo to take more copious notes, storing them in run logs. I then asked Codex to help me set up functionality where it does a regular sweep of each repo's logs, drafts a bulleted list of a sample of things it helped me do over the last few days, and turns that list into a Codex diary blog post.
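The sweep itself could look something like the sketch below. This is not the actual setup. It assumes a hypothetical layout in which each repo has a `run_logs/` folder of files named `YYYY-MM-DD.md`, with one `- ` bullet per logged task, and it stops at the bulleted draft rather than writing the post.

```python
import random
from datetime import date, timedelta
from pathlib import Path


def collect_recent_entries(repo_roots, today, days=3, log_dir="run_logs"):
    """Gather bullet lines from each repo's logs dated within the last `days` days."""
    cutoff = today - timedelta(days=days)
    entries = []
    for root in repo_roots:
        root = Path(root)
        log_path = root / log_dir
        if not log_path.is_dir():
            continue  # repo has no run logs yet
        for log in sorted(log_path.glob("*.md")):
            try:
                log_date = date.fromisoformat(log.stem)
            except ValueError:
                continue  # skip files not named YYYY-MM-DD.md
            if cutoff <= log_date <= today:
                for line in log.read_text(encoding="utf-8").splitlines():
                    if line.startswith("- "):
                        entries.append(f"{root.name}: {line[2:].strip()}")
    return entries


def draft_diary(entries, k=5, seed=0):
    """Sample up to k entries and format them as a bulleted draft."""
    rng = random.Random(seed)
    picked = rng.sample(entries, min(k, len(entries)))
    return "\n".join(f"- {entry}" for entry in picked)
```

The bulleted draft would then go back to the model as the raw material for the diary post.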

It strikes me as an interesting example of both the high-level possibilities here, using better memory retention to increase operational capacity, and the low-level ones, like using the same system to write a blog post that explains what these tools can do. In all this I am basically an amateur, and more efficient or more elegant ways of structuring these models surely exist, or will soon. Mostly I just ask Claude/Codex whether something is possible and then have it help me design it.