11 min read · The PUMSdata Team

What Is PUMS Data? A Field Guide to the Census's Most Powerful Public Dataset

Nearly every statistic you have ever read about the American population — the median household income of a metro area, the share of renters under 35, the commute time in your county — is a summary. Someone at the Census Bureau decided which table to publish, and you got the cell they chose. PUMS is what sits underneath those tables: the raw, record-level survey data itself. Learn to use it and you stop being a reader of statistics and start being a producer of them.

PUMS stands for the Public Use Microdata Sample. The operative word is microdata: data at the level of the individual record rather than the aggregate. Instead of a table that tells you that 36.9% of American adults hold a bachelor's degree, PUMS hands you a (de-identified) sample of several million individual people, each one a row, each row carrying hundreds of attributes: age, occupation, income, educational attainment, housing tenure, commute mode, health-insurance source, and on and on. The aggregate statistics are not given to you. They are yours to compute, for any group you can define.

Tables versus the raw material

It helps to be precise about the distinction, because it is the entire point. Most Census products — the ones you reach via data.census.gov — are pre-tabulated. The Bureau runs a survey, computes a fixed menu of cross-tabulations, and publishes those. This is enormously useful and covers the questions most people ask. But it is, by construction, a finite menu. If the table you need was never on it, you are out of luck.

PUMS inverts the arrangement. The Bureau releases the (anonymized) underlying records and lets you tabulate them yourself. The cost is that you have to do the tabulating — correctly, which, as we'll see, is less trivial than summing a column. The benefit is that the menu becomes effectively infinite. Any question you can phrase as a filter over the variables, PUMS can answer.

What a single row actually is

The records come from the American Community Survey (ACS), the rolling survey that replaced the old decennial “long form” in 2005. The Bureau contacts roughly 3.5 million addresses a year. The resulting 1-year PUMS file is about a 1% sample of the U.S. population, on the order of 3.4 million person records, nested inside their housing units. (There is also a 5-year file, roughly a 5% sample, which trades timeliness for precision and finer geographic detail.)

Each person record is a long vector of coded variables. A few you will meet constantly: AGEP (age), SCHL (educational attainment), OCCP (occupation, drawn from a code list with hundreds of categories), and household-level fields like HINCP (household income) and TEN (whether the home is owned or rented). The values are numeric codes, not words: educational attainment is an integer from 1 to 24, not the string “bachelor's degree.” Decoding them is the first of several reasons PUMS has a reputation for being unfriendly, and it is why we publish the full PUMS data dictionary in plain, linkable form.
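To make the decoding step concrete, here is a minimal sketch in Python. The lookup tables are deliberately partial, the labels paraphrase the data dictionary and should be checked against the dictionary for your survey year, and the `decode` helper and the sample record are hypothetical names invented for this illustration.

```python
# Partial lookup tables. Labels paraphrase the ACS PUMS data dictionary;
# verify them against the dictionary for the survey year you are using.
SCHL_LABELS = {
    16: "Regular high school diploma",
    17: "GED or alternative credential",
    20: "Associate's degree",
    21: "Bachelor's degree",
    22: "Master's degree",
    23: "Professional degree beyond a bachelor's degree",
    24: "Doctorate degree",
}
TEN_LABELS = {
    1: "Owned with mortgage or loan",
    2: "Owned free and clear",
    3: "Rented",
    4: "Occupied without payment of rent",
}
MISSING = "other (not in this partial table)"


def decode(record):
    """Return a copy of a coded record with human-readable labels added."""
    out = dict(record)
    out["SCHL_label"] = SCHL_LABELS.get(record.get("SCHL"), MISSING)
    out["TEN_label"] = TEN_LABELS.get(record.get("TEN"), MISSING)
    return out


# A made-up record, for illustration only.
person = {"AGEP": 29, "SCHL": 21, "TEN": 3}
decoded = decode(person)
print(decoded["SCHL_label"], "|", decoded["TEN_label"])
```

In a real file the codes often arrive as strings, sometimes zero-padded or blank, so a conversion step usually comes before the lookup.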

Why microdata has to exist: the combinatorics of curiosity

You might reasonably ask why the Bureau doesn't simply publish every table anyone could want. The answer is arithmetic. Suppose you want to know about renters, aged 25 to 34, with a bachelor's degree, living below the poverty line, in a particular kind of place. That is a four- or five-way cross-tabulation. The number of possible cross-tabulations across hundreds of variables, each with many categories, is astronomically larger than any agency could compute, store, or that anyone would want to wade through. The combinatorics defeat pre-publication.
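A back-of-the-envelope calculation shows the scale. The numbers below are round, hypothetical assumptions (200 variables, 10 categories each), not counts taken from any actual PUMS file.

```python
import math

# Hypothetical round numbers, chosen only to show the scale of the problem.
n_variables = 200         # stands in for "hundreds of variables"
k = 5                     # a five-way cross-tabulation
categories_per_var = 10   # assumed average number of categories per variable

# Number of distinct sets of five variables that could be crossed.
tables = math.comb(n_variables, k)

# Cells in one five-way table if every variable had ten categories.
cells_per_table = categories_per_var ** k

print(f"{tables:,} possible five-way tables")
print(f"{cells_per_table:,} cells in each")
print(f"{tables * cells_per_table:,} cells in total")
```

Under these assumptions there are about 2.5 billion five-way tables alone, before counting the two-, three-, and four-way ones.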

Microdata is the Bureau's answer to a problem it cannot solve by enumeration: it ships the ingredients instead of trying to ship every possible meal.

This is precisely the slice of the data world where the published tables fail people, and it is the reason microdata is the working substrate of empirical economics, demography, and social science. If a study quantifies some specific, oddly-shaped subpopulation, there is a good chance a microdata file made it possible.

The weight is the whole game

Here is the concept that separates someone who uses PUMS correctly from someone who quietly produces wrong numbers. A PUMS file is a sample, and not a simple random one. It is the product of a complex survey design — stratified, clustered, with some groups deliberately sampled at different rates and with adjustments for who responds and who doesn't. As a result, the records are not equally representative. You cannot just count them.

Every record therefore carries a survey weight: PWGTP for persons, WGTP for housing units. The weight is, in intuition, an expansion factor: roughly, the number of people in the full population that this one sampled record stands in for. In a 1% sample the weights average around 100 (one record speaking for about a hundred Americans), but they vary from record to record precisely because the design isn't uniform. To estimate any population total, you don't count the matching records; you sum their weights. To estimate a mean or a median, you compute it weighted. Ignore the weights and your “estimates” describe the quirks of the sample rather than the country.
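The difference between counting and weighting is easy to see on a toy example. This is a sketch with three made-up records. The helper functions are hypothetical, and the median shown is a plain weighted median, which may differ from the procedure behind the Bureau's published medians.

```python
def weighted_total(records, weight="PWGTP"):
    """Population estimate: the sum of the weights, not the count of rows."""
    return sum(r[weight] for r in records)


def weighted_mean(records, value, weight="PWGTP"):
    """Mean of `value` with each record counted `weight` times."""
    total_w = sum(r[weight] for r in records)
    return sum(r[value] * r[weight] for r in records) / total_w


def weighted_median(records, value, weight="PWGTP"):
    """Smallest value at which the cumulative weight reaches half the total."""
    ordered = sorted(records, key=lambda r: r[value])
    half = sum(r[weight] for r in ordered) / 2
    running = 0
    for r in ordered:
        running += r[weight]
        if running >= half:
            return r[value]


# Made-up records, for illustration only.
sample = [
    {"AGEP": 25, "PWGTP": 50},
    {"AGEP": 40, "PWGTP": 150},
    {"AGEP": 70, "PWGTP": 100},
]

unweighted_mean = sum(r["AGEP"] for r in sample) / len(sample)
print("rows:", len(sample))                              # 3
print("estimated people:", weighted_total(sample))       # 300
print("unweighted mean age:", unweighted_mean)           # 45.0
print("weighted mean age:", weighted_mean(sample, "AGEP"))      # 47.5
print("weighted median age:", weighted_median(sample, "AGEP"))  # 40
```

Three rows stand in for 300 people, and the weighted mean (47.5) differs from the naive one (45.0) because the records do not count equally.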

Weights also govern a subtler thing: uncertainty. Because PUMS is a sample, every number it yields is an estimate with a margin of error, and the complex design means you cannot get that margin from the textbook square-root-of-n formula. The Bureau's solution is to ship 80 replicate weights alongside the main one — 80 slightly perturbed re-weightings of the sample. You compute your statistic 81 times, once with the real weights and once with each replicate, and the spread across those results, via a method called successive difference replication, gives you a defensible standard error. It is more work than most people expect, and skipping it is how confident-sounding but statistically meaningless claims get made.
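In code, the 81-pass procedure might look like the sketch below. It assumes the replicate weights are named PWGTP1 through PWGTP80 and uses the successive difference replication formula as commonly stated for the ACS: the standard error is the square root of 4/80 times the sum of squared differences between each replicate estimate and the full-sample estimate. The function names and demo records are hypothetical.

```python
import math


def sdr_standard_error(full_estimate, replicate_estimates):
    """Successive difference replication: sqrt(4/80 * sum of squared deviations)."""
    if len(replicate_estimates) != 80:
        raise ValueError("expected 80 replicate estimates")
    squared = sum((rep - full_estimate) ** 2 for rep in replicate_estimates)
    return math.sqrt(4 * squared / 80)


def estimate_with_se(records, statistic):
    """Run `statistic(records, weight_name)` once with the main weight and
    once with each of the 80 replicate weights, then combine the results."""
    full = statistic(records, "PWGTP")
    reps = [statistic(records, f"PWGTP{i}") for i in range(1, 81)]
    return full, sdr_standard_error(full, reps)


def total(records, weight):
    """The statistic used in the demo: a weighted population total."""
    return sum(r[weight] for r in records)


# Two made-up records. The first one's replicate weights alternate 110 / 90;
# the second one's replicate weights all equal its main weight.
rec_a = {"PWGTP": 100}
rec_b = {"PWGTP": 100}
for i in range(1, 81):
    rec_a[f"PWGTP{i}"] = 110 if i % 2 else 90
    rec_b[f"PWGTP{i}"] = 100

estimate, se = estimate_with_se([rec_a, rec_b], total)
print(estimate, se)
```

Any statistic that takes a weight column (a total, a mean, a median) can be passed in the same way, which is what makes the replicate approach general.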

Geography, and the privacy bargain

Microdata creates an obvious tension. The Bureau is bound by law (Title 13 of the U.S. Code) to never release information that could identify an individual. But a record rich enough to be analytically useful — exact age, occupation, income, household composition — is also a record that, paired with a precise location, could single someone out. The resolution is to coarsen the thing most dangerous to privacy: geography.

The finest location PUMS will tell you is the PUMA, or Public Use Microdata Area — a region drawn to contain at least 100,000 people. There are roughly 2,400 of them covering the country, and they are the smallest geography at which microdata is released. You can learn an enormous amount about a PUMA, but you cannot zoom below it, and that floor is a deliberate privacy guarantee, not an oversight. (You can browse every state and PUMA to see exactly what that resolution looks like in practice.)

The same logic produces two other features that surprise newcomers. Extreme values are top-coded: ages above a cutoff are collapsed to a ceiling, and very high incomes are capped, because a 105-year-old or a uniquely high earner in a small area is identifiable in a way a typical record is not. And dollar figures come with adjustment factors (ADJINC, ADJHSG) that you apply to put income and housing costs into consistent, constant dollars. None of this is the data being evasive; it is the disclosure-avoidance machinery that makes a public microdata file possible at all.
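Applying an adjustment factor is a single multiplication. The sketch below assumes the factor is stored as an integer with six implied decimal places, so a stored 1042311 would mean 1.042311. That storage convention should be confirmed in the documentation for your file, and the factor value, dollar amount, and function name here are made up for illustration.

```python
def adjust_dollars(amount, factor):
    """Scale a reported dollar amount by an adjustment factor such as ADJINC.

    Assumes the factor is an integer with six implied decimal places
    (for example, 1042311 meaning 1.042311). Check your file's documentation.
    """
    return amount * factor / 1_000_000


# Made-up values, for illustration only.
reported_hincp = 50_000
adjinc = 1_042_311  # hypothetical factor, not taken from any real file

adjusted = adjust_dollars(reported_hincp, adjinc)
print(round(adjusted, 2))
```

The same helper would apply ADJHSG to housing-cost variables, under the same storage assumption.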

What it is actually good for

Once the mechanics are in hand, the range of questions opens up dramatically. PUMS is the tool of choice when the question is specific enough that no standard table covers it:

  • Labor and the economy — wage distributions within a single occupation, the prevalence of remote work by industry and age, how earnings vary with education for a defined group. The granularity is what makes labor economics with PUMS so productive.
  • Housing — cost burden (rent or owner costs as a share of income) for renters versus owners, by household type, in a given region — the kind of cut that drives housing policy and underwriting alike.
  • Demography and migration — the size and characteristics of narrowly defined populations: recent movers, multigenerational households, specific ancestry or language groups, veterans by era of service.
  • Markets and planning — sizing a customer or constituent segment in a place, which is just a population estimate for a multi-variable filter — exactly the operation PUMS is built to support.
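Sizing a segment like the ones above can be sketched as a predicate plus a sum of weights. The example assumes the household-level TEN field has already been merged onto each person record, treats SCHL codes of 21 and above as "bachelor's degree or higher", and uses POVPIP (the income-to-poverty ratio, where values under 100 mean below the poverty line). The function names and all records are invented for illustration.

```python
def segment_size(records, predicate, weight="PWGTP"):
    """Estimated population of a segment: sum of weights over matching records."""
    return sum(r[weight] for r in records if predicate(r))


def renters_25_34_with_ba_in_poverty(r):
    """Renters aged 25 to 34 with a bachelor's degree or higher, below poverty."""
    return (
        r["TEN"] == 3                 # rented
        and 25 <= r["AGEP"] <= 34     # age band
        and r["SCHL"] >= 21           # bachelor's degree or higher
        and r["POVPIP"] < 100         # income under 100% of the poverty threshold
    )


# Made-up person records with household tenure already merged on.
people = [
    {"AGEP": 28, "SCHL": 21, "TEN": 3, "POVPIP": 80,  "PWGTP": 120},  # match
    {"AGEP": 31, "SCHL": 22, "TEN": 3, "POVPIP": 95,  "PWGTP": 85},   # match
    {"AGEP": 30, "SCHL": 21, "TEN": 1, "POVPIP": 60,  "PWGTP": 100},  # owner
    {"AGEP": 45, "SCHL": 21, "TEN": 3, "POVPIP": 50,  "PWGTP": 90},   # too old
    {"AGEP": 26, "SCHL": 16, "TEN": 3, "POVPIP": 40,  "PWGTP": 110},  # no degree
    {"AGEP": 33, "SCHL": 23, "TEN": 3, "POVPIP": 250, "PWGTP": 70},   # not in poverty
]

estimate = segment_size(people, renters_25_34_with_ba_in_poverty)
print(estimate)  # 205
```

Real files also contain records where a field such as POVPIP is not applicable, so a production filter needs to handle missing values before comparing.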

Data journalists, urban and economic-development planners, policy and nonprofit analysts, academic researchers, and market analysts all lean on it for the same underlying reason: it answers the question you actually have, not the nearest question someone else chose to publish.

How to read a PUMS estimate honestly

Because it is a sample, intellectual honesty requires a few habits. Treat every figure as an estimate with uncertainty, not a census count. Be especially wary of small subgroups, where only a handful of sampled records may underlie a number and the margin of error balloons. Use the 1-year file for timeliness and the 5-year file when you need precision or smaller geographies. Adjust dollars to constant terms before comparing across years. And when a result hinges on a difference between two estimates, check whether that difference is larger than its combined margin of error before believing it. These are not reasons to distrust PUMS; they are the conditions under which it tells the truth.
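The last habit, checking a difference against its combined margin of error, can be sketched as follows. It assumes the two estimates are independent and uses the 90% confidence level (critical value 1.645) that the Bureau uses for published ACS margins of error. The function names and the numbers are hypothetical.

```python
import math

Z_90 = 1.645  # critical value for a 90% confidence level


def margin_of_error(se, z=Z_90):
    """Margin of error for a single estimate, given its standard error."""
    return z * se


def difference_is_significant(est_a, se_a, est_b, se_b, z=Z_90):
    """True if |a - b| exceeds the margin of error of the difference.

    Treats the two estimates as independent, so their variances add.
    """
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(est_a - est_b) > z * se_diff


# Made-up median incomes and standard errors for two groups.
print(margin_of_error(1500))                                   # 2467.5
print(difference_is_significant(52_000, 1500, 49_000, 2000))   # False
print(difference_is_significant(60_000, 1500, 49_000, 2000))   # True
```

In the first comparison the 3,000-dollar gap sits inside a combined margin of error of about 4,100 dollars, so the data cannot distinguish the two groups.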

The catch — and why this usually gets left to specialists

Everything above explains both why PUMS is powerful and why it has historically been the province of people with a statistics background and a comfort with code. The files arrive as large fixed-width or CSV downloads of numeric codes; the variables must be decoded against a dictionary; the weights must be applied to every total and every average; and honest uncertainty requires juggling 80 replicate weights. Each step is surmountable, and together they are a meaningful barrier — which is exactly why so much of what PUMS could tell us goes unasked.

Removing that barrier is the entire reason PUMSdata exists. We decode the variables, apply the weights for you so every number is a proper population estimate, and let you build a multi-variable segment or map a measure across all 50 states and 2,400+ PUMAs by pointing and clicking — no downloads, no code. The statistical substrate of modern social science, in other words, without the apprenticeship. If this field guide made you want to ask a question of your own, you can start exploring it free.

Tags: PUMS, ACS, Census, microdata, methodology
