Seminar Schedule




MCSC 310


Web Scraping and Capture-Recapture: Can they really be used to produce official statistics?

Linda J Young, Ph.D.

Chief Mathematical Statistician and Director of Research and Development

USDA National Agricultural Statistics Service

Research at USDA’s National Agricultural Statistics Service (NASS) focuses on developing improved methods for meeting NASS’s mission to provide timely, accurate, and useful statistics in service to U.S. agriculture. Examples of NASS’s wide-ranging current research projects will be discussed, and one of these projects will be explored in more depth.

NASS maintains a list frame of all known farms and potential farms in the United States (US). Although extensive effort is made to keep this list as complete as possible, not all farms are on it. Farms in the emerging sectors of agriculture, including urban, organic, and local foods farms, tend to be smaller, more transient, more dispersed, and more diverse than traditional farms in the rural areas of the US. They are also less likely to be on the NASS list frame. For the 2012 Census of Agriculture, NASS used capture-recapture methods to account not only for undercoverage of the NASS list frame but also for nonresponse and misclassification. The two capture-recapture samples were the respondents from (1) the NASS list frame and (2) the June Agricultural Survey (JAS) sample drawn from the NASS area frame. A challenge with using the area frame for the second sample is that the types of farms that are often not well covered by the NASS list frame tend to be sparse in the JAS sample. Thus, NASS has been evaluating the use of web-scraped list frames as a second frame from which a sample could be drawn within a capture-recapture framework to assess undercoverage not only for the census but also for surveys. The projects that have been conducted in this area, as well as a large feasibility study that is now underway, will be presented. The assumptions underlying these methods and their validity will be discussed, and open questions will be highlighted.
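
For readers unfamiliar with capture-recapture, the basic two-sample idea can be sketched with the textbook Lincoln-Petersen estimator and Chapman's bias-corrected variant. This is only an illustration and not the procedure NASS used, which also adjusts for nonresponse and misclassification. The function names and all counts below are hypothetical. The sketch assumes a closed population, independent captures, equal capture probabilities within each sample, and error-free matching of farms between the two sources.

```python
def lincoln_petersen(n1, n2, m):
    """Textbook two-sample estimate of population size.

    n1: units found in the first source (e.g. a list frame)
    n2: units found in the second source (e.g. an area or web-scraped sample)
    m:  units found in both sources
    """
    if m <= 0 or m > min(n1, n2):
        raise ValueError("m must satisfy 0 < m <= min(n1, n2)")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's bias-corrected variant; defined even when m is 0."""
    if m < 0 or m > min(n1, n2):
        raise ValueError("m must satisfy 0 <= m <= min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts, chosen only for illustration.
n_list, n_second, n_both = 900, 200, 150

n_hat = lincoln_petersen(n_list, n_second, n_both)
coverage = n_list / n_hat  # estimated share of the population on the first list

print(f"Lincoln-Petersen estimate: {n_hat:.0f}")
print(f"Chapman estimate: {chapman(n_list, n_second, n_both):.1f}")
print(f"Estimated list coverage: {coverage:.2f}")
```

With these made-up counts, 150 of the 200 units in the second sample also appear on the first list, so the list is estimated to cover about 75% of the population. If farms that are missing from the list are also rare in the second sample, the overlap rests on few observations and the estimate becomes unstable, which is the sparseness problem the abstract describes for the JAS sample.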