PLDF3 Background


The U.S. National Center for Education Statistics (NCES) has published a number of series of library data. With the public library series, it continues a rich tradition of compiling and publishing library data by the Department of Education and its predecessors, one that goes back to the seminal Public Libraries in the United States of America: Their History, Condition, and Management (Washington: GPO, 1876), commonly called the “1876 Report.” The report was published by what was then the Bureau of Education. Older library data were published by the U.S. government, but that history will have to wait for another time. A good place to start is Robert V. Williams, “The Making of Statistics of National Scope on American Libraries, 1836-1986: Purposes, Problems, and Issues,” Libraries and Culture 26(2): 464-485 (Spring 1991). When PLDF3 was created, the series was the technical work of NCES, with the National Commission on Libraries and Information Science (NCLIS) in an advisory role. Since NCLIS was closed, the Institute of Museum and Library Services (IMLS) has continued the series. IMLS preserved the collection's infrastructure, so PLDF3 continues on that infrastructure with data for the years IMLS has managed the collection.

The name of the dataset discussed here, “PLDF3,” reflects the fact that it sits in an evolutionary continuum. In the discussion of the PLDF3 variables, the reader will see that over the years a number of variable names were used for essentially the same variables, while the wording of their definitions rarely changed. Zipcodes, for example, are recorded in a variable called ZIP in PLDF3, which was the name used in the 1998-2004 datasets. However, ZIP1 was used from 1991 to 1997, LIBZIP from 1988 to 1990, and FLDE in 1987. (Note that in the dataset, NCES variable names are in uppercase while variables I created are lowercase and italicized here for clarity.) PLDF1 was the first merge of the various datasets, and in it all four of these zipcode variables (indeed, all the various variables used in each year) appear in the merged data. Thus, ZIP1 data were in the 1991 to 1997 data, but space for ZIP1 was included for the other years. LIBZIP data were included for 1988 to 1990 and empty spaces kept for the other years. And so on. The result was a dataset that had something close to 150 variables and whose master dataset was about 230 megabytes in SAS format.

A spreadsheet, now available in three formats (.csv, .ods, and .xlsx), lists these variables by year. Using PLDF1 would have required an analyst to sort through the documentation and variable names and to write code to collapse each variable's various names into a single variable name. This collapsing of variables is what PLDF2 did, and it has to be noted that the excellent and professional documentation from NCES made this work possible. As a result of collapsing like variables into one variable name, PLDF2 had far fewer variables than the 150 or so in PLDF1. PLDF2 was 125 megabytes as a SAS dataset.
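The collapsing step can be illustrated with a minimal Python sketch. It is not the code used to build PLDF2; it only shows the idea for the zipcode variable, using the four names given above (ZIP, ZIP1, LIBZIP, FLDE). The sample rows, the YEAR column name, the zipcode values, and the function name collapse_zip are assumptions made for illustration.

```python
# Year-specific names for the zipcode variable, as listed in the text.
ZIP_ALIASES = ("ZIP", "ZIP1", "LIBZIP", "FLDE")


def collapse_zip(record):
    """Return a copy of record with every zipcode alias merged into 'ZIP'.

    The first alias holding a non-empty value wins; in a PLDF1-style row
    only the alias belonging to that row's year is expected to be filled.
    """
    value = None
    for name in ZIP_ALIASES:
        candidate = record.get(name)
        if candidate not in (None, ""):
            value = candidate
            break
    # Drop all aliases, then store the single collapsed variable.
    collapsed = {k: v for k, v in record.items() if k not in ZIP_ALIASES}
    collapsed["ZIP"] = value
    return collapsed


# Hypothetical PLDF1-style rows: every alias has a column in every year,
# but only one of them carries data.
pldf1_rows = [
    {"YEAR": 1987, "FLDE": "10001", "LIBZIP": "", "ZIP1": "", "ZIP": ""},
    {"YEAR": 1989, "FLDE": "", "LIBZIP": "10001", "ZIP1": "", "ZIP": ""},
    {"YEAR": 1995, "FLDE": "", "LIBZIP": "", "ZIP1": "10001", "ZIP": ""},
    {"YEAR": 2001, "FLDE": "", "LIBZIP": "", "ZIP1": "", "ZIP": "10001"},
]

pldf2_rows = [collapse_zip(row) for row in pldf1_rows]
```

After this step each row carries one ZIP column instead of four, which is why a collapsed dataset has far fewer variables and takes less space than the merged one.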

A problem arose if one wanted to use PLDF2 to track changes through time: are the changes we see from year to year the result of underlying changes in the conditions of libraries, or a result of the fact that each year we may be measuring a different set of libraries as new libraries open and old ones close? For a number of purposes, it would be useful to analyze only libraries reporting in every year so that we compare, as the saying goes, “apples to apples,” or to be able to analyze any given set of individual libraries, or one library, through time. Because of the way the data were published, such analysis was impossible with PLDF2, and this led to the complex process of creating PLDF3. This process is discussed in some detail in the documentation of that dataset.

Creating PLDF3 required the construction of a unique key variable, that is, an identifier for each library so that individual libraries can be analyzed. The NCES variable FSCSKEY appears at first blush to be such a number, but it is not. A large number of libraries have more than one such key in PLDF2, and a few have as many as five. My guess is that, on average, each library in the full dataset has about two FSCSKEYs. This is an artifact of the history of the collection of these data and a reflection of a characteristic of library data collected annually from a population in a systematic fashion. Once a key variable that is unique to each library over time has been constructed, it is relatively easy for an analyst to group libraries, or to select one library, by the span of years in which they reported data. The new identifier would also have other uses. PLDF3 differs from PLDF2 by virtue of the new variable newkey, which resulted from an attempt to create a key variable unique to each library over time. How this was done is a complicated story and is discussed on the main PLDF3 page.
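To make the role of a stable key concrete, here is a small Python sketch of how one might use it to select only the libraries that reported in every year of a span. It does not show how newkey was actually constructed, which is described on the main PLDF3 page. The crosswalk contents, the FSCSKEY values, and the YEAR and VISITS column names are hypothetical.

```python
from collections import defaultdict

# Hypothetical crosswalk from FSCSKEY to newkey. Library 1 changed its
# FSCSKEY between years; library 2 kept a single FSCSKEY.
crosswalk = {"AA0001": 1, "AA0105": 1, "BB0002": 2}

# Hypothetical yearly rows; library 2 did not report in 1996.
rows = [
    {"YEAR": 1995, "FSCSKEY": "AA0001", "VISITS": 100},
    {"YEAR": 1996, "FSCSKEY": "AA0105", "VISITS": 120},
    {"YEAR": 1997, "FSCSKEY": "AA0105", "VISITS": 130},
    {"YEAR": 1995, "FSCSKEY": "BB0002", "VISITS": 50},
    {"YEAR": 1997, "FSCSKEY": "BB0002", "VISITS": 55},
]


def attach_newkey(rows, crosswalk):
    """Return copies of the rows with a 'newkey' field looked up by FSCSKEY."""
    return [dict(row, newkey=crosswalk[row["FSCSKEY"]]) for row in rows]


def reported_every_year(rows, years):
    """Return the set of newkey values that have a row in each of the years."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["newkey"]].add(row["YEAR"])
    wanted = set(years)
    return {key for key, reported in seen.items() if wanted <= reported}


keyed = attach_newkey(rows, crosswalk)
panel_keys = reported_every_year(keyed, range(1995, 1998))
# Keep only the libraries present in all three years.
panel = [row for row in keyed if row["newkey"] in panel_keys]
```

Grouping on FSCSKEY alone would treat library 1 as two different libraries, neither of which reported in all three years. Grouping on the stable key recovers it as a single library with a complete series.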

December 14, 2022