These vagaries in data collection over the years stand here as they appeared in the original annual issues of the statistics (unless corrections were reported by individual libraries.) What was or was not counted in the past library data collection practices has proved too well hidden in the mists of time.
Cumulated ARL University Library Statistics, 1962-63 through 1978-79 by Kendon Stubbs and David Buxton, (page v.)
The data published in the FSCS/PLS series have imputations. That is, values that are missing or anomalous in the original forms may be changed based on editorial decisions and/or algorithms. In constructing PLDF3, I have tried very hard to remove all imputed data. I may not have succeeded but that was my goal.
Compiling data is a different task from analyzing data. The two use overlapping tools, but their purposes differ. FSCS/PLS are compiled in order to analyze the state of US libraries, and in analysis, anomalous and missing data can be a problem. If the problem at hand is to analyze the state of US libraries, and you have these data, you will want to examine the data with care, and in doing that you will, inevitably, notice things. Typically, data go through an editing process. These days, a library's data are entered into some digital form and then compared to previous years' data from that library. What if there are anomalous or missing data and no response to queries? I will discuss anomalous values in a bit, but the documentation for the NCES and IMLS public library data has extensive discussions of the algorithms used to change or create data. These edits are intelligent, thoughtful, and seem reasonable for their analytic purposes. If those are your purposes, PLDF3 will not help you, but you can download the data from IMLS or this Archive and create a data file similar to PLDF3 but with imputed data. I would suggest that anyone doing that consult the revision history, which has my notes on what I believe are errors in the annual files that affect the universe of libraries. Similarly, consult the schedule of changes in the FSCSKEYs, where the newkeys for the various libraries were added. Those files have what I believe are errors that will affect a PLDF3-like file of imputed data.
The compiler's role is different from that of an analyst. The compiler's task is to try to reconstruct the original data. “What was or was not counted in the past library data collection practices has proved too well hidden in the mists of time.” Or as I have said, the compiler must “first do no harm.” Changing numbers or making numbers up means creating a dataset made up of numbers arrived at through some theory about those numbers. We should derive our theories from numbers, not our numbers from theories.
Let's consider some anomalies
Elsewhere on this site, I discuss the Purdue data. To the best of my knowledge, the only documentation of these data was done in the Gerould Statistics, as discussed there. Assuming that the dataset I had is the real Purdue data (and I feel confident they are), the problems I found suggest that changes were made that were likely incorrect but also likely undocumented and, hence, careless. There are series where numbers do not exist in the putative source, where they were “omitted,” yet appear in the Purdue data. One dealt with library growth. In those years, it was the accepted conclusion that academic libraries grew exponentially. Knowing that, if you were to create some numbers, you would be inclined, I expect, to create something exponential. Looking at the numbers, it is not clear that they did that, but I can't be sure. In any case, if I came along in 1984 and wanted to examine the exponential growth hypothesis and found the Purdue data, then analyzed them, what if I found that, lo!, library growth sure looked exponential? I think succumbing to the temptation to do such things with a set of data is not good. I don't have a word for it, but a set of data like this is a text, and there are principles to follow if one is trying to reconstruct the ur-data. The man who did that to Shakespeare's text, Thomas Bowdler, is immortalized with the word: bowdlerize.
Let's look at a few years of data from Princeton's library:
Princeton University Library Volumes Added, Select Years
1936/37 is anomalous, isn't it? What should we do? It cannot be right, can it?
One of the facts about digital datasets is that we do not have footnotes. The paper copies of the ARL Statistics have numerous footnotes about the various numbers but these footnotes do not get converted to digital formats. We do not have notes from those who talked to librarians about questions in data submitted. I don't know how including such notes and footnotes would be done but there is a great deal of information in them. As it happens, in this case, the Princeton Compilation has a footnote and it clarifies this number: The Princeton library included the data from the Gest Oriental Library with its 133,419 volumes with the 1936/37 data.
Absent the footnote, what should we do? Do not change the number because this is the best estimate we have. Libraries buy libraries, libraries burn, things happen. Libraries are uniformitarian in nature generally, but occasionally not. To change such a number, one must find an authoritative source; otherwise, this is another number hidden by the mists of time.
In the ARL data, there are two cases with an impossible relationship: the number of Volumes Added, Net is greater than the number of Volumes Added, Gross. In the first year of the Gerould data there was a number for books “Added Last Year.” The Gerould/ARL data add variables by several methods, and one is disaggregation. This number was retitled “Volumes Added” and then subsequently split into “Volumes Added, Gross” and “Volumes Added, Net.” Volumes Added, Gross minus volumes withdrawn = Volumes Added, Net. The net figure might equal the gross figure but it cannot be larger. We know we have at least one wrong number in each pair. But: which one? Might they be reversed? Could well be. What about mistyping them in the ARL Statistics? Maybe. There are all sorts of “maybes,” but absent some authoritative answer, it is just as well to leave them alone. You might go from one incorrect number to two.
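The gross/net rule above lends itself to a check that reports the impossible pairs without altering them. What follows is only a minimal sketch in Python, not a description of how the ARL data or PLDF3 are actually processed. The record layout, the function name `flag_net_exceeds_gross`, and every row are hypothetical and made up for illustration; none of the figures come from the ARL Statistics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a record cannot be changed after it is created
class VolumesAdded:
    library: str
    year: str
    gross: Optional[int]  # Volumes Added, Gross (None when not reported)
    net: Optional[int]    # Volumes Added, Net (None when not reported)


def flag_net_exceeds_gross(records: List[VolumesAdded]) -> List[VolumesAdded]:
    """Return the records where net > gross. Nothing is corrected or imputed."""
    return [
        r for r in records
        if r.gross is not None and r.net is not None and r.net > r.gross
    ]


# Hypothetical rows, invented for this example only.
records = [
    VolumesAdded("Library A", "1970/71", gross=50_000, net=48_000),
    VolumesAdded("Library B", "1970/71", gross=30_000, net=31_500),  # impossible pair
    VolumesAdded("Library C", "1970/71", gross=None, net=12_000),    # missing stays missing
]

flagged = flag_net_exceeds_gross(records)
for r in flagged:
    print(f"{r.library} {r.year}: net {r.net} > gross {r.gross} (left unchanged)")
```

The check only produces a list for a human to look into. Deciding which number of a pair is wrong, if that can be decided at all, still requires an authoritative source.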
First, do no harm. The compiler presents the best reconstruction of the data he or she can. Then the analyst who uses the data will find the appropriate techniques for that task.
December 14, 2022