The Purdue Data— maybe

The original Purdue data were assembled by a group at Purdue University and were used in a series of nine publications entitled The Past and Likely Future of 58 Research Libraries from 1965 through 1973. A discussion of these data was included in The Gerould Statistics in Section 2.3, on pages 12 through 19. The discussion in the digital edition of The Gerould Statistics is in that same section. I found a number of what I thought were serious problems in those data. The interested reader can go to The Gerould Statistics.

This was another compilation of data but, as I found with the Princeton data, there was almost no discussion of methodology. In one particular passage where the authors discussed assembling the data, they gave the sources they said they used. But, as I discussed, when I looked in those sources, often enough the “source” was not, in fact, the source of the data in the dataset.

One serious problem is one I wish to discuss here: making up data. There are no missing values in the Purdue data. Library data have characteristics, and one is that data will be missing from time to time. It just happens, characteristically, so a complete absence of missing values should cause one to take notice. Here is what was said in discussing “omitted” data:

...in other cases, omitted data could be supplied by interpolations, extrapolation, or other calculations from the data at hand, (p. 5)

Given that there is no discussion of how this process was done or where it was done, a user of the data cannot undo whatever the compilers did. From that consideration and from other anomalies discussed in Gerould, one might want to be circumspect in using these data.
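The claim that there are no missing values is easy to check for oneself. Here is a minimal sketch in Python, assuming the CSV file has a header row of variable names and that a missing cell would show up as an empty string, a lone period (the way SAS usually writes a missing number), or NA. That list of tokens is my assumption for illustration, not something stated in the Purdue documentation, and the sample rows are made up.

```python
import csv
import io

# Tokens treated as "missing". This set is an assumption for illustration;
# adjust it to whatever the file at hand actually uses.
MISSING_TOKENS = {"", ".", "NA", "NaN"}

def count_missing(csv_file):
    """Return {column name: number of missing cells} for an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # DictReader yields None when a row is shorter than the header.
            if value is None or value.strip() in MISSING_TOKENS:
                counts[name] += 1
    return counts

# Tiny made-up sample (not real Purdue values) with two missing VOLS cells.
sample = io.StringIO("LIBNO,YEAR,VOLS\n1,51,1000\n1,52,\n2,51,.\n")
missing_counts = count_missing(sample)
print(missing_counts)
```

To run it on the real file, one would open it with `open("Purdue.csv", newline="")` and pass the file object to `count_missing`. If every count comes back zero, that is consistent with the compilers having filled in whatever was omitted.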

There is another problem which I think is more serious: when they were making up numbers, what principles guided their decisions? Remember that at the time, it was believed that academic libraries grew exponentially. That notion was enshrined by Fremont Rider, who said that this pattern of growth, which he asserted the data showed, was “on the order of a natural law.” Heady talk, that. So, did they make up numbers that were in an exponential series? And if one discovered libraries doubling in size at a constant interval, would that result be an artifact of imputation or an underlying reality of how academic libraries grow?

In any case, we should derive our theories about libraries from the data and not the data from our theories.
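To see why the question about constant doubling intervals matters, consider one possible imputation rule. The sketch below is purely hypothetical: nothing in the Purdue text says the compilers filled gaps this way, and the function names and numbers are mine. It fills a gap between two known values with a constant growth ratio and then computes the doubling time implied by each pair of adjacent years.

```python
import math

def geometric_fill(start, end, gaps):
    """Fill `gaps` missing years between two known values using a constant growth ratio."""
    ratio = (end / start) ** (1 / (gaps + 1))
    return [start * ratio ** k for k in range(gaps + 2)]

def doubling_times(series):
    """Doubling time, in years, implied by each consecutive pair of values."""
    return [math.log(2) / math.log(b / a) for a, b in zip(series, series[1:])]

# Hypothetical library: 100,000 volumes, four unreported years, then 200,000 volumes.
filled = geometric_fill(100_000, 200_000, gaps=4)
times = doubling_times(filled)

print([round(v) for v in filled])
print([round(t, 6) for t in times])
```

Every implied doubling time in the filled stretch comes out the same, because the fill rule put the constancy there. A later analyst who found that regularity in such a stretch would be rediscovering the imputation, not learning anything about the library.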

But maybe these were not a genuine copy of the Purdue data. The text says the data existed as a deck of computer cards. Such a deck was given to Kendon Stubbs at the University of Virginia by Warren Seibert, who was one of the authors of the Purdue studies, so there is reason to believe that this deck was a true copy. I never saw the deck, and I can't remember whether he or I was the one who read the cards onto a computer to work on.

There are 1,276 rows and 17 columns: 58 libraries times 22 years = 1,276 observations. The variable names are those used in Gerould, Purdue, and the ARL data, with exceptions noted below. This version is a superset of the presumed original Purdue data. That is, all the Purdue data are here, but a few extra variables are included to make joining these data with other datasets easier.

Variable names used in the Purdue data
| Variable name | What it means | Notes |
|---|---|---|
| LIBNO | Library Number | This is the Purdue key variable. Note that INSTNO, the ARL key variable, has been added. |
| YEAR | Academic year | Academic years normally span two calendar years. This number is the second of the two years. That is, if it says 51, those data are for the 1951/52 academic year. |
| VOLS | Volumes Held | |
| VOLSADG | Volumes Added, gross | Later ARL would add VOLSADN or volumes added, net |
| EXPLMB | Expenditures for materials and binding | |
| TOTSAL | Total salaries | For professionals? |
| WAGES | | For nonprofessionals? |
| PRFSTF | Professional staff | |
| NPRFSTF | Non-professional staff | |
| TOTEXP | Total expenditures | |
| LOWSAL | Starting professional salary? | |
| TOTSTU | Total students | |
| GRADSTU | Total graduate students | |
| INSTNO | Institution Number | This is the key variable used in all three of these publications and the ARL data at the time. I will have to get that list |
| INAM | Institution Name | |
| MEMBYR | Year the library joined ARL | ARL was founded in 1932. |
| TYPE | Private/State | |
| REGION | ARL Region | |
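The 58 libraries times 22 years layout can be confirmed with a short script. This is a sketch only, assuming the CSV header uses the variable names LIBNO and YEAR as listed above. The sample rows are made up so the example runs without the real file.

```python
import csv
import io
from collections import defaultdict

def panel_shape(csv_file, key="LIBNO", year="YEAR"):
    """Return (row count, {library: number of distinct years}) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    years_by_library = defaultdict(set)
    n_rows = 0
    for row in reader:
        years_by_library[row[key]].add(row[year])
        n_rows += 1
    return n_rows, {lib: len(years) for lib, years in years_by_library.items()}

# Made-up sample: 2 libraries times 2 years = 4 rows.
sample = io.StringIO("LIBNO,YEAR,VOLS\n1,51,1000\n1,52,1100\n2,51,900\n2,52,950\n")
n_rows, years_per_library = panel_shape(sample)

# Balanced means no duplicate library/year rows and the same number of years everywhere.
is_balanced = (
    n_rows == sum(years_per_library.values())
    and len(set(years_per_library.values())) == 1
)
print(n_rows, years_per_library, is_balanced)
```

Run against the real file, the expectation from the description above would be 1,276 rows, 58 distinct values of LIBNO, and 22 years for each of them.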

I have the data here in several formats.

The work was done in SAS, but I no longer have access to SAS. I have here one file in SAS format but can no longer read it.

The Purdue Data

| Dataset name | Notes |
|---|---|
| Purdue.sas7bdat | A SAS format |
| Purdue.csv | CSV file |

December 11, 2022