Welcome to the Library Data Archive

Robert E. Molyneux

This is an archive of data and reports about libraries. Before the broader introduction, I'd like to provide short links to the data available now and those planned, along with documentation and related publications in digital form. A broader discussion of this project follows this section.

Links to library datasets, documentation, and reports

Data available now (all from US Government sources)
Source of data | Type of library | Link to publications | Years | Notes
NCES-IMLS | US Public Library Data | PLDF3 | FY 1987-FY 2020 | This is a longitudinal recompilation of the annual Administrative Entity US public library data from NCES-IMLS. This is a large file. Open source.
NCES-IMLS | US State Summary Public Library Data | PUSUM | FY 1992-FY 2020 | This is a longitudinal recompilation of the annual State Summary/State Characteristics public library data from NCES-IMLS. Open source.
NCES-IMLS | A collection of publications summarizing and reporting on characteristics of libraries from results of the various surveys published by NCES-IMLS about the US Public Library Data. There are also publications about the history of these data series. | Reports | FY 1983-FY 2020 | This is a large collection of publications. Note that currently the earliest publication is about the 1977-78 US public library data collection. There is a large collection of such early publications. Collecting them and converting them to a digital format will be a formidable task. Open source.
NCES-IMLS | US Public Library Data Survey Documentation | Documentation | FY 1987-FY 2020 | The annual publication of these data comprises three series: the State Summary/State Characteristics file [State library data], the Administrative Entity file [public library data], and the Outlet file [branch libraries]. The documentation is for all three of those files. There is no longitudinal file for the outlet data. The documentation for PLDF3 and PUSUM is at those links above. These files used the NCES-IMLS data but rearranged them. Open source.
NCES-IMLS | US public library data--raw data files | Annual data | FY 1973-FY 2020 | These data go back to FY 1973, so they also predate the FSCS era. This earlier series was the LIBGIS (Library General Information Survey). Open source.
NCES | Academic Library Statistics | ALS | 1970-1971 through FY 2012 | Early years are from the Higher Education General Information Survey (HEGIS). Open source.
NCES | School Library/Media Center publications | SLMC | 1974 through 2013 | Early years are from the Library General Information Survey (LIBGIS). Open source.
NCES | Federal Libraries and Information Centers | FLIC | 1994 | Open source.

These data are library data or analysis collected and published by either the National Center for Education Statistics or the Institute of Museum and Library Services. For more on those two agencies, see: NCES-IMLS introduction

These two sets of data are from non-US government sources and are available at the links below.
Source of data | Type of library | Filename and link | Years | Notes
Princeton Compilation | Data from the Princeton Compilation | Princeton | [Academic years] 1919/20-1943/44 | I keyed these data when I was working on The Gerould Statistics. Open source.
Purdue Data | Academic library data series of 58 academic libraries | Purdue | from 1951 | Are these the actual Purdue data? I believe so, but it is a tangled web discussed at the link. Open source.

The following are not open source because I did them for the agencies listed. They are “works for hire.” I would need permission of those agencies to distribute the data. These were based on the infrastructure of the Stubbs-Buxton Cumulated ARL University Library Statistics.

Works for hire
Source of data | Who owns the data? | Type of library | Filename and link to ARL Infrastructure | Years | Notes
Gerould Statistics | ARL | Academic library data begun in 1907/08 | Gerould background. [ARL infrastructure] | [Academic years] 1919/20-1943/44 | I keyed these data when I was working on The Gerould Statistics for the Association of Research Libraries. ARL has produced derivative products and owns these data.
Survey/Compilation | ACRL | Academic: Historically Black Colleges and Universities | HBCU [ARL] | [Academic year] 1988-89 | Not a longitudinal series but one of the collections following the ARL structure.
ACRL members | ACRL | ACRL libraries not in ARL | ACRL [ARL] | 1978/79-1987/88 | Also followed the ARL structure and used the ARL form. Essentially, ARL surveyed (roughly) the largest 100 academic libraries and ACRL surveyed (roughly) the second 100 libraries.
Gerould/ARL | ARL | Research Libraries | Research Library Statistics [ARL] | [Academic years] 1907/08-1987/88 | This was a compilation issued in digital formats, with a guide. It was the first time the Gerould and ARL data were joined in one series.

The Goal of the Library Data Archive

We librarians are not kind to our data. While we organize, store, and preserve human records, sadly, we are not good about preserving or archiving data about libraries.

This archive has a number of types of data about libraries. It has collections of raw data in various digital formats and it has publications about libraries, increasingly in PDF format. The largest part of this archive, currently, is data collected and published by US government entities, although I will be publishing a great deal more that I have collected over the years from both government and private organizations. I hope to find all US government publications about libraries and digitize them. It is a daunting prospect.

There is a subset of these publications: those which published data collected systematically, over a number of years, from a defined population of libraries. The two big categories are public libraries and academic libraries. There are quite a number of such series, but longitudinal files of these data are typically not produced by the issuing agency. Characteristically, these data are issued one year at a time without an explicit infrastructure to bind the data from the various years together. The Association of Research Libraries (ARL) is the notable exception. It is the steward of the longest-running such longitudinal series, which now spans over 100 years.

This archive was started as a result of my experience in recompiling such longitudinal series of library data, work that grew out of my research interests. As mentioned, when systematically collected population library data are compiled and published, they are characteristically published one year at a time, normally without reference to previous or future data, though often (not always) continuing practices from previous years. The major use of such data seems to be comparing a given library with others for budget presentations and related purposes. But data collected for one purpose can be used for others, such as examining trends in library practices and finances. To create a longitudinal file of library data, the various individual annual publications must be found and, these days, converted to digital formats. But what is a “longitudinal file of library data”?

These are data collected and arranged over time so that one can study trends. Given the facts of publication, the annual data must be rearranged with a date field and other accommodations required to present the data accurately as published. The person recompiling these annual data will normally add dates of collection if they are not there and may well add other fields. For instance, the largest file in this Archive is PLDF3, the US Public Library Data File. In the original annual publications, the libraries have a key variable called the “FSCSKey” which, for reasons explained at the link, was not usable in the longitudinal file to do what key variables must do: provide a single variable to identify each entity over time. Libraries change names, addresses, etc., as the years pass, so a key variable is a critical item for a longitudinal file. There is an exceptionally long, intricate, and, regrettably, dull discussion at the link of the solution I developed, which involved creating and adding a new key variable. It was a hard problem. Note, though, that the variables in the original are not changed but new ones are added, so the longitudinal file is a superset of the original with added infrastructure for analysts. The compiler must first do no harm. Happily, most series I have worked on have had less complicated key variable characteristics. They often make up that loss in other ways, though.
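The general shape of such a recompilation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method documented at the link: the field names YEAR, STABLE_ID, LIBNAME, and TOTSTAFF, the crosswalk table, and every value in the sample records are made up for the example. Only the name “FSCSKey” comes from the discussion above.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def build_longitudinal(
    annual: Dict[int, List[Record]],
    crosswalk: Dict[Tuple[int, str], str],
) -> List[Record]:
    """Stack annual records into one list, adding a year and a stable key.

    Original fields are copied unchanged. Two fields are added:
    YEAR and STABLE_ID (both hypothetical names).
    """
    rows: List[Record] = []
    for year in sorted(annual):
        for rec in annual[year]:
            row = dict(rec)  # copy, so the annual record is never modified
            row["YEAR"] = str(year)
            # Hypothetical crosswalk: (year, original key) -> stable key.
            # Fall back to the original key when no entry exists.
            row["STABLE_ID"] = crosswalk.get(
                (year, rec["FSCSKey"]), rec["FSCSKey"]
            )
            rows.append(row)
    return rows


# Made-up sample: one library whose original key changed between years.
annual = {
    1992: [{"FSCSKey": "XX0001", "LIBNAME": "Example Town Library", "TOTSTAFF": "4"}],
    1993: [{"FSCSKey": "XX0099", "LIBNAME": "Example Town Public Library", "TOTSTAFF": "5"}],
}
crosswalk = {(1993, "XX0099"): "XX0001"}

longitudinal = build_longitudinal(annual, crosswalk)
for row in longitudinal:
    print(row["YEAR"], row["STABLE_ID"], row["FSCSKey"], row["LIBNAME"])
```

The point of the design is the one made above: nothing from the original is altered, and the added fields sit alongside the published ones, so an analyst can follow one library across years by the stable key while still seeing exactly what was published each year.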

I have recompiled a number of such series and too often, finding all years is a challenge because they get lost by our library colleagues. After some experience with this fact, I resolved to keep copies of all such compilations I have done and to gather library data where I can find them. That work is to be documented here. Periodically, I look for library data on the Web. These digital files are but a small part of all library data available. There are enormous collections of data publications in paper that have not been converted to digital formats. There is so much that could be done with them but librarians are rarely numerate and the value in these archives is not clear to our colleagues nor is there a critical mass of people skilled in data in our field. I was in the University of Toronto's iSchool library as a collection of such paper copies was being boxed up and moved to storage. Box after box. All that potentially useful information about library trends off to cold storage.

These collections do take up space...and who uses them? Working librarians have many constraints. A historical one in archives is space, and librarians are rarely comfortable with numerical data. This Archive will be expanded to include either data I have found or data I have recompiled, and I would welcome the opportunity to compile some of those data locked in paper. Let me add that some of the oldest numbers we have as a species are library numbers. Libraries are a key to what humans do because libraries are a part of the memory function. We are not the fastest species and there are those who will say we are not the smartest, but we have institutions that organize our memories and what we have learned. Libraries are one of those institutions.

A review of longitudinal publications of library data

The oldest longitudinal recompilation I am aware of is College and University Library Statistics (Princeton, 1947). It is referred to here as the “Princeton Compilation.” In the early 1980s, when I first ran across them, the data were referred to as the “Princeton” data/statistics and the received wisdom was that 1920 was the beginning date of this series. The publication itself is a bound typescript of a longitudinal rearrangement of the annual data sheets from the 1919/20-1943/44 academic years.

In recompiling these data and trying to understand their history, I talked to Haynes McMullen at the (then) School of Library Science at the University of North Carolina, who researched the early history of US libraries and used data in that work. When I walked into his office that day, I thought the series began in 1920, and I asked him if he knew about these data. Professor McMullen reached into his files and pulled out a virtually complete run from the first data (that is, 1907/08) through at least 1920. It turned out he got them while working on his dissertation. Imagine my surprise to discover the series began before 1920. I have found the first publication of these data thought provoking. This copy of the 1907/08 academic year shows 14 libraries and six variables, which have grown into the astonishingly rich ARL data we have today. This copy of that first issue was obtained by Nicola Duval at ARL from the archives at the University of Minnesota to include in the monograph The Gerould Statistics. What a wonderful surprise it was when she obtained the copy and included it in The Gerould Statistics. She and I surprised Kendon with The Gerould Statistics and Nicky surprised me.

Who was James Thayer Gerould? Gerould was the director of the University of Minnesota library in the early part of the 20th century and had an idea. He wrote an article that appeared in Library Journal in 1906: “A Plan for the Compilation of Comparative University and College Library Statistics.” A committee was appointed by ALA and, as far as I can tell, the committee never made a report. Undaunted, Gerould went ahead and started collecting and reporting data. The first year had data from 14 state university libraries from the 1907/08 academic year (linked to in the preceding paragraph) and the series continued in the following years. He moved to Princeton in 1920, where he continued compiling these annual data. After he retired in 1938, the compilation continued at Princeton and these data were commonly referred to as the “Princeton Data” since the history had been lost. Chapter 2 of The Gerould Statistics discusses the details of this compilation with an assessment of it. This series is discussed here further.

This first longitudinal recompilation was published at Princeton with the data from 1919/20 through 1943/44 academic years of an expanding number of US academic libraries. While the data were published annually and could easily have been ordered by year—as all other such series in this Archive are—the compilers rearranged the data by institution. It was a bold undertaking with the available technology. This series is discussed here further with the data in various formats. Chapter 2 of The Gerould Statistics also discusses these data with an assessment of them.

The next one is the Purdue series. I believe I have a copy of these data but I am not sure for reasons discussed here. Again, Chapter 2 of the Gerould Statistics discusses these data in some detail.

The next longitudinal data compilation is the Cumulated ARL University Library Statistics, 1962-63 through 1978-79 by Kendon Stubbs and David Buxton (Washington, ARL: 1981). This is the seminal work and is the foundation of the work I have tried to build on. Kendon is certainly the best data analyst the library field has produced. His data work was in addition to his being the Associate University Librarian at the University of Virginia (UVA). For this project, he took the printed annual university ARL Statistics and, with David Buxton, converted these data to a digital format. Originally, it was on a mainframe and available via computer tape. The digital publication of the ARL data has continued and grown since their work. Moreover, he adduced the principles to be followed in such work. His interest in textual integrity informed many of those principles. The Introduction is worth a quiet reading annually. A bit more on the organization of these data is outlined in the discussion of the ARL data structure. This original structure was adapted to a series of academic data compilations based on data collected using the ARL data collection instruments. The data themselves and their documentation are owned by various agencies, and the discussion here of this structure is an overview to provide a comparison with the structure of the public library data series I have also compiled.

I found the Stubbs-Buxton publication when I was doing research on what turned out to be my dissertation and got intrigued. I thought I would update this publication with more recent data for what I had become interested in. I am skilled in data input, so I keyed the data and checked the published data against the digital Stubbs-Buxton data. I found discrepancies. I wrote him and we talked and resolved those problems. It turns out that data can be corrected after publication and there were errors. What do you do then? You correct the digital copy and it becomes the master copy. Version control is always an issue.
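A check of this kind, comparing a freshly keyed copy against an existing digital copy cell by cell, might look something like the following Python sketch. It is an illustration only and not the procedure actually used: the institution name, the variable names VOLS and STAFF, and all of the values are made up, and each cell is assumed to be identified by institution, year, and variable.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, str, str]  # (institution, academic year, variable)


def find_discrepancies(
    keyed: Dict[Cell, int],
    digital: Dict[Cell, int],
) -> List[Tuple[Cell, Optional[int], Optional[int]]]:
    """List every cell where the two copies disagree.

    A cell present in only one copy is reported with None on the other side.
    """
    diffs = []
    for cell in sorted(set(keyed) | set(digital)):
        a = keyed.get(cell)
        b = digital.get(cell)
        if a != b:
            diffs.append((cell, a, b))
    return diffs


# Made-up values for illustration.
keyed = {
    ("Example University", "1962-63", "VOLS"): 1000000,
    ("Example University", "1962-63", "STAFF"): 120,
}
digital = {
    ("Example University", "1962-63", "VOLS"): 1000000,
    ("Example University", "1962-63", "STAFF"): 102,
    ("Example University", "1963-64", "VOLS"): 1050000,
}

diffs = find_discrepancies(keyed, digital)
for cell, keyed_value, digital_value in diffs:
    print(cell, "keyed:", keyed_value, "digital:", digital_value)
```

A list like this only says where the copies differ. Deciding which value is right still means going back to the printed source or to the compiler, and then recording the correction in the master copy.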

The Library Research Service has graciously offered to house this archive of digital data and reports on U.S. libraries by two agencies of the U.S. government: the U.S. National Center for Education Statistics (NCES) and the U.S. Institute of Museum and Library Services (IMLS). The NCES-sponsored program behind the collection and publication of the public library data was known as the Federal-State Cooperative System (FSCS). See this useful timeline of this program for more information. Having watched this effort from close up, I can say it was an impressive organization that functioned well. IMLS continues a similar program, the Public Libraries Survey, which carries the public library data series forward without interruption.

Sources of Library Data Outside the United States and a Look at Assessment of Libraries

The linked page is a work in progress and is current as of September 2016, although I expect to be delving into the North American library data in detail shortly. I have collected links to data sources I have worked with, so this is a sample I hope to build on. I have also collected these data. It seems that library data tend to disappear without an archiving effort, and that fact was the proximate cause of my starting this exercise about 20 years ago or so. The library world would benefit from an agency that performs the ICPSR function for our data.

While looking at BIX (Der Bibliotheksindex)—the BIX has closed and its Web site is no longer responding as of September 5, 2022,

December 14, 2022