Thursday, July 16, 2009

Cataloging Collections, or a Catalog of Collections?

Q. What's the next best thing to cataloging your collection? A. Making a catalog of your collections. Confused? Well don't be. It's actually quite a simple idea, and it goes like this.

We in the natural science collections world face an enormous challenge. There are probably around 500,000,000 natural history specimens in US museum collections alone, perhaps 2 billion worldwide. They constitute a potentially enormous reservoir of data for research usage, but in order to be used they need to be cataloged.

In principle, cataloging is a straightforward process. You are applying a unique identifier, usually a number or a combination of numbers and letters, to a specimen. You apply the same number to the data that are associated with the specimen. Now you have a link between the specimen and its associated data that you can use either to put the specimen in context, or to provide physical evidence for the reality of the data (when a specimen is used like this, we call it a "voucher specimen" - it is substantiating or "vouching" for the truth of the data).

Of course, it's more complicated than that and if you get it wrong there are a number of things that can happen, none of them good - if you go and take a look here, you can find out what some of those things are. Plus, there are a bunch of things involved in the actual cataloging of an object. It might involve trying to figure out the modern spelling of a phonetically rendered name of a village in the Belgian Congo, which is only found in a faded, handwritten, 100-year-old letter from the collector. Or simply finding the specimen and writing a catalog number on it.

The upshot of all this is that cataloging takes a surprising amount of time. On a good day, an experienced collections assistant might do 50 specimens. Assuming that they actually get a day to do cataloging - as a long-term, "background" activity, cataloging usually takes second place to whatever the current priority is, be it dealing with a visitor, assisting with a public event, or moving collections as part of a construction project (all activities that my staff have dealt with this week, by the way). Cataloging is also a choke point in the collections workflow, because despite considerable advances in technology it's almost impossible to automate completely.

So how many of those theoretical 2 billion specimens are cataloged and available on-line? The answer is that I don't know, and I suspect no-one really does, but I'm going to take a punt and guess that it's less than 10%. If that's true, and I suspect I may be being optimistic, then we're looking at a mountain of work before global natural science collections reach their full potential. Are museums up to climbing this mountain? My feeling is that they're not.

To be fair, they never have been. For natural history collections there is an enormous asymmetry of effort between the work required to collect specimens and the work needed to curate them. It's comparatively easy, for example, to use insecticide foggers to sample thousands, if not tens of thousands, of insect specimens from the rainforest canopy in a single day. But to process, sift, identify, catalog, pin, and house them can take years and the efforts of an army of trained collections staff. Museums have multi-faceted missions that encompass a wide range of activities in addition to collections care, but even if they were to devote all their resources to the management of their collections my guess is that they still wouldn't have enough staff.

Of course, there are tricks that can make the workload more manageable. One is to catalog by specimen lots. A specimen lot is a group of specimens collected at the same place and time. They are usually made up of the same organism. It's a good way to catalog groups like insects or planktonic animals and plants that are collected in large quantities. A single specimen lot can contain several thousand, or even tens of thousands of individuals under the same catalog number. Start doing this, and that 2 billion specimen number begins to fall to something a lot more manageable. Even so, it's still an enormous challenge.
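To put a rough number on "enormous," here is a back-of-envelope sketch in Python using the figures quoted above: 2 billion specimens worldwide, my guess of at most 10% cataloged, and 50 specimens per person on a good day. The 250 working days per year is purely my assumption for illustration, and the result counts individual specimens, so cataloging by lots would bring it down by some amount I can't estimate.

```python
# Back-of-envelope estimate of the cataloging backlog.
# Figures marked "from the post" are guesses quoted in the text, not measurements.
WORLDWIDE_SPECIMENS = 2_000_000_000   # from the post: "perhaps 2 billion worldwide"
PERCENT_CATALOGED = 10                # from the post: an optimistic upper guess
SPECIMENS_PER_DAY = 50                # from the post: one assistant on a good day
WORKING_DAYS_PER_YEAR = 250           # ASSUMPTION: not stated anywhere in the post


def uncataloged(total, percent_cataloged):
    """Number of specimens still waiting to be cataloged (integer arithmetic)."""
    return total * (100 - percent_cataloged) // 100


def person_years(specimens, per_day, days_per_year):
    """Person-years needed if every working day were a good cataloging day."""
    return specimens / per_day / days_per_year


backlog = uncataloged(WORLDWIDE_SPECIMENS, PERCENT_CATALOGED)
years = person_years(backlog, SPECIMENS_PER_DAY, WORKING_DAYS_PER_YEAR)
print(f"{backlog:,} specimens, roughly {years:,.0f} person-years")
# prints: 1,800,000,000 specimens, roughly 144,000 person-years
```

Since real staff rarely get a full day of uninterrupted cataloging, this is best read as a floor rather than a forecast.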

So if museums can't catalog everything, then perhaps they can catalog what's needed most urgently. Ideally, this should be driven by demand - researchers should determine priorities for cataloging. But to do that, they need to know what's there to be cataloged. And that brings us back to my rather opaque opening paragraph.

At last week's SPNHC meeting at Leiden, I attended a workshop given by Walter Berendsohn (Berlin) and James Macklin (Harvard) on setting priorities for specimen digitization. Under discussion was a proposal from Berendsohn and co-workers via the Global Biodiversity Information Facility (GBIF) to set priorities for specimen digitization (cataloging and databasing) that would be based on taxonomic and geographic criteria. What they were talking about was collecting collections metadata - not data from the collections, but data about the collections - to decide how limited resources should be targeted.

Suppose I'm working on a project that requires a bunch of distribution data for bird species from New Guinea. The idea that Berendsohn & Macklin were proposing in the workshop was that I could search an on-line database that would tell me where all the major collections of New Guinea birds are located, regardless of whether or not they've been cataloged and made available on-line. I find that there's an exceptional, but uncataloged collection of New Guinea birds in a museum in Holland. Not only that, but the database tells me that if I pay them 10 Euros a specimen they will do a priority, gold-standard cataloging job for me, including full georeferencing of all localities, etc. Now I can write a grant application including this cost and the museum gets some much needed support for cataloging.

Now obviously there are a few questions that spring to mind, the first of which is how do you make that initial database of collections metadata? Collections databases tend to be centered on information about specimens or specimen lots. A database that stored information about collections rather than specimens would have to be a new creation.

Then there's the question of how granular the data will be. Is it enough for me just to know that a museum has birds from New Guinea, or do I need to know what Families of birds are present, and from what province? The more detailed the data become, the more useful they are, but the more time and labor is required to capture them. At some point, you have to wonder why you don't just catalog the collection and have done with it.
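To make the granularity question concrete, here is a minimal Python sketch of what a collection-level record and search might look like. Everything in it is hypothetical: the `CollectionRecord` fields, the `find_collections` function, and the three museums are invented for illustration and do not describe the GBIF proposal or any existing database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CollectionRecord:
    """Hypothetical metadata about one collection (not about its specimens)."""
    institution: str
    taxon: str                       # coarse level, e.g. "birds"
    region: str                      # coarse level, e.g. "New Guinea"
    cataloged: bool
    families: List[str] = field(default_factory=list)   # optional finer detail
    price_per_specimen_eur: Optional[float] = None       # advertised rate, if any


def find_collections(records, taxon, region, family=None):
    """Return records matching taxon and region, optionally narrowed by family.

    A record with no family list cannot satisfy a family query, which is the
    cost of coarse metadata: the collection may hold the family, but the
    database has no way of saying so.
    """
    matches = []
    for rec in records:
        if rec.taxon != taxon or rec.region != region:
            continue
        if family is not None and family not in rec.families:
            continue
        matches.append(rec)
    return matches


# Invented example data, for illustration only.
records = [
    CollectionRecord("Hypothetical Museum A", "birds", "New Guinea",
                     cataloged=False, families=["Paradisaeidae"],
                     price_per_specimen_eur=10.0),
    CollectionRecord("Hypothetical Museum B", "birds", "New Guinea",
                     cataloged=True),
    CollectionRecord("Hypothetical Museum C", "insects", "New Guinea",
                     cataloged=False),
]

coarse = find_collections(records, "birds", "New Guinea")
fine = find_collections(records, "birds", "New Guinea", family="Paradisaeidae")
print([r.institution for r in coarse])  # Museums A and B
print([r.institution for r in fine])    # Museum A only
```

The coarse query finds both bird collections, but the family-level query silently drops Museum B because nobody recorded its families. That is the trade-off in miniature: finer searches are only as good as the extra detail someone took the time to capture.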

And who builds and pays for this database? Museums could reasonably argue that while they're capturing collections metadata for this project they're not doing other things that make collections accessible, like cataloging specimens, rehousing, processing loans, or helping visitors. It can also seem like a redundant activity, because at some level museums do know what's in their collections; the issue is whether they know enough detail for the purposes of users. Ultimately it's likely to be a compromise - more than museums want to have to provide on their own dollar, but less than users would ideally want.

Finally, the process of advertising rates and charging hard cash for cataloging specimens would move museums explicitly into the role of service providers, something that many curatorial staff that I work with would be distinctly leery of doing. There is a move afoot among some advocates in our field to recast natural history collections as an enormous, globally-distributed research support facility, which would fit the pay-as-you-catalog model. There are certainly opportunities here, but there are also pitfalls. However, the sun is shining and my daughter wants me to take her blueberry picking, so this will have to wait for another post.
