Tuesday, November 03, 2009

Federated searching through OAI-PMH

Federated searching is, conceptually, the ability to search across repositories (or web pages) to reach a large group of resources, versus only those resources an individual organization may hold. A useful federated collection is one that returns consistent results quickly. Past federated searches (via page scraping) were almost useless: inconsistent metadata and the overall slowness of the various web sites the crawler might find made results unverifiable. A search on the same keyword only moments later could return entirely different results than the previous one. Hierarchical results or rankings by scholarly worthiness were almost impossible to return. Metadata was not normalized, and fields were assumed by the crawler based on CSS code and other style tags. The advent of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a redirection of federated searching: it creates a standardized system for open access collections to retrieve, organize, and ultimately index materials, making searches faster and far more meaningful.
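
To make the contrast with page scraping concrete, here is a minimal sketch in Python of what a harvester works with. It builds a ListRecords request URL and reads Dublin Core fields out of a response. The verb, the oai_dc metadata prefix, and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the base URL, the sample record, and the function names are invented for illustration and do not belong to any repository discussed here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Title</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a repository's base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return one dict per record, with repeatable Dublin Core fields as lists."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        titles = [(e.text or "").strip() for e in record.iterfind(".//dc:title", NS)]
        creators = [(e.text or "").strip() for e in record.iterfind(".//dc:creator", NS)]
        records.append(
            {"identifier": identifier, "title": titles, "creator": creators}
        )
    return records


if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    for rec in parse_records(SAMPLE_RESPONSE):
        print(rec)
```

The point is that the title and creator arrive in named fields, so nothing has to be guessed from CSS or layout the way a scraper would.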

A harvesting service that gathers information through a standardized methodology will provide results not only with "Google" speed but also in a consistent fashion, because the right information is formatted in the correct locations. Another factor to consider is size. Some repositories hold far more data than others, and whether a massive archive is better than a smaller, specialized one really depends on your needs: what you plan to do and how much of a knowledge base your research requires. It also depends on how the harvested data is presented in a user interface. If the tools are clunky, users will not go there. Results need to be integrated into something that makes sense to the researcher and brings the data to you. If you have to hunt through several layers of facets and taxonomies to get to your information, you lose valuable time and will probably look elsewhere for resources.

For our latest course unit we were asked to search three to four repositories from the Open Archives service providers list and the University of Illinois OAI-PMH Service Provider Registry. Essentially all the searches I performed met these criteria. The few services that were unusable (not included in this article) failed because the original data provider either no longer participated or had created only an initial instance with no items to be found.

First I searched Citebase Search from Southampton University. The Citebase Search service lets users search across multiple archives, with results ranked according to many criteria, such as creation date and citation impact. I searched for the keyword "Medieval Castles" and retrieved one result. However, the full text was not available from the server. Through the same collection I searched for "H1N1", in reference to the current health scare, and received 69 records in 0.153 seconds. The collections represented in the results were PubMed Central, Virology Journal, and BioMed Central. All were appropriate scholarly collections.

Next, I searched the Sheet Music Consortium at UCLA. The Sheet Music Consortium is a group of libraries working toward the goal of building an open collection of digitized sheet music using OAI-PMH, which provides an access service via the standardized metadata to sheet music records at the host libraries. The collections represented are UC Los Angeles, Indiana University, Johns Hopkins University, and Duke University. I queried the term "ragtime" and found a much better, easy-to-use interface where the actual repositories were clearly labeled within each browse item. I also noticed an interesting tool allowing the user to annotate citations, although it looks like you have to be signed in to use that feature. I also went through the List Virtual Collections link and noticed it listed which items required a password to view. I found it problematic that the advanced search could only search one collection or "all" at the same time. Choosing a subset would be useful if a researcher looking for some specific aspect of sheet music wanted to search across a large body of materials from only a couple of repositories. Of course they could also just search each collection separately if granularity was really an issue.
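
The kind of search I had in mind, across a chosen couple of repositories rather than one or "all", is simple once the harvested records share one format. The sketch below is in Python and is only an illustration: the record titles are invented placeholders, the function and field names are hypothetical, and it says nothing about how the Sheet Music Consortium actually stores or searches its data.

```python
# Invented placeholder records standing in for a harvested, normalized index.
HARVESTED = [
    {"title": "Example Ragtime Two-Step", "repository": "UC Los Angeles"},
    {"title": "Placeholder Ragtime Waltz", "repository": "Indiana University"},
    {"title": "Sample Ragtime March", "repository": "Duke University"},
    {"title": "Sample Parlor Song", "repository": "Duke University"},
]


def search(records, keyword, repositories=None):
    """Return records whose title contains keyword (case-insensitive).

    If repositories is given, keep only records from those repositories;
    if it is None, search every repository.
    """
    wanted = None if repositories is None else set(repositories)
    needle = keyword.lower()
    return [
        r
        for r in records
        if needle in r["title"].lower()
        and (wanted is None or r["repository"] in wanted)
    ]


if __name__ == "__main__":
    # Search only two of the repositories instead of one or "all".
    for hit in search(HARVESTED, "ragtime", ["UC Los Angeles", "Duke University"]):
        print(hit["repository"], "-", hit["title"])
```

Because every record carries its repository in the same field, limiting the search to a subset is a filter and not a separate search per collection.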

I then looked at Cornucopia from The Museums, Libraries and Archives Council. Cornucopia is an online database of information about more than 6,000 collections in the UK's museums, galleries, archives and libraries. Cornucopia is very cross-disciplinary and expresses that in its colorful, graphically based splash page. The initial page is very text based but once you get into the searching you are within what appears to be a standardized, text-based Institutional Repository listing. For example, I chose subjects, then coins & metals, and then the Medals collection from the Royal Signals Museum. The citation for the collection resolved, but unfortunately the link to their web page was broken, which calls into question how up-to-date the aggregation is. It appears that all the results I gathered were about the collections but not necessarily the items themselves, which as a student researcher I would find frustrating. I do not think this service was intended for item-level research but rather to point you to where certain types of materials are held. Without direct linking to articles, as in the other harvested collections, I didn't find this altogether useful.

Finally I searched the MetaArchive from the Library of Congress. The partner institutions of this project are engaged in a three-year process to develop a cooperative for the preservation of at-risk digital content focusing on the culture and history of the American South. There are about twelve institutions within the providers list, and again I didn't see a way to actually search the archives. There was no search box, only a series of links starting at Electronic Thesis and Dissertations (ETD) and Southern Digital Collection (SDC). I clicked on "Collections", got a list of titles, and then chose "The History of Blacksburg, Virginia", which ultimately led me to an actual article. This particular collection, although full of valuable resources, was very time consuming to navigate and seems to have a lot of work left before it is usable. Perhaps this information is being used by another search service that makes the data far more useful through a standardized search method. Right now, you have to do a lot of hunting and clicking to find things.

So the real key here is the user interface. It appears we've overcome the hurdle of gathering data; the issue now is making that data come together in a way that is meaningful and useful to the user. Users need a way to quickly find and compare information and then get that information back out to either use or cite as needed.
