Saturday, October 04, 2014

Wow I still have a blog

Doing some dusting and look what I found. Wonder what I can do with it these days...

Sunday, November 29, 2009

Pre-Installed repositories versus self built

Not sure how this got past me, but I just realized I failed to discuss the possibility of downloading a pre-installed VM (Virtual Machine) versus building your own. Well, better late than never...


I believe having both options is actually beneficial. If you plan to first build a repository, configure it, and get to know it, you could then have it packaged for new installations. Alternatively, if you have already built a prior version but just need a quick, stand-alone system, you could download a prebuilt one to avoid the needless downtime you would otherwise incur.


A pre-configured system (hopefully built to your needs) is helpful and provides rapid deployment of resources. For example, if you have an event or exhibit rapidly approaching and not much time to create a whole new web presentation, unpacking a boxed repository, like Omeka, will get you up and running within a couple of days. This system should have everything you need to promote and enhance the usability and discoverability of your digital objects.


On the other hand, learning how databases work and the various tools within your repository will ultimately provide greater functionality and a base understanding of what's going on behind the scenes. Why is this important? Long-term maintenance, sustainability, and troubleshooting. Should something go wrong, you'll have a much better understanding of where the problems could lie and how to get your system back up and running. You'll also have a better opportunity to build your own tools and hook into your repository to connect to other services. Building from the ground up also provides the developer with more opportunities for modular construction.


My current technical skills are rudimentary at best as far as network administration and database development go. I require a lot of instruction and reminders to get a fully functional repository up and running, so I certainly see the advantage of having an out-of-the-box solution. In my case, having one to create an initial web presence and make content discoverable is very advantageous. I would then want to build that same repository from scratch, replace the out-of-the-box system with a customized and more functional one when it is ready, and employ the OAI Object Reuse and Exchange (OAI-ORE) protocol to seamlessly migrate the content to the new system.

Tuesday, November 03, 2009

Federated searching through OAI-PMH

Federated searching, conceptually, is the ability to search across repositories (or web pages) to get access to a large group of resources, versus only those resources an individual organization may hold. A useful federated collection is one that returns consistent results quickly. Past federated searches (via page scraping) were almost useless, as inconsistent metadata and overall slowness from the various web sites the crawler might find made results unverifiable. A search on the same keyword only moments later could be entirely different from a previous one. Hierarchical results or rankings of scholarly worthiness were almost impossible to return. Metadata was not normalized, and fields were guessed at by the crawler based on CSS code and other style tags. The advent of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a redirection of federated searching: it creates a standardized system for open-access collections so that harvesters can retrieve, organize, and ultimately index materials to create far more meaningful and faster searches.
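To make the protocol side of this concrete, here is a minimal sketch of what an OAI-PMH interaction looks like. The verb `ListRecords` and the `oai_dc` metadata prefix are genuine parts of the protocol; the endpoint URL and the trimmed sample response below are hypothetical stand-ins for what a real data provider would return.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def list_records_url(base_url):
    """Every OAI-PMH request is just an HTTP GET with a 'verb' parameter."""
    return base_url + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})

def titles(response_xml):
    """Pull dc:title values out of a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [t.text for t in root.iter(f"{{{DC_NS}}}title")]

# A trimmed, hypothetical response in the shape a data provider returns.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Medieval Castles of Wales</dc:title>
          <dc:subject>Architecture</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(titles(SAMPLE))
```

Because every repository answers the same verbs with the same Dublin Core structure, a harvester can index thousands of collections with one small parser like this, instead of guessing at fields from each site's page layout.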

Having a harvesting service that collects information through a standardized methodology will provide results not only with "Google" speed but in a consistent fashion, since the right information is formatted in the correct locations. Another factor to consider is size. Some repositories hold far more data than others, and whether a massive archive of data is better than a smaller, specialized search really depends on your needs, what you plan to do, and how much of a knowledge base your research requires. It also depends on how the harvesting is utilized in a user interface. If the tools are clunky, then users will not go there. Results need to be integrated into something that makes sense to the researcher but also brings the data to you. If you have to hunt through several layers of facets and taxonomies to get to your information, you lose valuable time and will probably look elsewhere for other resources.

For our latest course unit we were asked to search three to four repositories from the Open Archives service providers list and the University of Illinois OAI-PMH Service Provider Registry. Essentially all the searches I performed met these criteria, but the few that were unusable (not included in this article) failed because the original data provider either was no longer participating or had only created an initial instance with no items to be found.

First I searched Citebase Search from Southampton University. Citebase Search provides users with the facility for searching across multiple archives, with results ranked according to many criteria, such as creation date and citation impact. I searched for the keyword "Medieval Castles" and retrieved one result; however, full text was not available from the server. Through the same collection I searched "H1N1," in reference to the current health scare, and received 69 records in 0.153 seconds. The collections represented in the results were from PubMed Central, Virology Journal, and BioMed Central. All were appropriate scholarly collections.

Next, I searched the Sheet Music Consortium at UCLA. The Sheet Music Consortium is a group of libraries working toward the goal of building an open collection of digitized sheet music using OAI-PMH, which provides an access service, via the standardized metadata, to sheet music records at the host libraries. Represented collections come from UC Los Angeles, Indiana University, Johns Hopkins University, and Duke University. I queried the term "ragtime" and found a much better, easier-to-use interface, where the actual repositories were clearly labeled within each browse item. I also noticed an interesting tool allowing the user to annotate citations, although it looks like you have to be signed in to use that feature. I also went through the List Virtual Collections link and noticed it listed which items required a password to view. I found it problematic that the advanced search only lets you search one collection or "all" of them at once. It could be useful if a researcher wanted to search across a large collection of materials from only a couple of repositories while looking for some specific aspect of sheet music. Of course, they could also just search each collection separately if granularity were really an issue.

I then looked at Cornucopia from The Museums, Libraries and Archives Council. Cornucopia is an online database of information about more than 6,000 collections in the UK's museums, galleries, archives and libraries. Cornucopia is very cross-disciplinary and expresses that in its colorful, graphically based splash page. Once you get into the searching, though, you are within what appears to be a standardized, text-based institutional repository listing. For example, I chose subjects, then coins & metals, and then I chose the Medals collection from the Royal Signals Museum. The citation for the collection resolved, but unfortunately the link to their web page was broken, which calls into question how up to date the aggregation is. It appears that all the results I gathered were about the collections but not necessarily the items themselves, which as a student researcher I would find frustrating. I do not think this service was intended for itemized research but rather to point you to where certain types of materials are held. I didn't find it altogether useful without the direct linking to articles found in the other harvested collections.

Finally, I searched the MetaArchive from the Library of Congress. The partner institutions of this project are engaged in a three-year process to develop a cooperative for the preservation of at-risk digital content focusing on the culture and history of the American South. There are about twelve institutions within the providers list, and again I didn't see a way to actually search the archives: there was no search box, only a series of links starting at Electronic Theses and Dissertations (ETD) and Southern Digital Collection (SDC). I clicked on "Collections," got a list of titles, and then chose "The History of Blacksburg, Virginia," which ultimately led me to an actual article. This particular collection, although full of valuable resources, was very time consuming to navigate and seems to have a lot of work left before it is usable. Perhaps this information is being utilized by another search service, making the data far more useful through a standardized search method. Right now, you have to do a lot of hunting and clicking to find things.

So the real key here is the user interface. It appears we've overcome the hurdle of gathering data; now the issue is making that data come together in a meaningful and useful fashion for the user. Researchers need a way to quickly find and compare information and then get that information back out to use or cite as needed.

Tuesday, October 27, 2009

Cataloging when you are not a cataloger

No doubt providing metadata for content is a challenging process. There is what you would consider to be common sense; for example, a journal article about feline leukemia would obviously be placed in the subject area of cats, but what about more granular areas, and to what degree do you include metadata? Do you also put the article in veterinary medicine, pets, and/or cancer research? Fortunately we have metadata specialists, or catalogers, who are experts in taxonomies to handle this, but I am not one of them, and I have to go on what I assume to be the correct fields.


In the case of my trial EPrints repository, I chose to use content I created for the Library Channel, and in doing so I had to work with both the categories and tags we use to organize our materials for navigation on the website. I immediately threw out our categories because they are specific to campus and function, which does not translate to a generalized repository. I did, however, use our tags (which technically are designed as a Web 2.0 form of user-created classification) because they better represent the subjects each program deals with, and they were provided by myself and a real-life librarian who oversees our productions. Those items were then plugged into the default keywords field, since that particular area is not a controlled vocabulary and allowed for greater flexibility.


Secondly, I took the main concepts of the videos I was ingesting (that they are produced at a university, by a library, and deal with specific topics) and then picked from the selection of Library of Congress (LC) subject headings to actually indicate the items' subjects. I believe that some future seeker of information looking for content about library instruction, or the topics these videos deal with, would use this information to find the materials. The looser, more granular topics are in the keywords and can be searched there as well.
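The split described above, controlled subject headings in one field and free-text tags in another, can be sketched as a small validation step at ingest time. The heading list and the sample record below are hypothetical stand-ins, not the actual Library Channel content or the full LC vocabulary.

```python
# A tiny, hypothetical subset standing in for the controlled vocabulary.
LCSH = {"Library science", "Academic libraries", "Information literacy"}

def make_record(title, subjects, keywords):
    """Subjects must come from the controlled vocabulary; keywords are free text."""
    bad = [s for s in subjects if s not in LCSH]
    if bad:
        raise ValueError(f"not in controlled vocabulary: {bad}")
    # Deduplicate keywords; no vocabulary check, so any tag is allowed.
    return {"title": title, "subjects": sorted(subjects), "keywords": sorted(set(keywords))}

rec = make_record(
    "Using the Library Catalog",                        # hypothetical video title
    subjects=["Library science", "Information literacy"],
    keywords=["tutorial", "catalog", "instruction"],    # looser, more granular tags
)
print(rec)
```

The point of the design is that a typo in a subject heading fails loudly at ingest, while the keywords field stays open for whatever granular tags the producers supply.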


I can’t say I was overly concerned about consistency, other than maintaining similar subject fields and using the keywords that were originally used to facet the materials in their original publication interfaces. With such a limited collection this was not a problem; however, if the collection were to grow I would have to be more concerned that each item received the same level of care and consideration as it is ingested. I would also go back and look at trends. For example, I took a look at the items by subject and noticed that, out of six objects, the “Library Sciences” subject heading showed a lower count than expected because one item was missing the heading, and I was able to correct it accordingly.