Tuesday, October 27, 2009

Cataloging when you are not a cataloger

No doubt providing metadata for content is a challenging process. There is what you would consider to be common sense: for example, a journal article about feline leukemia would obviously be placed in the subject area of cats, but what about more granular areas, and to what degree do you include metadata? Do you also put the article in veterinary medicine, pets, and/or cancer research? Fortunately we have metadata specialists, or catalogers, who are experts in taxonomies to handle this, but I am not one of them, so I have to go off what I assume to be the correct fields.


In the case of my trial EPrints repository I chose to use content I created for the Library Channel, and in doing so I had to work with both the categories and tags we use to organize our materials for navigation on the website. I immediately threw out our categories because they are specific to campus and function, which does not translate to a generalized repository. I did, however, use our tags (which are technically designed as a web 2.0 style of user-created classification) because those better represent the subjects each program deals with, and they were provided by me and a real-life librarian who oversees our productions. Those tags were then plugged into the default keywords field, since that field is not a controlled vocabulary and allows for greater flexibility.
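
Just to sketch what I mean (the tag list and the cleanup step are hypothetical, not anything EPrints does for you), the mapping from our website tags to the free-text keywords field is really nothing more than this:

```python
# Hypothetical tags pulled from the Library Channel website for one video.
site_tags = ["Library Instruction", "Information Literacy", "tutorials", "information literacy"]

def tags_to_keywords(tags):
    """Normalize the web tags into a single free-text keywords string."""
    seen = []
    for tag in tags:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in seen:
            seen.append(cleaned)
    return ", ".join(seen)

print(tags_to_keywords(site_tags))
# -> library instruction, information literacy, tutorials
```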


Secondly, I took the main concepts of the videos I was ingesting (that they are produced at a university, by a library, and deal with specific topics) and then picked from the Library of Congress (LOC) subject headings to actually indicate the items' subjects. I believe that some future seeker of information looking for content about library instruction, or the topics these videos deal with, would use this information to find the materials. The looser, more granular topics are in the keywords and can be searched there as well.


I can’t say I was overly concerned about consistency beyond maintaining similar subject fields and reusing the keywords that were originally used to facet the materials in their original publication interfaces. With such a limited collection this was not a problem; however, if the collection were to grow I would have to be more concerned that each item receives the same level of care and consideration as it is ingested. I would also go back and look at trends. For example, I took a look at the items by subject and noticed that, out of six objects, the count under the “Library Sciences” subject heading was one lower than it should have been, and I was able to correct it accordingly.
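
A quick way to catch that kind of gap, sketched here with made-up records rather than the actual repository export, is just to count how many items carry each heading:

```python
from collections import Counter

# Hypothetical stand-ins for the six ingested videos and their subject headings.
records = [
    {"title": "Video 1", "subjects": ["Library Sciences", "Education"]},
    {"title": "Video 2", "subjects": ["Library Sciences"]},
    {"title": "Video 3", "subjects": ["Library Sciences", "Education"]},
    {"title": "Video 4", "subjects": ["Education"]},  # the odd one out
    {"title": "Video 5", "subjects": ["Library Sciences"]},
    {"title": "Video 6", "subjects": ["Library Sciences"]},
]

counts = Counter(subject for record in records for subject in record["subjects"])
for subject, n in counts.most_common():
    print(f"{subject}: {n} of {len(records)} items")
```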

Tuesday, October 13, 2009

Stats and Digital Collections

This week I'm just thinking out loud about what to do about stats and digital collections, particularly if your collection is dispersed throughout the web and not necessarily housed locally.

The Association of Research Libraries (ARL) stats have come up once again this year. They are a big deal if you want to maintain your standing for bragging rights and accreditation, but as usual they are a challenge. Historically, ARL stats were about how many books and journals your library held. Now, entering the digital realm, we are asked how many files we have. The challenge begins in one respect because we never know what they are going to ask for, since the questions can change from year to year. We then have to start throwing these numbers together in a near panic. But how meaningful are they? How do we determine what counts as a file (and how many there really are)?

Currently we disseminate a lot of material via web services such as the Internet Archive, YouTube, Vimeo, iTunes U, and of course provide syndication through our blog and Feedburner (for podcast optimization). We also use Google Analytics to track statistical information about viewership and hits on our web pages and catalog. However, that only works for things we have control over, have already hooked into Google Analytics, and have let run over time.

Each of these services provides its own specific form of statistical feedback. Google owns YouTube and Feedburner, and both provide dynamic graphic maps of users and views. Feedburner shows you subscriber info, such as what tools they use to subscribe (Google Feedfetcher, iTunes, Sage, etc.), and gives you a rough sketch of your traffic from the past day to the past year. Vimeo gives you a basic view count and, like YouTube, will even tell you who is subscribed to your materials through their interface.

The Internet Archive also provides a download count on your item page, but that doesn't always reflect the number you see in the browse interface. I've noticed that the numbers for files that filter from the Internet Archive through Feedburner's RSS engine don't always match either. So if you use an outside party to host and assist in serving your materials, how accurate are their statistics? Are they padding or hiding results? Is the statistical analysis outdated?

On a simple file count issue, how do you handle derivatives? On a basic local level you might have an archival TIFF or JP2 file with a corresponding set of transport versions. Do you count them all or just the ones made public? If the files are essentially the same content but modified for playback, are they counted together or separately? Our files loaded externally to the Internet Archive also have format derivatives created by the IA itself. Do we count each of those, and what about the separate files we additionally load to iTunes U?
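
One way I could frame that question is to count intellectual items separately from the files that represent them. Here's a rough sketch, assuming (hypothetically) that derivatives share a base filename with their master:

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical file listing: one archival master plus derivatives per program.
files = [
    "lecture01_master.mov", "lecture01.mp4", "lecture01.ogv", "lecture01.mp3",
    "lecture02_master.mov", "lecture02.mp4", "lecture02.ogv",
]

items = defaultdict(list)
for name in files:
    stem = PurePosixPath(name).stem.replace("_master", "")
    items[stem].append(name)

print(f"{len(items)} intellectual items, {len(files)} files total")
for item, versions in sorted(items.items()):
    print(f"  {item}: {len(versions)} files")
```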

We were recently asked about a particular digital collection we put online several years ago as part of a collaborative grant with GWLA. That would have been no problem, except that after we put in our share of content (housed on a local server) we did nothing to promote it and certainly were not tracking the searches within our collection. There was no way for us to tell what was going on with it. Our catalog had a link to the material, but it only pointed to the consortium's main web page. (Note: the consortium uses OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting, to pull our materials into their search engine.) We can see if anyone put in a search for the materials via our Google Analytics tracking of searches in the catalog, but as far as that goes it shows a big zero. So as far as we can tell the collection is a vastly unused assortment of documents.
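
For what it's worth, OAI-PMH harvesting like the consortium does is just an HTTP request with a verb parameter against the repository's base URL; a minimal sketch with a placeholder URL, not our actual endpoint:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint; a real harvester would use the repository's actual base URL.
BASE_URL = "http://repository.example.edu/oai"

params = {
    "verb": "ListRecords",       # standard OAI-PMH verb for pulling full records
    "metadataPrefix": "oai_dc",  # simple Dublin Core, the required baseline format
}

with urlopen(BASE_URL + "?" + urlencode(params)) as response:
    xml = response.read().decode("utf-8")

print(xml[:500])  # the consortium's harvester parses this XML and indexes the metadata
```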

On a related note, we are also unable to retrieve statistics for the files we house in iTunes U. The iTunes U servers are owned and operated by Apple, but the client brands our "channel" with a specific look and feel developed by ASU programmers. Unfortunately they have not had dedicated resources to hook into the APIs that would give us stats.

Statistical analysis of digital collections is far, far more complex than counting bound materials on a shelf and the number of times they are checked out. We have to take into consideration servers, hits, views, downloads, searches, and file derivatives just to skim the surface. What does it all really mean, and what information really gives you the best feedback about your collections and their usefulness? The lesson learned here is that when you are designing and developing your online collections, you should assume that at some point you are going to be asked for stats, and you had better have a system in place to provide real data. Of course, anticipating exactly what those questions will be is key.

Tuesday, October 06, 2009

DSpace on my VM


I actually installed DSpace this week and created a couple of collections. Unfortunately the items are video and audio files and they don't play in the interface. I will probably keep this instance and play with it over time to see if I can get them to display properly. Currently I'm assigning test users and working with the metadata as well, but here's the first look: