Friday, June 1, 2007

Gimme That Scientific Paper!


What irks me about references cited on web pages is that you can't directly get the PDF, or at least immediately search for it, unless the page author has explicitly put a link to that paper. Even when a page author has taken the time to construct these links, they often point to a 404 (page doesn't exist) because the link is no longer working. In the digital age, surely this sort of thing can be done more effectively. Well, thanks to Rod Page, who has developed a reference parsing tool in his bioGUID suite of applications, and to some nifty Flash-based, cross-domain AJAX that I used, this is now possible. For a taste of this, have a peek at the references page in The Nearctic Spider Database.

Now, what the heck is going on here? Glad you asked.

The chain of events is, I think, very cool.

First, I simply wrap each reference in a span element with a unique, sequentially numbered id, then put an empty, similarly numbered "holding" span right after it:

<p><span id="bioGUIDref_1">This is the first full reference.</span> <span id="bioGUIDres_1"></span></p>

That's pretty easy for anyone with a rudimentary knowledge of HTML.

Second, I put a reference to a JavaScript file in the page header whose functions initialize when the page finishes loading. That script counts all the references with the simple mark-up shown above and puts little search icons in the holding spans. Of course this script can be hosted elsewhere and anyone can put it in their page headers.
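To make that second step concrete, here is a minimal sketch of what such a counting routine could look like. It is not the actual script: the function name initSearchIcons and the icon path in the usage note are made up for illustration, and only the bioGUIDref_ and bioGUIDres_ id prefixes come from the mark-up above.

```javascript
// Sketch only: walk the sequentially numbered references and drop a search
// icon into each holding span. "doc" is the page's document object.
function initSearchIcons(doc, iconUrl) {
  let count = 0;
  for (let n = 1; ; n++) {
    const ref = doc.getElementById("bioGUIDref_" + n);
    const holder = doc.getElementById("bioGUIDres_" + n);
    // Ids are numbered sequentially, so the first gap ends the scan.
    if (!ref || !holder) break;
    const icon = doc.createElement("img");
    icon.src = iconUrl; // hypothetical icon location supplied by the caller
    icon.alt = "Search for this paper";
    holder.appendChild(icon);
    count++;
  }
  return count; // number of references that received an icon
}
```

In a page this would be called once loading has finished, for example from a window load handler with something like initSearchIcons(document, "search.png").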

Third, I put this mark-up at the bottom of the page to initialize a Flash item, which coordinates some cross-domain search functions via Rod's reference parsing API (more on that below):

<script type="text/javascript">FlashHelper.writeFlash();</script>

So, for the end user seeing a list of references with these little search icons stuck at the end of each of them as such:

Agnarsson, I. 2004. Morphological phylogeny of cobweb spiders and their relatives (Araneae, Araneoidea, Theridiidae). Zool. J. Linnean Soc. 141: 447-626.

...it's a simple matter of clicking each in turn to perform a real-time search for individual papers of interest [Disclaimer: of course the above example doesn't work here in this blog post]. Once the search comes back, the icon changes to show one of four outcomes, each with its own icon: a freely available PDF was found (yay!); the paper can be found via other means (subscription may be required); the reference was successfully parsed and searched but nothing was found; or the reference was not successfully parsed and consequently a search couldn't effectively be constructed.
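The choice of icon boils down to a small decision function. The sketch below is only an illustration: the result object with its parsed, pdfUrl and url fields is a hypothetical shape, not what Rod's service actually returns, and the state names are made up.

```javascript
// Sketch only: map a (hypothetical) search result to one of the four outcomes.
// Assumed result shape: { parsed: boolean, pdfUrl?: string, url?: string }
function iconStateFor(result) {
  if (!result.parsed) return "unparsed";  // citation could not be split up
  if (result.pdfUrl) return "pdf";        // freely available PDF
  if (result.url) return "link";          // found, but may need a subscription
  return "not-found";                     // parsed and searched, nothing found
}
```

Checking the parse failure first matters, because an unparsed citation never gets as far as a search.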

The really awesome part of this whole system is that it is laughably easy for anyone with a basic knowledge of HTML (no complex coding required!) to duplicate these functions on their authored web pages. But let's first have some background on how this works.

This cross-domain AJAX querying system uses Flash. Julien Couvreur worked with Jason Levitt (from Yahoo) to create an XMLHTTP transport that uses Flash. You can read about this in Julien's blog, Curiosity is Bliss, where he also has a nice demo that produces search results from Yahoo's ImageSearch API using this technique. What Rod had to do was first get his reference parsing script to produce XML, then create a simple crossdomain.xml document and dump it in the root folder for his domain. Julien points out a potential security issue with these Flash-based cross-domain search queries, so at the moment Rod only has The Canadian Arachnologists' domain in his crossdomain.xml document.

An end user clicking initiates a cross-domain request to Rod's machine. The reference is parsed in Rod's Perl script (i.e. split up into Author, Year, Title, Publication, Pages, etc. as required for OpenURL) then sent off to CrossRef and elsewhere to obtain search results. This system works fantastically well for modern publications that have bought into CrossRef's DOI system (note: handles are also working in Rod's Perl scripting) but what about all those scientific societies that produce online PDFs but haven't bought into DOIs?
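To give a feel for what "split up as required for OpenURL" means, here is a rough sketch in JavaScript. It is emphatically not Rod's Perl parser: the regular expression is a naive stand-in that only handles citations shaped like "Authors. Year. Title. Journal Volume: First-Last.", the function names are made up, and the resolver address in the usage note is a placeholder. The query keys assume the OpenURL 1.0 key/value format for journal articles.

```javascript
// Naive pattern for "Authors. Year. Title. Journal Volume: First-Last."
// Illustration only; real citations are far messier than this.
const CITATION = /^(.+?)\s+(\d{4})\.\s+(.+?)\.\s+(.+?)\s+(\d+):\s*(\d+)-(\d+)\.?$/;

// Split a citation string into its constituent bits, or return null.
function parseCitation(text) {
  const m = CITATION.exec(text.trim());
  if (!m) return null;
  const [, authors, year, title, journal, volume, spage, epage] = m;
  return { authors, year, title, journal, volume, spage, epage };
}

// Build an OpenURL query from the parsed bits (assumed OpenURL 1.0 KEV keys).
function toOpenUrl(baseUrl, c) {
  const params = new URLSearchParams({
    "ctx_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft.aulast": c.authors.split(",")[0].trim(), // first author's surname
    "rft.date": c.year,
    "rft.atitle": c.title,
    "rft.jtitle": c.journal,
    "rft.volume": c.volume,
    "rft.spage": c.spage,
    "rft.epage": c.epage
  });
  return baseUrl + "?" + params.toString();
}
```

Feeding the Agnarsson reference from above through parseCitation and then toOpenUrl("https://resolver.example.org/openurl", parsed) gives a query a resolver could act on. A citation that doesn't fit the pattern comes back as null, which corresponds to the "not successfully parsed" outcome.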

For smaller societies and publications like the Journal of Arachnology, Rod unfortunately must scrape the URLs to their digital reprints. [Aside: JoA does have DOIs, but these are issued from BioOne, and an end-user accessing JoA articles via BioOne would of course be presented with a pay-per-view screen - sucky.] In these cases, the XMLHTTP system I have that sends citations to Rod's machine might return an erroneous link to a PDF if the source URL was changed and Rod hadn't yet updated his listings. But, as long as societies agree not to mess with their URL structure, the conduit to their PDFs remains viable. This is most certainly something The Encyclopedia of Life can coordinate.

Here then is a very slick little system that is easy for web page authors to implement and intuitively obvious for end-users. A potential pitfall worth mentioning is poorly constructed citations. Rod's algorithms that split a citation into its constituent bits are only as good as what goes in. In other words, at my end for example, if the "could not be parsed" icon is returned to the end-user, a digital version of the paper might still exist somewhere - I just didn't construct my citation well enough for Rod's algorithm to split the bits into an OpenURL format. So, I am contemplating adding a second icon to sit alongside that one such that an end-user who knows the paper can be found online can send me a quick note/poke to tell me that I need to re-write the citation.

If you want some background on what Rod did at his end, head over to his blog where he wrote about OpenURL here and here.

1 comment:

David Shorthouse said...

I have been cleaning up this tool as much as I can. In particular, what I realized I could do is dynamically create the "holding" span for all the little icons rather than have a page author create it ahead of time. That significantly reduces the mark-up they have to put into their pages and now really makes this easy. For example, this is all that is required for individual references:

<p><span id="bioGUIDref_1">This is the full reference.</span></p>

The JavaScript now does the work.
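For the curious, creating the holding spans on the fly can be sketched in a few lines. This is an illustration under assumptions rather than the script itself: the function name addHoldingSpans is made up, and only the bioGUIDref_ and bioGUIDres_ id prefixes come from the mark-up in the post.

```javascript
// Sketch only: create a "holding" span after each numbered reference span.
function addHoldingSpans(doc) {
  const created = [];
  for (let n = 1; ; n++) {
    const ref = doc.getElementById("bioGUIDref_" + n);
    if (!ref) break; // numbering is sequential, so the first gap ends the scan
    const holder = doc.createElement("span");
    holder.id = "bioGUIDres_" + n;
    // Insert right after the reference; with a null nextSibling this appends.
    ref.parentNode.insertBefore(holder, ref.nextSibling);
    created.push(holder.id);
  }
  return created; // ids of the spans that were added
}
```

Called as addHoldingSpans(document) once the page has loaded, it leaves the page in the same state as the hand-written mark-up from the original post.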