URLs as references

This past week, I submitted a final draft of an application note on some software I’d written (and am still writing, for that matter), and had it rejected twice because I’d included a URL as a reference. (The first time, I failed to notice that I’d cited PostgreSQL 8.4 with a URL, in addition to Picard.) As both a biochemist and a bioinformatician, I can see both sides of why that would be the case, but it still irked me enough that I thought it worth writing about.

If you look back 30 years, there really wasn’t an Internet, so this wasn’t even an issue on the horizon. You cited non-peer-reviewed material the same way you cited anything else: you gave the author’s name, the date of publication and the publisher – books were books, regardless of who paid to have them published. Publications were all copyrighted by some journal, and scientists read articles in the library. Access to scientific information was restricted to those who had access to universities.

20 years ago, the Internet was a wild frontier, mostly made up of an ever-changing network of modems. What was on one computer might not be there the next time you connected. Hard drives failed, computers disconnected – and no one put anything of great value on bulletin boards.

15 years ago, web pages began to pop up, URLs entered the public consciousness, and editors had to face the issue of what to do about self-published, transient information: ban it. That was the response, as far as I can tell. Why not? The page might not be there two days later, let alone by the time articles went to print. A perfectly reasonable first reaction to something that failed to meet any of the criteria for being a reference.

Just over 10 years ago, we got Google. Suddenly, all of the information on the web was indexable and you could find just about anything you needed. You could before that too, but getting from place to place was a mess. Does anyone remember the Internet Yellow Pages, where URLs were listed for companies? Still, information then had a short shelf life. Even the Wayback Machine archive was young, and information disappeared quickly. Still unsuitable for referencing, really. You could count on companies being there, but we were still in the days when URLs could change hands for a fortune.

5 years ago, social media invaded – now you had to be online to keep up with your friends. But there was also a major shift behind that: bioinformatics went from being just a series of Perl scripts to being composed of major projects. Major projects went from being small team efforts to being massive collections of software. We also saw the adoption of web tools, many of which weren’t published, and probably never will be. We went from dial-up to broadband… we went from miscellaneous computers to data centers. We went from hobbyist software projects to SourceForge. In short, the Internet matured, and the data it held went from being transient to being a repository of far more knowledge than any printed source.

It didn’t, however, become peer reviewed. Many people no longer consider the Internet to be transient, but with major influences like Wikipedia, which is unreliable as a reference at best, we don’t often think of URLs as good references. But how is that any different from books?

Unfortunately, somewhere along the line, I think journal editors confused their initial reason for rejecting URLs (their transient nature) with something else: the lack of peer review. No editor would bat an eye at citing a published book, even if that information was not peer reviewed, but citing Wikipedia seems like such a terrible idea that perhaps the slippery-slope fallacy has reared its ugly head.

For bioinformatics, many of our common tools aren’t built by scientists any more, or if they are, they’re open source: the collaborative work of many people, which means they’re not going to be published. Many of them are useful toolkits that don’t even make sense to publish – but they are available on the web at a fixed address that doesn’t expire. Unlike commercial products, open source projects may die, but they never disappear when they’re hosted at the likes of SourceForge – which means they’re no longer transient.

While common sense and many colleagues tell me to get over it and just “put the URL in the text”, I fail to see why this is necessary. Can’t editors see that the Internet is no longer a collection of random articles?

Hey editors, there’s far more to the Internet than just Wikipedia and Facebook!

(NOTE: Ironically, as I write this, SourceForge is doing upgrades on its website, and some of the projects it hosts have “disappeared” temporarily… but don’t worry, they’ve promised me that they’ll be back shortly.)

5 thoughts on “URLs as references”

  1. Dear fejes,

    First, I wish to thank you for putting interesting material from AGBT 2011 online.

    The main problem with citing intellectual works with uniform resource locators (URLs) — addresses that link things together on the web — is perhaps that their permanence is not guaranteed.

    As such, using these volatile handles to build a knowledge graph will not guarantee the integrity of the overall structure in the long run, which is an undesirable effect for the scientific literature.

    Digital object identifiers (DOIs) are a readily available solution for citing things on the web.

    I am sure you already know about these.

    Example:

    Digital object identifier: doi:10.1093/bioinformatics/btn305

    Here, 10.1093 identifies the publisher (Oxford University Press)

    URL (using the DOI): http://dx.doi.org/doi:10.1093/bioinformatics/btn305

    In turn, querying this URL returns an HTTP status code 302, meaning that the document resides elsewhere:

    http://www.bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btn305
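    For illustration, a minimal sketch (assuming Python’s standard library and network access) of what that resolution looks like in practice: urlopen follows the proxy’s 302 redirect, so the final URL it reports is wherever the publisher currently hosts the article.

        # Resolve a DOI through the doi.org proxy and report where it lands.
        import urllib.request

        response = urllib.request.urlopen("https://doi.org/10.1093/bioinformatics/btn305")
        print(response.geturl())  # the publisher URL the DOI currently redirects to
        print(response.status)    # 200 once the 302 redirect has been followed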

    Although the DOI is a permanent handle, the URL pointing at the publisher’s servers could disappear, should something bad happen to the publisher.

    That is why we have public repositories too:

    http://pubmed.gov/
    http://ukpmc.ac.uk/
    http://pubmedcentralcanada.ca/

    Now, since the U.S. Library of Congress has a Twitter archive, I think it is safe to cite tweets.
    http://www.nytimes.com/2010/05/02/business/02digi.html

    -sebhtml

    • Hi Sébastien,

      Thanks for the comment – I agree with everything you’ve said. However, many software projects also have “permanent” links; that is to say, the “permanence” of the URL depends on the quality of the source rather than on the nature of URLs themselves. For instance, the open source project Picard gives only its URL for citations. Since the project itself is open source and the provider (SourceForge) does not delete abandoned projects, the URL is appropriate to use as a reference.

      I am certainly not proposing that ALL URLs be allowed in references, but that the blanket ban on them is nonsensical. Perhaps the solution is to ask the U.S. Library of Congress to archive all open source projects.

      Cheers!

  2. I’ve seen URLs just included parenthetically in the text of manuscripts and grants, too, e.g. when referring to a company that supplied/will supply a service, or to software, etc. However, some funding agencies absolutely forbid the inclusion of URLs in the proposal, apparently because it gives you an unfair advantage over other applicants (WTF?!). The last time we applied there, we got around the ban by saying “we will use Tool X (see X company website)”. So it’s a totally ludicrous and meaningless rule.

    • Wow… that’s insanity. I really hope they have better reasons than that for it, but somehow, I’m not going to hold my breath waiting to hear them.

  3. Pingback: Fuzzier Logic » Blog Archive » Automatic citation processing with Zotero and KCite
