Data emancipation

Readers of my blog know that I am an advocate for Open Data, whereby scientists permit the widespread distribution of data without restrictions. Data should be available to anyone at any time at no cost without any legal (i.e., copyright or IP) restrictions. This will enhance our abilities to follow up on research and reuse the data in whatever way we wish. In particular, reuse of data should be seamless and lossless.

I have noted many times that the Supporting Materials in today’s journals are far from ideal. Often authors do not include data at all! Sometimes the data is corrupted, especially if it is deposited solely as a pdf, which loses almost all semantic information about the data. Unfortunately, data is very rarely deposited in a form that is readily reusable. For example, I make use of 3-D coordinates of molecules for this blog, and these are invariably deposited as simple text within a pdf. I then have to copy and paste this data into a new file formatted for use in a molecular viewer of my choice (for me, typically GaussView or Avogadro).
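The copy-and-paste step can at least be scripted. Here is a minimal Python sketch, assuming the pasted text has one atom per line in the form "element x y z"; real supporting materials vary widely, so the pattern would likely need adjusting for each paper. The function name text_to_xyz is hypothetical. The output is the plain XYZ format (atom count, a comment line, then one line per atom), which many molecular viewers, Avogadro among them, can read.

```python
import re

# One atom per line: element symbol followed by three decimal numbers.
# This pattern is an assumption about how the coordinates were typeset.
ATOM_LINE = re.compile(
    r"^\s*([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)


def text_to_xyz(pasted_text, comment=""):
    """Convert coordinate text pasted from a pdf into XYZ-format text."""
    atoms = []
    for line in pasted_text.splitlines():
        match = ATOM_LINE.match(line)
        if match:  # skip titles, page numbers, and other non-coordinate lines
            element, x, y, z = match.groups()
            atoms.append((element, float(x), float(y), float(z)))

    # XYZ format: atom count, a comment line, then one line per atom.
    lines = [str(len(atoms)), comment]
    for element, x, y, z in atoms:
        lines.append(f"{element:<2s} {x:12.6f} {y:12.6f} {z:12.6f}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file with an .xyz extension gives something a viewer can open directly. Anything the pattern does not recognize is silently dropped, so the atom count on the first line is worth checking against the paper.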

The leader in advocating and demonstrating chemical data reuse is Henry Rzepa (see his blog for many examples). He and his group have published a paper describing a system for separating data from the paper narrative – a process they call data emancipation – as part of the scientific publication process.1 I strongly encourage readers of this blog to take a look at this paper for the publication model they propose that places data at the nexus of the scientific process and makes it available for widespread reuse. Take a look at the web enhanced objects, such as this one (you might need a subscription to access this, but this link takes you to the Figshare site which is open), to see how data can be deposited for search, retrieval, and direct reuse. This is a model I hope many computational chemists will adopt. We also need to advocate with journal editors and publishers to establish similar procedures for all manuscript submissions.


New policy regarding articles I will blog

I was all set to write a review of an interesting study of bowtiene 1: its rearrangement to other C10H6 isomers and its dimerization. But as I was gathering my information, I wanted to prepare the images of the optimized geometries, and so I went to get the supplementary materials.

The author has a section on the supplementary materials that indicates it contains Cartesian coordinates – just what I need. (This section ends with the curious line “This material is available for free of charge via the internet at”; it’s curious because the article is in a journal not published by ACS. I’ll leave for speculation just what happened here, but clearly the copy-editing done by the Canadian Journal of Chemistry is not quite up to snuff!)

So, I went to the website and clicked on the link for the supplementary material and was then told that I did not have access to this material and that either I had to become a subscriber or I had to purchase access to the article. (I should point out here that I received this article through interlibrary loan.) This is the first time that I have run into a paywall to get supporting materials! I know I am probably lucky that it took 7 years before running into this problem. But that makes this situation so frustrating – just why is the Canadian Journal of Chemistry placing supplementary material behind a paywall, especially when so few other publishers are doing this?

Well, until I get the supplementary materials, I will not write a post about this article.

New policy: I will not blog about an article unless (a) there is information on the 3-D structure of the molecules, typically in supporting materials, and (b) this information is available for free. This requirement should really be the minimum for publishing computational chemistry results. Now, I would also hope that the coordinates are readily reusable – see Henry Rzepa’s post about recent problems he’s run into!

Displaying 3-D structures in the blog – a request for conversation

With the recent disclosure of a major security hole in Java, I have been wondering if perhaps my continued use of the Jmol utility for displaying 3-D molecular structures makes sense. Perhaps it is time to consider alternatives. (Right now if you click on a molecule image in one of my blog posts and you are using Firefox, a big warning sign comes up first requiring the user to actively decide to invoke the Jmol viewer. Might this warning be enough to discourage some users?)

In addition, the increasing use of mobile devices, which most often are not Java-enabled, suggests that moving to a new display option is warranted.

So, I am asking the community to participate in a conversation about how we might best address this issue in the (near) future. Is a JavaScript widget the way to go? If so, which current program are people happy with? Or should we move to an HTML5 approach? And if this is the way to go, what tools are people suggesting?

If you want to see some fine examples of all three approaches (Java-based: Jmol, JavaScript-based: Chemdoodle, and HTML5-based: GLMol) I strongly encourage you to read Henry Rzepa’s recent brilliant article in Journal of Cheminformatics (DOI: 10.1186/1758-2946-5-6). This is a fantastic article for comparing the old-school publication technology (as presented in modern-day PDF form) with the new-school enabled technology (what Henry calls a datument). First download the pdf version and read it, and then access the HTML version; I guarantee you will be impressed by the difference in the experience.

So, please chime in on what molecular viewers I might adopt for this blog, and perhaps we as a community might be able to encourage the use of and further the development of these enhanced publication technologies.

Hacking…or how I spent my Thanksgiving vacation

I have long been a proponent of the Internet as holding the potential for revolutionizing how chemists communicate. This blog represents one of the ways that electronic communication can enhance how we exchange ideas.

This blog began as a means for me to maintain the currency of my book Computational Organic Chemistry. I realized that as soon as the book was physically printed and distributed, it was already 6 months out of date, and every subsequent day the book became that much less current. But the blog provides a mechanism for me to continuously provide updates to the book. As new articles are published, I can comment on them with the same perspective as I brought to the book.

I have been blogging now for over 5 years: almost 300 posts discussing well over 300 new articles relevant to computational organic chemistry. While the number of comments and commenters has not been particularly large, many of these comments are quite astute and there has been the occasional quite interesting back-and-forth discussion.

While my blogging is not entirely altruistically motivated, this has been more of a labor of love than anything else. So one might understand my dismay when about two weeks ago I received email messages from Jan and Henry and Eugene telling me that when they tried to access the blog, their browsers came back with a malware notice message from Google. Apparently Google will scan sites for problems and most current browsers will poll Google for the health of these sites prior to actually connecting to them.

My blog became infected somehow, and now I had to figure out how to remove the infestation! Fortunately, my son is a CS guru and so when I visited him for the Thanksgiving weekend we set out to disinfect the WordPress installation. After a bit of poking around, we found that every CSS file associated with the theme contained unauthorized byte-code. Once we removed all of that, we submitted the site for review by Google, but to no avail – the site was still infected. So, back to more searching, and we discovered that many of the plugins were infected, as were other themes. So another round of removing the foreign code and resubmitting to Google, and finally we passed inspection. The blog is now running clean!
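A first pass at this kind of search can be automated. What follows is a minimal Python sketch, not the procedure we actually used: it walks a WordPress directory and lists the files containing strings that often show up in injected code. The patterns and the function name find_suspicious_files are assumptions chosen purely for illustration. A hit is only a hint to inspect the file by hand, and a clean result proves nothing.

```python
import os
import re

# Illustrative patterns only: an assumption, not a complete malware signature list.
SUSPICIOUS = re.compile(
    rb"eval\s*\(\s*base64_decode|gzinflate\s*\(|document\.write\s*\(\s*unescape",
    re.IGNORECASE,
)
EXTENSIONS = (".php", ".css", ".js")


def find_suspicious_files(root):
    """Return sorted paths under root whose contents match a suspicious pattern."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            # Read as bytes so that oddly encoded files do not raise errors.
            with open(path, "rb") as handle:
                if SUSPICIOUS.search(handle.read()):
                    hits.append(path)
    return sorted(hits)
```

Pointing it at the wp-content directory would cover both themes and plugins, which is where our infection turned out to live.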

But what a pain! And all for some junk that simply referred people to other sites. This headache cost a number of hours of searching and cleaning and worrying – for no good reason at all. (And I had a free software consultant – Thanks D!) I must say that I came seriously close to deciding to chuck the blog entirely. The hassle of maintaining the site and fighting off spammers and the like is truly the seamy side of the web. If one ever hears the comment that distributing information on the net is “free” – remind them of the constant vigilance needed to ward off spammers and hackers and other vermin. And my little site is nowhere near as vital or subject to attack as, say, a bank, or a military base, or even a scientific publisher.

I now appreciate more fully the true cost of doing business on the web. I am dismayed about the future – the web is very much the “wild west” and lawlessness pervades. I worry that I (and others) may finally just give up. I wish I knew of a solution, but I realize that there is no way to perfectly secure a site.

So if anyone out there has a WordPress site and gets infected I can offer some advice for cleansing – and if anyone has advice as to how to stem the malware tide, please share!

My talk on blogging at the Skolnik Award Session

Henry Rzepa and Peter Murray-Rust were awarded the Henry Skolnik Award from the Division of Chemical Information of the American Chemical Society at the 2012 fall national ACS meeting in Philadelphia. Henry and Peter have been the pioneers in creating a chemically aware, semantically rich web presence and have been leaders in pushing for open data and open source. Both of them have been quite influential on my own ideas, and I have worked with them both on a number of projects. I was honored to participate in the symposium surrounding this award.

My talk was on blogging and chemical communication, and you can download my PowerPoint presentation.

Data sharing

Nature has a special feature on data sharing, including an editorial and commentaries advocating for both pre- and post-publication sharing of data. I have long been an advocate of data sharing, especially in the post-publication sense (I would argue this is really concurrent-publication data sharing) – and one can read my latest commentary in the Journal of Cheminformatics (DOI: 10.1186/1758-2946-1-2).

Data sharing has been slow in chemistry. Peter Murray-Rust and Henry Rzepa have been the major advocates for enhanced publication – see their blogs (PMR and HSR) and lots of publications. Tony Williams (blog) has been advocating for more deposition of data within ChemSpider, and there has been some positive response. But journals and AUTHORS have been slow to change – supporting materials often lack important information and are rarely in a useful form – and I consider pdf to be just a slice above “non-useful”. We need to continue to evangelize this issue!

Semantic web publishing

Another diversion from the main theme of this blog.

I have been an advocate for a revolution in chemistry publication making use of the technologies available on the net. My latest polemic on this topic is “Chemistry publication – making the revolution” (DOI: 10.1186/1758-2946-1-2) where I advocate for inclusion of more data within articles, enhancing the reader experience by being able to manipulate the data in the same way that the author did. I argue for development of tools that will enable publication of data, along with chemical semantics. Peter Murray-Rust has blogged on perhaps the first step in this direction: Chem4Word.

I ran across a very interesting article on a similar topic in Learned Publishing. The article is “Semantic Publishing: the coming revolution in scientific journal publishing” by David Shotton (DOI: 10.1087/2009202, also available from this repository). Shotton is in the zoology department and so comes to the semantic web with a different perspective, yet arrives at a place similar to the one that Peter Murray-Rust, Henry Rzepa, other chemists, and I have been advocating. Shotton advocates for “live data” and semantic markup – and cites Project Prospect (the RSC markup of chemical documents built on PMR’s work) as an example of this. Shotton includes a link to a sample zoology article that his group has “enhanced”, and there are a lot of clever additions that chemistry publishers would be well served to examine – links to data, cloud tagging, customizable references, etc. Check out the enhanced document here.

Perhaps a growing push for “enhanced publication” from many disciplines will spur on action among the major publishers!

