This is a continuation of some work I’ve been doing with the Mozilla Science Lab and their ‘code as a research object’ program. There are multiple aspects to this project, including work on code and GUI prototypes and discussions around software citation and best practices for making code reusable. This post explores some ideas around linked data and machine readable descriptions of software repositories, with the goal of making software more discoverable and therefore increasing reuse.
» JSON-LD
JSON-LD is a way of describing data with additional context (or semantics if you like) so that for a JSON record like this:
{ "name" : "Arfon" }
when there’s an entity called name you know that it means the name of a person and not a place.
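To make that concrete, here is a minimal sketch in Python (the language and the variable names are my own choice for illustration, not part of JSON-LD). It assumes the schema.org vocabulary and its Person type, and simply adds the @context and @type keys to the bare record above.

```python
import json

# The bare record from above: nothing says what "name" refers to.
plain_record = {"name": "Arfon"}

# The same record with JSON-LD keywords added. "@context" names the
# vocabulary (schema.org is assumed here) and "@type" says which type in
# that vocabulary the record describes, so "name" reads as a person's name.
person_record = {
    "@context": "http://schema.org",
    "@type": "Person",
    **plain_record,
}

print(json.dumps(person_record, indent=2))
```

The record is still plain JSON, so any consumer that ignores the @-prefixed keys keeps working unchanged.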
If you haven’t heard of JSON-LD then there are some great resources here and an excellent short screencast on YouTube here.
One of the reasons JSON-LD is particularly exciting is that it’s a lightweight way of organising JSON-formatted data and giving it semantic meaning without having to care about things like RDF data models, XML and the (note the capitals) Semantic Web. Being much more succinct than XML and JavaScript native, JSON has over the past few years become the way to expose data through a web-based API. JSON-LD offers a way for API providers (and consumers) to share data more easily with little or no ambiguity about what the data they’re describing means.
» So what about software?
Over the past few months there’s been a lot of talk about finding ways for researchers to derive (more) credit for code. There are lots of issues at play here but one major factor is that a prerequisite to receiving credit for some piece of code you’ve written is that a peer needs to both be able to find your work and then reuse it.
The problem is, it can be pretty hard to find software unless there’s a standard place to share tools in that language and the author of the code has chosen to publish there. Ruby has RubyGems.org, Python has PyPI, Perl has CPAN but where do I go if I’m looking to find an obscure library written in C++?
Discovering domain-, language- and function-specific software is an even harder problem to crack. Sure, if I write Ruby I can head over to RubyGems to look for a Gem that might solve my problem, but I’m relying both on the author to write a descriptive README and on my ability to search for terms that include similar language to the author of the package.
For many subjects, where common languages don’t benefit from canonical package indexes and the function of the software is relatively niche, just finding code that might be useful is a problem.
» Towards a (machine readable) description of software
One way to address this discoverability problem is to find a standard way of describing software with context for the terms used. A design goal here should be that these files can be almost entirely automatically generated.
Inspired by the package.json format prescribed by the npm community and using an ontology described on http://schema.org, below is a relatively short JSON-LD document that describes the Fidgit codebase. Let’s call it code.jsonld for now.
» Minimal citable form
Note that the first two lines (@context and @type) define the context for the key/value pairs in the JSON structure so that name means the name of the codebase. You can see the full ontology for Code here but this should mostly be straightforward to understand [1].
Once we get to the authors attribute we’re now entering a new context, that of an individual. As we’re still using the schema.org ontology for type Person we only need to set the @type attribute here.
There are a bunch more attributes that we could set here but this feels like a minimal set of information that is sufficient for citation (and therefore credit and attribution for the author).
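As a rough sketch of what such a minimal document might contain, the Python below builds one and serialises it. This is an illustration and not the Fidgit file itself: only the name Fidgit and the Code and Person types come from the text above, every other value is a placeholder, and the property names (codeRepository, description, dateCreated, license, author) are my assumption about which schema.org terms fit.

```python
import json

# Everything except the name "Fidgit" is a placeholder value invented
# for illustration; swap in real metadata before using this anywhere.
minimal_doc = {
    "@context": "http://schema.org",  # vocabulary for every key below
    "@type": "Code",                  # the thing described is a codebase
    "name": "Fidgit",
    "codeRepository": "https://example.org/placeholder/fidgit",
    "description": "Placeholder one-line description of the codebase.",
    "dateCreated": "2000-01-01",
    "license": "https://example.org/placeholder-license",
    # Nested context: still schema.org, so only "@type" needs setting.
    "author": {
        "@type": "Person",
        "name": "Placeholder Author",
        "email": "author@example.org",
    },
}


def to_jsonld(doc):
    """Serialise a document to the text that would go in code.jsonld."""
    return json.dumps(doc, indent=2)


print(to_jsonld(minimal_doc))
```

Note that the nested author object carries no @context of its own; it inherits schema.org from the top of the document.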
» For data archivers
This next example is a slightly modified version of the minimal form. It includes multiple authors [2] and adds the keywords required by data archivers like figshare and Zenodo. (Note these keywords should probably be more explicitly structured rather than relying on comma-delimited strings.)
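A hedged sketch of that modification in Python follows. The helper name for_archiver, the author names and the keywords are all hypothetical, and I am assuming the schema.org author and keywords properties. The keywords are kept as a list in code and only joined into a comma-delimited string at the last moment, which is one way to keep them structured for as long as possible.

```python
import json

# Placeholder base document; only the name "Fidgit" comes from the post.
base_doc = {
    "@context": "http://schema.org",
    "@type": "Code",
    "name": "Fidgit",
    "description": "Placeholder one-line description of the codebase.",
}


def for_archiver(doc, author_names, keywords):
    """Return a copy of doc with several authors and a keywords string."""
    archived = dict(doc)  # shallow copy so the base document is untouched
    archived["author"] = [
        {"@type": "Person", "name": name} for name in author_names
    ]
    # Comma-delimited string, as described in the text above.
    archived["keywords"] = ", ".join(keywords)
    return archived


archive_doc = for_archiver(
    base_doc,
    ["Placeholder Author One", "Placeholder Author Two"],
    ["placeholder keyword one", "placeholder keyword two"],
)
print(json.dumps(archive_doc, indent=2))
```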
» For discovery?
I started by describing the problem of software discovery and how domain-, function- and language-specific searches for tools are hard. So far these JSON-LD snippets don’t really help with this problem as we still only have keywords and a description for describing the software function and domain.
The schema.org Code ontology includes a programmingLanguage attribute which takes care of language-specific searches. At GitHub we’re pretty good at detecting this automatically with Linguist and so it’s not even clear that an author of a piece of software would need to manually specify this (a win).
The challenge when designing a more ‘complete’ code.jsonld document is that it’s seemingly rather tough to automate a description of what subject domain the software has been designed for and what the software does.
PLOS ONE has a pretty decent subject taxonomy that I’ve extracted into a machine readable form here and so it’s possible something along these lines could be used to assign a subject domain. Thus far, I’ve been unable to find a good schema for describing academic subjects (or any subject domains). Going deeper and attempting to describe the function of software is also proving challenging.
» Feedback please!
At this point I’d love some feedback on these ideas. The goal here is to promote software discovery and reuse, so framing this in what’s possible today is a good place to start reflecting on these ideas. Right now it’s possible to do a pretty advanced search for code on GitHub with facets for programming language, file extension, creation date, username and more. Imagine if you could do the same but add in subject area and software function?
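To illustrate what such a faceted search could look like over an index of these documents, here is a small Python sketch. It is purely hypothetical: the index records are invented, and subjectDomain is a made-up key (not a schema.org term) standing in for whatever subject schema might eventually be used.

```python
# Invented index of already-parsed code.jsonld documents.
# "subjectDomain" is a hypothetical key, not part of schema.org.
index = [
    {"name": "placeholder-a", "programmingLanguage": "Python",
     "subjectDomain": "Astronomy"},
    {"name": "placeholder-b", "programmingLanguage": "C++",
     "subjectDomain": "Astronomy"},
    {"name": "placeholder-c", "programmingLanguage": "Python",
     "subjectDomain": "Ecology"},
]


def search(records, **facets):
    """Return the records whose fields equal every given facet value."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in facets.items())
    ]


# Language-only search, then language plus subject domain.
python_tools = search(index, programmingLanguage="Python")
python_astronomy = search(
    index, programmingLanguage="Python", subjectDomain="Astronomy"
)
print([record["name"] for record in python_astronomy])
```

Exact matching on strings is the simplest possible facet; a real index would need a controlled vocabulary for the subject field for this to be useful.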
One major pitfall with this idea is that in order for an index of code.jsonld files to be useful people have to start making them - a classic chicken and egg problem. All is not lost though: pretty much all of the minimal code.jsonld file can be auto-generated and perhaps submitted to authors as a pull request by a friendly robot?
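A sketch of what that auto-generation step might look like in Python is below. The input is a plain dictionary of repository metadata whose keys (name, url, description, created, license_url, language, owner_name) are hypothetical; it does not correspond to any particular API response, and the mapping to schema.org property names is my assumption.

```python
import json


def generate_code_jsonld(repo):
    """Map a dict of repository metadata onto a minimal JSON-LD document.

    The keys read from `repo` are hypothetical and chosen for illustration.
    """
    return {
        "@context": "http://schema.org",
        "@type": "Code",
        "name": repo["name"],
        "codeRepository": repo["url"],
        "description": repo.get("description", ""),
        "dateCreated": repo["created"],
        "license": repo.get("license_url", ""),
        # Language could come from automatic detection rather than the author.
        "programmingLanguage": repo.get("language", ""),
        "author": {"@type": "Person", "name": repo["owner_name"]},
    }


# Placeholder metadata standing in for whatever a robot could collect.
example_repo = {
    "name": "placeholder-project",
    "url": "https://example.org/placeholder/placeholder-project",
    "description": "Placeholder description.",
    "created": "2000-01-01",
    "license_url": "https://example.org/placeholder-license",
    "language": "Ruby",
    "owner_name": "Placeholder Owner",
}

generated = generate_code_jsonld(example_repo)
print(json.dumps(generated, indent=2))
```

Fields the robot cannot fill in are left as empty strings here so the author can see at a glance what still needs attention in the pull request.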
One of the biggest barriers to reusing research software is finding the damn stuff in the first place - does this help?
» Links
- Description of a project (DOAP) - https://github.com/edumbill/doap
- Schema.org - http://schema.org
- JSON-LD.org