Handling Data and Metadata
If things work well, the data and metadata standards used will rarely be seen by end users, but they will enable the storage of data within server instances and the seamless transfer of data into or out of these systems. Flexible standards must have formats that can be used from a number of programming languages: compiled languages such as C, C++, and Fortran, which perform quantum chemistry calculations on supercomputers and other high-performance computing resources; interpreted languages such as Python for data analysis; JavaScript/TypeScript in web frontends; and C/C++ in desktop applications. They should also be suited to the needs of persistent data publication, independent of any particular database technology or programming language.
These considerations led to the choice of JavaScript Object Notation (JSON) \cite{json} as a core standard for data and metadata, with a view to using related technologies such as JSON-LD (JSON for Linked Data) \cite{data} where appropriate. Large data is preferably stored in binary formats. HDF5 \cite{hdf5} was seen as one strong contender here, and more recently MessagePack \cite{small} has gained traction due to its JSON-like structure and the wide language support made possible by its simple binary specification. JSON also lends itself to use in BSON \cite{serialization} and jsonb \cite{types}, two binary JSON representations used in MongoDB \cite{apps} and PostgreSQL \cite{database} respectively.
In order to share chemical data effectively, we must establish data and metadata standards capable of representing everything we wish to communicate. Further, these standards must offer routes to extension without causing breakage and churn in existing data. Ideally, communities should form to establish best practices and propagate them to a number of codes, both to prove viability and to offer a body of work that demonstrates the advantages of the approaches shown. A number of existing formats have been used, including XYZ \cite{wikipedia}, SDF, XML-based formats such as CML \cite{Phadungsukanan2012,Murray-Rust2011,Murray-Rust2011a,de2013}, and more recently JSON-based formats such as Chemical JSON \cite{Hanwell2017}. Open Babel \cite{O_Boyle_2011}, RDKit \cite{software}, cclib \cite{O_boyle_2008} and ASE \cite{Hjorth2017} offer conversion between these and many other formats, which helps to normalize data from different sources at ingestion.
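To make the normalization step concrete, the following is a minimal sketch of converting XYZ text into a JSON-serializable structure using only the Python standard library. The key names are loosely modelled on Chemical JSON and should be read as assumptions, as should the function name `xyz_to_dict` and the small element table; in practice a toolkit such as Open Babel would handle this conversion.

```python
import json

# Atomic numbers for the few elements used in this illustration only.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}


def xyz_to_dict(text):
    """Convert XYZ-format text to a JSON-serializable dict (hypothetical layout)."""
    lines = text.strip().splitlines()
    count = int(lines[0])          # line 1: number of atoms
    name = lines[1].strip()        # line 2: free-form comment
    numbers, coords = [], []
    for line in lines[2:2 + count]:
        symbol, x, y, z = line.split()[:4]
        numbers.append(ATOMIC_NUMBERS[symbol])
        # Coordinates are stored as one flat array, three values per atom.
        coords.extend(float(v) for v in (x, y, z))
    return {
        "name": name,
        "atoms": {
            "elements": {"number": numbers},
            "coords": {"3d": coords},
        },
    }


water_xyz = """3
water
O 0.0000  0.0000  0.1173
H 0.0000  0.7572 -0.4692
H 0.0000 -0.7572 -0.4692
"""

molecule = xyz_to_dict(water_xyz)
serialized = json.dumps(molecule, indent=2)
```

The resulting dictionary contains only strings, numbers, lists and nested dictionaries, so it can be written with any JSON library without custom encoders.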
The JSON specification was chosen for its wide language support, with fast, simple parsers available in every language considered, including those outlined previously: Python, C, C++, Fortran, JavaScript and TypeScript. Although JSON is not built into every one of these languages, it has become established as a simple container format that maps easily to data structures in all of them, for example Python dictionaries and C++ maps. With a little work, a JSON array together with some shape information maps to NumPy arrays \cite{van_der_Walt_2011} in Python, or to one of a number of structures in C++ such as the Eigen \cite{eigen} matrix types or a Vector3 for position information. The text format is easy for casual users to inspect, and there are clear routes to migrate to very similar binary containers in the future, as outlined earlier. The simplicity of the container format allows fast parsers to be used, and any loss of precision from the text representation of floating-point numbers is a secondary concern for visualization. Precision will, however, be very important when exchanging data between quantum chemistry codes, and would warrant further consideration there.
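The mapping from a flat JSON array to a shaped structure, and the precision question, can be sketched as follows. This is an illustrative example using only the Python standard library; the document layout is loosely based on Chemical JSON and the helper `reshape_coords` is hypothetical. With NumPy available, the reshaping step would typically be written as `numpy.asarray(flat).reshape(-1, 3)`.

```python
import json

# Hypothetical document: coordinates are one flat array, three values per atom.
document = """
{
  "atoms": {
    "elements": {"number": [8, 1, 1]},
    "coords": {"3d": [0.0, 0.0, 0.1173,
                      0.0, 0.7572, -0.4692,
                      0.0, -0.7572, -0.4692]}
  }
}
"""


def reshape_coords(flat, width=3):
    """Group a flat coordinate list into rows of `width` values."""
    if len(flat) % width != 0:
        raise ValueError("coordinate array length is not a multiple of width")
    return [tuple(flat[i:i + width]) for i in range(0, len(flat), width)]


data = json.loads(document)
numbers = data["atoms"]["elements"]["number"]
positions = reshape_coords(data["atoms"]["coords"]["3d"])

# Python's json module writes floats using repr(), so this round trip is exact.
value = 0.1 + 0.2
exact = json.loads(json.dumps(value))

# A writer that emits a fixed number of digits loses information.
truncated = json.loads(f"{value:.6f}")
```

Whether a round trip is exact therefore depends on how the writer formats floating-point numbers, not on JSON itself. A writer that emits enough significant digits preserves double-precision values, whereas one that truncates for compactness is adequate for visualization but not for exchanging data between quantum chemistry codes.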