NetCDF-3 Storage Specifications

Genotypes come with a number of properties that make their usage and storage particularly computationally expensive, especially when we talk about Whole Genome Study genotype datasets (and further beyond, genetic sequences).

The storage and posterior retrieval of specific subsets of the data, at arbitrary places at arbitrary times (non-sequential) is a very desirable feature for the types of analysis that are being performed on these datasets. This goal has commonly been achieved through the use of Relational Databases with the help of SQL language.
Nevertheless, RDB technologies have poor scaling properties for the type of data that composes Whole Genome Studies. This is why sequential access of the data is the prevailing format choice for many of the current WGS applications.

Sequential access methodologies, such as parsing flat-text files, limit the type of operations you can do in a given time and on a given machine. They also precondition the algorithms that you may use to process the data, resulting in sub-optimal applications.
A reassessment had to be done to identify better storage and retrieval technologies, that would suit the problem at hand.

Hierarchical Data Formats

There are other scientific areas that have been confronted to similar problems since the 1980’s and 1990’s and by now a number of technologies have been developed to solve them. Paralellization and distributed storage and processing are one example, but they require a whole new layer of hardware and management structures which would be difficult to implement for the scope of the application that we wanted to develop.

What was needed was an application that was easy to use, agile, self-teaching, cross-platform, ran on a standard desktop and scalable (a tall bill!).

For some time, the meteorological community had been using a format called NetCDF for storing satellite images and other weather data. In parallel, the particle collisioners around the world were also generating huge amounts of data that had to be stored and then processed by scientists, who developed a format called Hierarchical Data Format (HDF) to help them.
Both technologies were related in concept and ended up merging eventually into what basically can be described as a format aimed at storing large multidimensional arrays of scientific data, with a focus on scalability.

GWASpi uses the NetCDF-3 technology as it is easier to develop stand-alone, cross-platform applications with it, even though it’s less feature-laden and agile as it’s HDF5 counterpart.

The downside with these new technologies is that you don’t have the SQL language to query the database anymore. Thus, you must construct the database in a way such as to be able to know where it’s parts are located. In GWASpi’s case, an index of samples and SNP markers is being constructed as you load genotype data in the system.
The user doesn’t have to worry about these low-level details, as the input and output will always be delivered in a range of familiar formats using an easy-to-use management interface.

Genotype Data Characteristics

The properties of Genotype can be described, in general terms, as follows:

  • The database is dense (no sparse datasets, little missing data)
  • It doesn’t change often after acquisition
  • The size of each array element is fixed and known (alleles)
  • Genotypes can be ontologized and fitted to custom datatypes (arrays and matrices)
  • The order of genotypes follows convened rules (chromosomes, physical position)
  • The genotypes can be indexed by an unique ID (dbSNP ID and/or proprietary IDs)

Considering these properties, a convenient NetCDF-3 database can be designed to fit our needs.

GWASpi’s NetCDF-3 database

The genotype data you will provide, along with it’s annotations and mappings, will be stored in a NetCDF-3 database as described below.

The database will contain metadata that describes the data it contains. It also contains global attributes that apply to all it’s content:

Dimensions

Sample Set Dimensions
Sample Set Dimension is set to UNLIMITED. This means that this dimension can be set to an arbitrarily large number that depends only on the physical size of your hard-drive.
The size of each element (ID) in this set is limited to 64 characters.

Marker Set Dimensions
Marker Set Dimension is set to the number of markers you will provide in your dataset. There is an upper limit to the number of markers you can provide for each Sample. The limit is the maximum unsigned integer number, 2³²-1, or 4.294.967.295 SNPs. Beyond this number, a new array would have to be generated to store further data.
The size of each element (ID) in this set is limited to 64 characters.

Genotye Matrix Dimensions
Genotye Matrix Dimension will be the multiple of the “Marker Set Dimension” × “Sample Set Dimension”. This will result in a matrix with “UNLIMITED” rows with “Marker Set Dimension” columns.
The size of each element (genotype) in this set is currently limited to 2 characters, one for each allele.

Attributes

Genotype Encoding
This attribute specifies the encoding the genotypes occur (ACGT, 1234, AB, 12, UNKNOWN)

Technology
An attribute field to specify the original technology the genotypes were provided in (PLINK, Hapmap, Affymetrix, Illumina…)

Description
User generated description of the matrix.

Metadata Arrays

Marker ID
This ordered array of dimension “Marker Set Dimension” contains all the Marker IDs in a specific order (chromosome first, then physical position). The Marker IDs may be equal to the dbSNP database ID (rsID), but can be different of the original technology deems so (e.g. Affymetrix).

Rs ID
An ordered array of dimension “Marker Set Dimension”.
This is dbSNP’s rsID code for every Marker, where available. may be equal to Marker ID array.

Chromosome
An ordered array of dimension “Marker Set Dimension”.
This is the stated chromosome location of each Marker. Chromosomes will be encoded and ordered as follows: 1-22, X, Y, XY and MT.
XY means pseudo-autosomal, MT means Mitochondrial.

Position
An ordered array of dimension “Marker Set Dimension”.
This is the stated physical position within a chromosome of each Marker. This position should refer to dbSNP’s reference database.

Encoding Dictionnary
An ordered array of dimension “Marker Set Dimension”.
This will hold a dictionary for translating a given encoding (AB or 12) to the dictionary’s counterpart. Only provided by specific technologies (Affymetrix).

Sample ID
An ordered array of dimension “Sample Set Dimension”.
This is an ordered array of Sample IDs, as provided by the user. The order will be the one given or the timeline order of opening the individual genotype files (depending on the input format).

Genotype Matrix

The genotype matrix, as stated above will be a 3 dimensional matrix, of dimensions “Marker Set Dimension” × “Sample Set Dimension” × “Genotype Dimension”.
The latter dimension is just a size specification of a genotype, in our case 2, one for each allele:

Writing Data to a Matrix

In GWASpi, everytime a dataset is loaded or in general, a new Matrix is created through any of the Operations that require it, data is written in the above specified way.
NetCDF-3 is particular in the way it writes the data in that you have to first define the shape (dimensions) of the data you will be writing, in what is called “Design Time”. Once that phase is done, the shape is set and cannot be modified anymore. GWASpi takes care of this step given the data you provide.

Next, the “Write Time” phase starts and genotypes data will be written to the matrix. Once that phase is done, the NetCDF-3 file will be wrapped up, and closed. After a NetCDF-3 file has been closed, it cannot be edited anymore.

This may seem like a major lack of NetCDF-3 but in our case, as genotype data is pretty much static, it doesn’t affect the processes involving it’s treatment. Any subsequent editing or processing, any Operation or analysis that has to be performed on this matrix will yield data that will be written in a new NetCDF-3 file, always keeping the original as is. This comes quite handy for keeping track of your workflow and raw data. In a nutshell, any Operation performed in GWASpi, short of a deletion, is non-destructive.

Retrieving Data from Matrix

The principal task required form its NetCDF-3 databases by GWASpi is to be able to read any fragment of data it contains, regardless of it’s position inside the dataset and in predictable subsets, labeled with the Marker and Sample headers. This is done by using the NetCDF-3 java API provided by Unidata UCAR and Java’s LinkedHashMap objects, which permit to program a fast and efficient way of manipulating arrays of data.

LinkedHashMaps

LinkedHashMaps (LHM) are ordered HashMaps, which combine the power of fast recovery of elements within an array, using a key, and a predictable sort order. This key will return the stored Value within the LHM. Also, contrary to traditional HashMaps, the order of key/value pairs is specified and known. In this case, it will follow the order indicated in the Matrix Marker ID and Sample ID Metadata Array.

In other words, you may access rows or columns of the Matrix, as well as subsets of rows or columns. Then, within this LHM you will be able to retrieve a given data point, by requesting a key, be it Marker ID or Sample ID depending on the type of LHM you read. You may also iterate through the LHM, getting every key/value pair in a specific, expected order.

Note: Theoretically, you could even read larger chunks of data, not limiting yourself to a single row or column. But taking into account the limited RAM you allocate to an application on a typical Desktop, this would be a bad strategy. Keeping the memory usage under control is crucial for safe functioning.

Other NetCDF-3 databases used in GWASpi

The above concept is also applied to other datasets that are generated by GWASpi, such as quality assurance, analysis data, reports and charts.
In the case of QAs and analysis data, the test results are stored in character and numerical 1, 2 and 3 dimensional arrays, as needed, along with their correct shapes, dimensions and metadata, so as to be able to retrieve the context the data has been generated for.
LinkedHashMaps are also used to read/write to the database.

Conclusions

Using HDF formats is an efficient solution for large scale static datasets. They adapt very well to the needs of genotyping and genome data. The quick read/write speeds as well as the growing development of tools and APIs using this technology paves the way for more and more bioinformatics applications.
Java’s LinkedHashMaps provide robust and fast methods to manipulate the data and ensure it’s coherence throughout all GWASpi’s operations. The above described formats, methods and ontologies are a good basis to build upon. Further operations, import and export formats, analysis results can easily be added, using the available APIs.