Basics
So this page details the specs for how exactly we're storing all that data. It is VERY subject to revision until it's nailed down; as of right now, it's being made up on the spot.
There are four forms of data we need to archive, plus the templates used to generate them:
- Raw Data (data_raw) - This is the PDF, Excel sheet, or whatever other source we get the raw data from. We must maintain a local copy of this at all times.
- Cleaned Data (data_cleaned) - Plain text, usually .csv files, converted from whatever form the raw came in. No real parsing or changing of the data happens at this stage.
- Parsed Data / .sixml files (data_sixml) - SixLink's data format, described below. Note that any fine-tuning of categories or text should happen in the .sixml file.
- Vis Data (data_vis) - These are (mostly) .xml files that the actual charts and such use. They include formatting information, and only the data we need to display.
- Templates (data_template) - Not data per se, but the templates and headers used to generate the data_vis files.
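To make the hand-off between these folders concrete, here's a minimal sketch of where a dataset's file lives at each stage. The folder names come from the list above; the shared-base-name convention and the helper itself are assumptions, not settled spec:

```python
from pathlib import Path

# The five pipeline folders named above. A dataset keeps the same base name
# as it moves through the stages; only the folder (and extension) changes.
# Raw files keep whatever extension they arrived with, hence None.
STAGES = {
    "data_raw": None,         # original extension (.pdf, .xls, ...)
    "data_cleaned": ".csv",
    "data_sixml": ".sixml",
    "data_vis": ".xml",
    "data_template": ".xml",
}

def stage_path(root, dataset, stage):
    """Return where a dataset's file for a given stage would live.

    For data_raw we can't know the extension, so we return just the folder.
    """
    ext = STAGES[stage]
    folder = Path(root) / stage
    return folder if ext is None else folder / (dataset + ext)
```

So `stage_path("archive", "wind_production", "data_sixml")` points into the data_sixml folder with a .sixml extension.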
.Sixml format
The file is composed of groups of datasets. Each dataset contains:
- Date
- Comments
- Values: [ n ... y ... m ] (note that "n" and "m" can have the value "rel", which indicates that they should be scaled to the relative min and max of the set.)
The file also has a header for the data, applicable to all datasets (which means each date's set must list its values in identical order):
- Name row: [ min ... x ... max ] (min and max are optional overrides for a column's scale).
- Goodness: [ 0 ... z ... 100 ]. Goodness is a value judgment we make about how good a particular column is relative to the others. It's used in display (color, size, etc.). For example, wind power might have a goodness of 90, while coal would be down at 7.
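Since the tag-level syntax isn't nailed down yet, here's the structure sketched in Python terms (all names are placeholders), including one reading of how a "rel" endpoint could be resolved against a dataset's own min and max:

```python
from dataclasses import dataclass

@dataclass
class Column:
    """One entry in the header's name row."""
    name: str
    goodness: int      # 0..100, our value judgment about the column
    min: float = None  # optional override of the scale minimum
    max: float = None  # optional override of the scale maximum

@dataclass
class Dataset:
    """One per-date group in the file."""
    date: str
    comments: str
    values: list  # one entry per column, in header order; ends may be "rel"

def resolve_rel(values):
    """Replace "rel" endpoints with the set's own min and max.

    This is one interpretation of the spec above: the first and last
    entries may be "rel", meaning "scale to the relative min/max of
    the set".
    """
    numbers = [v for v in values if v != "rel"]
    lo, hi = min(numbers), max(numbers)
    out = list(values)
    if out[0] == "rel":
        out[0] = lo
    if out[-1] == "rel":
        out[-1] = hi
    return out
```

If "rel" turns out to mean something else (e.g. scaling against the whole file rather than one dataset), only `resolve_rel` needs to change.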
Transformations
Whenever possible, the transforms from raw to .sixml should be automated, and the code preserved. We're pretty platform-agnostic on this; for now, all transformation code is written in Python and stored in the 'transformationApps' folder.
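As an example of the kind of transformationApps script meant here, a sketch that reads a cleaned CSV and produces the per-date dataset groups described above. The CSV column layout and the dict shape are assumptions; the real .sixml serialization isn't fixed yet:

```python
import csv
import io

def csv_to_datasets(text):
    """Turn a cleaned CSV (date, col1, col2, ...) into per-date dataset dicts.

    One dict per row, matching the dataset fields described above:
    date, comments (empty for now), and values in header order.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row, e.g. ["date", "wind", "coal"]
    datasets = []
    for row in reader:
        datasets.append({
            "date": row[0],
            "comments": "",
            "values": [float(v) for v in row[1:]],
        })
    return datasets

# Hypothetical cleaned-data input:
sample = "date,wind,coal\n2009-01,120.5,980.0\n2009-02,131.2,970.4\n"
```

The point is the shape of the output, not the parsing; a real script would also carry over comments and handle missing values.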
Transformations from .sixml to visualization XML must be automated, and will be part of the back-end visualization code. To begin, we'll regenerate the viz manually (one of us hits the button when new data goes live), but eventually this system could be automated. Importantly, transforms will not be done in real time on the live site. For real-time, updated data, we'll be pulling from the database. If we run into a graph that really does update very often, we can re-evaluate this system.
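A minimal sketch of that regenerate step, merging header info (names, goodness) with per-date values into a vis-XML document. The tag and attribute names here are made up; the real vocabulary would come from the data_template files:

```python
import xml.etree.ElementTree as ET

def build_vis_xml(columns, datasets):
    """Combine header columns and per-date datasets into vis XML.

    columns: list of (name, goodness) pairs, in header order.
    datasets: list of dicts with "date" and "values" keys.
    All element/attribute names below are placeholders.
    """
    chart = ET.Element("chart")
    for name, goodness in columns:
        ET.SubElement(chart, "series", name=name, goodness=str(goodness))
    for ds in datasets:
        point = ET.SubElement(chart, "point", date=ds["date"])
        for (name, _), value in zip(columns, ds["values"]):
            ET.SubElement(point, "value", series=name).text = str(value)
    return ET.tostring(chart, encoding="unicode")

xml_out = build_vis_xml(
    [("wind", 90), ("coal", 7)],
    [{"date": "2009-01", "values": [120.5, 980.0]}],
)
```

Because goodness rides along as an attribute, the front-end charts can map it straight to color or size without touching the .sixml files.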