Sunday 3 July 2016

Converting MongoDB semi-structured to structured schema - Part I


How to convert semi-structured MongoDB schema into fully structured tabular/SQL schema? As we know MongoDB is a schemaless database that stores JSON documents in so-called collections. Schemaless means, that each document in collection holds its own schema: name and type of the field. JSON does not support directly types which are necessary to store data for fast read and write. Hence MongoDB storage is BSON based. BSON is a strongly type serialization format, and binary counterpart of JSON. Having stored JSON document we can ask MongoDB to provide count of documents field that match specific type, for example:

db.collection.find({'field_name':{$type:0}})

will return the number of documents where field_name is of type long. Most of MongoDB types can be easily converted to SQL counterparts: int->int, string->varchar(), ISODateTime->datetime so on.
During converting semi-structured MongoDB schema to structured schema, there are two structures that will need special attention: embedded document and embedded arrays.

Flattening JSON documents
Below I demonstrate a simple JavaScript based approach to flatten JSON documents in MongoDB. The script will open MongoDB connection, select database, collection and start iterating over all documents. I'm not a JavaScript developer hence I had to scrap the internet to find my initial script. After a few amendments, I came up with something like script below. First, load below script via MongoDB shell:

function convert(jobName,databaseName,collectionName){
conn      = new Mongo();
db        = conn.getDB(databaseName);
collection= db.getCollection(collectionName)
cursor    = collection.find();

while ( cursor.hasNext() ) {
   flatten(cursor.next(),
    db.getCollection(jobName),
    jobName,
    -1);
}}

var curID=0function getNextId(){
  return curID++
}
var done=falsefunction flatten(data, coll, parentName, parentId) {
    var currentId      = getNextId()
    var result  = {"id":currentId,"parent":parentId};
    function recurse (cur, prop) {
        if ( typeof(cur) == "function" )
            return        // scalar value        if (Object(cur) !== cur) {
            result[prop] = cur;
        } else if (Array.isArray(cur)) {
             // if current is array I don't include to object -             // in case of maps it prevents from creating massive schemas             var coll=db.getCollection(prop.replace(/-/g,"_"))
             for(var i=0, l=cur.length; i<l; i++)
             {
                 // spawn another flatten for each array                 flatten(cur[i],coll,prop,currentId)
             }
        } else {
            var isEmpty = true;
            for (var p in cur) {
                //print(prop+typeof(cur))                //if(p == "_id")                //    continue                isEmpty = false;
                recurse(cur[p], prop ? prop+"_"+p : p);
            }
            if (isEmpty && prop)
                result[prop] = {};
        }
    }
    // bootstrap    recurse(data, parentName);
    coll.insert(result)
    return result;
}

Then simply call convert function providing a unique name of the job, database name and collection name. When the function is finished you will have a whole bunch of new collections in your working database, with names starting with job's name and '_'. If your job name is for example 'job', your main collection will have name 'job'.
Each field inside documents is created by concatenating field name. For example document:
  {
   "_id" : ObjectId("57682626c3ea9b6005dc0949"),
   "methodId" : "65c05a05-f63e-4889-bdc0-2a7787a57cf9",
   "method: {    methodDate" : 20160602,    "job_method_version" : 2    }
  }
Will result:
> db.job.find()[0]
  {
     "_id" : ObjectId("57682626c3ea9b6005dc0946"),
     "id" : 0,//new field                       //we need id because we have to record parent-child relation     "parent" : -1,   //new field    //we need id because we have to record parent-child      relation                                   //,-1 indicates that this is top level collection.      "job_methodId" : "65c05a05-f63e-4889-bdc0-2a7787a57cf9",
     "job_method_methodDate" : 20160602,
     "job_method_version" : 2,
  }

JSON arrays
What do we do when we see an array in the document? Existing arrays will have their separate collections, for example, job_flags will be a collection that is derived from the array that sits in the main document under field name 'flags'.We just have to create another table and link current table with the new one using reference keys (parent-child relation). Then we keep analyzing what is in the array and if we find another one inside, then we do the same trick. When we finish we should have an SQL schema that resembles the snowflake schema well known from data warehousing, but in this instance, we don't have facts and dimensions but simple relationships. When we want to deal with a more flat structure we can solve it in two ways: use SQL views or, check the length of the array and when it is very short then just extend current table by adding new columns instead of creating child table. Be aware though that most of SQL databases have constraints on the number of columns in the table.


Caveats - maps
If you have a collection where each document looks differently from one another then there is no way to convert the whole collection into SQL tables with a finite number of columns. One would have to use wide table database like HBase that deals very well with sparse data in the columnar data store. Another caveat may occur when sometimes developers use maps for serialization purposes for example: Map<String, Struct<...>> to store data via DAO (Data Access Layer) where the string is key instead of fixed name for example: instead of using ContracId as the name of the field, developers may use actual contract id key's value: '00024535-8603-4dc7-9dc6-e7a55b3ce7d3'. That approach also makes such conversion difficult as there is no algorithm to discover whether the name of the field is a proper name in the schema-less collection or just a key value. This sort of problems falls into data modeling domain. Mongo allows to do that but as I said before, it makes MongoDB schema difficult to analyze. One remedy could be to provide a list of maps used to the analyzing algorithm. Another is to check the format of the field name: if the format matches GUID/UUID format then we can assume the field is a key and we have to move it to a child table instead of extending the current table.


0 New
Reply