– Sarath Chandran and Ashwin B
In one of our recent projects, we built an enterprise cloud-based digital document management and collaboration solution for a client. The application needed to upload large documents along with their associated thumbnails and extracted OCR text for search. It also had to support document versioning, which meant each version would have its own thumbnail file and OCR file. This called for a database that could store a large number of documents of varying sizes, with a scalable retrieval mechanism to fetch them.
The RDBMS option:
An RDBMS was not really an option for this project, for the following reasons.
- In an RDBMS, you must define a schema before adding records to a database. The schema is a structure described in a formal language supported by the database; it provides a blueprint for the tables and the relationships between them. With a schema-based approach, the metadata of uploaded documents would be restricted to the data types defined in the schema, which would limit the range of documents one could upload. This limitation disappears with a schemaless NoSQL database.
- Further to the above point, within each table you need to define constraints in terms of rows and named columns, as well as the type of data that can be stored in each column. With a NoSQL document store, the structure of each record follows from the data itself rather than from up-front definitions, which saves developer time.
- Files can be stored only as a BLOB data type. This data type can hold at most 4294967295 bytes (2^32 - 1), which is 4GiB, or approximately 4.29GB. This restricts the application to files no larger than that.
- The system requirements mandated automatic failover from master to slave for application availability, that is, automatic promotion of a slave to master when the master goes down. The RDBMS option did not offer this automatic promotion when the master server is down.
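The byte count quoted in the list above is 2^32 - 1. As a quick check of what that means in binary and decimal units, here is a small Python sketch; the function and constant names are illustrative and not part of any database API.

```python
# Size limit quoted in the list above, in bytes (2**32 - 1).
BLOB_MAX_BYTES = 4_294_967_295


def to_gib(n_bytes: int) -> float:
    """Convert bytes to binary gigabytes (GiB, 1024**3 bytes)."""
    return n_bytes / 1024**3


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (GB, 1000**3 bytes)."""
    return n_bytes / 1000**3


def fits_in_blob(file_size_bytes: int) -> bool:
    """Return True if a file of this size stays within the quoted limit."""
    return file_size_bytes <= BLOB_MAX_BYTES


if __name__ == "__main__":
    print(f"{to_gib(BLOB_MAX_BYTES):.2f} GiB")  # 4.00 GiB
    print(f"{to_gb(BLOB_MAX_BYTES):.2f} GB")    # 4.29 GB
    print(fits_in_blob(5 * 1024**3))            # False
```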
Due to the above challenges, we decided to go with a document-oriented NoSQL database. After evaluating different options, we ultimately went with MongoDB.
What is MongoDB?
MongoDB is a document-oriented database that provides high performance, high availability and easy scalability. It is classified as a NoSQL database: it breaks from the traditional table-based relational structure and stores JSON-like documents with a dynamic schema, in a format called BSON. It also supports sharding, and no conversion or mapping of application objects to database objects is needed. It keeps the working set in memory, which enables faster access to data. It also has deep query-ability, meaning MongoDB supports dynamic queries on documents using a document-based query language.
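To make the idea of a dynamic schema and a document-based query concrete, here is a minimal plain-Python sketch. It does not use MongoDB or any driver; the field names and the `find` helper are hypothetical and only imitate the simplest kind of equality filter.

```python
import json

# Two versions of the same document in one "collection". Field names are
# hypothetical; version 2 carries fields that version 1 does not have,
# which a fixed table schema would not allow without a migration.
documents = [
    {"_id": 1, "name": "contract.pdf", "version": 1,
     "ocr_text": "lease agreement"},
    {"_id": 2, "name": "contract.pdf", "version": 2,
     "ocr_text": "lease agreement, amended",
     "thumbnail": "contract_v2.png", "tags": ["legal", "signed"]},
]


def find(collection, query):
    """Return documents whose fields equal every key/value pair in `query`.

    A toy stand-in for an equality filter. A real document query language
    also offers operators, indexes and nested-field matching.
    """
    return [doc for doc in collection
            if all(key in doc and doc[key] == value
                   for key, value in query.items())]


if __name__ == "__main__":
    for doc in find(documents, {"name": "contract.pdf", "version": 2}):
        print(json.dumps(doc, indent=2))
```

The query itself is just another document (a dictionary), which is the shape MongoDB filters take as well.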
MongoDB over other document store databases:
MongoDB obviously isn't the only NoSQL database out there. There are many other open source document-oriented databases, such as Elasticsearch, ArangoDB, Couchbase, SequoiaDB, JSON ODM and others. Among those, Couchbase was our top alternative to MongoDB, and this is how they compared:
From the above table, based on MongoDB's ranking (rank 1) over Couchbase (rank 2) for document storage, together with MongoDB's role-based access rights mechanism, we chose MongoDB for the document storage application.
Size limitation in MongoDB and the GridFS solution:
A catch with MongoDB by itself is that a single stored document is limited to 16MB, which obviously wouldn't do. Although the maximum BSON document size is 16MB, MongoDB can support bigger files when used with the GridFS API.
GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16MB. Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. It uses two collections to store files: one stores the file chunks (the actual contents of the file) and the other stores the file metadata. This lets an application store files far larger than 16MB, with no practical limit on file size.
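The chunking scheme described above can be sketched in plain Python with two in-memory lists standing in for the two collections. This is an illustration of the idea, not the GridFS implementation: `put_file` and `get_file` are hypothetical helpers, while the document fields (`files_id`, `n`, `data`, `length`, `chunkSize`) and the 255 kB default chunk size follow the GridFS specification.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # GridFS default chunk size (255 kB)


def put_file(files, chunks, file_id, filename, data,
             chunk_size=DEFAULT_CHUNK_SIZE):
    """Split `data` into chunk documents and record one metadata document."""
    for n, start in enumerate(range(0, len(data), chunk_size)):
        chunks.append({
            "files_id": file_id,                     # which file this chunk belongs to
            "n": n,                                  # position of the chunk in the file
            "data": data[start:start + chunk_size],  # the chunk's bytes
        })
    files.append({
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    })


def get_file(chunks, file_id):
    """Reassemble a file by joining its chunks in order of `n`."""
    parts = sorted((c for c in chunks if c["files_id"] == file_id),
                   key=lambda c: c["n"])
    return b"".join(c["data"] for c in parts)


if __name__ == "__main__":
    files, chunks = [], []
    payload = bytes(600 * 1024)  # a 600 KiB dummy file
    put_file(files, chunks, 1, "scan.pdf", payload)
    print(len(chunks))                     # 3
    print(get_file(chunks, 1) == payload)  # True
```

With a real deployment, a driver does this work for you. In PyMongo, for example, the `gridfs` module exposes `GridFS.put` and `GridFS.get`, and by default the two collections are named `fs.files` and `fs.chunks`.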
Besides the size benefit, GridFS is also useful for storing any file you want to access without having to load the entire file into memory.
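Partial access works because a byte range maps onto a small set of chunks. The sketch below, under the assumption of fixed-size chunks numbered from 0, works out which chunks a range touches and reads only those. `chunk_span`, `read_range` and the `fetch_chunk` callback are hypothetical names, not driver functions.

```python
def chunk_span(offset, length, chunk_size):
    """Return (first, last) chunk indexes covering bytes [offset, offset + length)."""
    if offset < 0 or length <= 0 or chunk_size <= 0:
        raise ValueError("offset must be >= 0; length and chunk_size must be > 0")
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return first, last


def read_range(fetch_chunk, offset, length, chunk_size):
    """Read `length` bytes starting at `offset`, fetching only the needed chunks.

    `fetch_chunk(n)` is a caller-supplied function returning the bytes of
    chunk number `n`; here it stands in for a database lookup.
    """
    first, last = chunk_span(offset, length, chunk_size)
    buffer = b"".join(fetch_chunk(n) for n in range(first, last + 1))
    start = offset - first * chunk_size  # position of `offset` inside the buffer
    return buffer[start:start + length]


if __name__ == "__main__":
    data = b"abcdefghijklmnop"
    # Tiny 4-byte chunks keep the example readable.
    store = {n: data[i:i + 4] for n, i in enumerate(range(0, len(data), 4))}
    print(read_range(store.__getitem__, 5, 4, 4))  # b'fghi'
```

Only chunks 1 and 2 are fetched in this example; chunks 0 and 3 are never touched. Drivers expose the same idea through file-like objects, for instance `seek` and `read` on PyMongo's `GridOut`.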
Having implemented the above, with a schemaless architecture for document storage, unlimited upload capability and an automated DB failover mechanism to boot, the project is up, firing on all cylinders and running 24x7. The goal of this project was a product upgrade from an older version, and this architecture has improved performance and document management capability many times over.