Recently, we had to develop a software ecosystem which would acquire data (structured and unstructured) from many different sources and propagate them to a consolidated data platform – in real time. The system had to achieve the following objective, using open source technologies:
• Access large number of different categories of unstructured data.
• Achieve fault tolerance
• Ensure 0% data loss
• Provide fast and parallel data processing
• Treat each data differently based on its type.
• Facilitate logging and admin notifications
We can use Apache Hadoop for this kind of big data processing but it does not have a fault tolerant queue system and differential handling of each data source will also be a problem. We considered point to point messaging systems available in languages like Java/Scala but they are not a good fit with respect to fault tolerance, no data loss and separate processing of data.
We had a long discussion hours with our team of experts and went on a hunt for new technologies in web. Finally after long deliberations, we decided on using the combination of three open source software – Apache Zookeeper, Apache Kafka and Apache Storm. A collaborative solution using these three, together with Scala was a perfect fit for our needs.
Apache Kafka is a producer subscriber messaging system which will maintain the data queue and the data can be processed by accessing from the queue. Kafka is a producer consumer messaging system, using which the data will be produced and queued on a topic and this in turn, can be accessed from the consumers by using the topic name on which its queued. Kafka will use Zookeeper to maintain the data in a data directory which we will configure. Zookeeper will also maintain the configurations of all these Kafka and Storm. Apache Storm is a real time computing system and Storm will access data from Kafka consumers by using the topic names to a storm component called ‘Spouts’. The spouts will send the data to another storm component called ‘bolts’ and we can have multiple bolts, each bolt will process the data separately at the same time. We can save different types of data to different database and perform similar different operations. Bolts are the workers of the Storm – they will perform real time parallel computing. All we have to do is to configure our Storm and deploy a ‘Topology’ for storm Spouts and bolts with our business logics to perform. All these combinations can be setup in a clustered environment and the cluster configurations will be setup in zookeeper and storm. Kafka will elect the leader from multiple servers for each request by using zookeeper. Proper logging and notification can be achieved by setting up certain configurations to the components. There will not be any data loss because Kafka will have all the data in its queue even if the server has been shut down. The processing will continue once the server recovers from the failure. The overall representation of the queue system is shown below.
In our application, we used a Scala REST API framework named ‘Play’ and integrated all the supported software components of Zookeeper, Kafka and Storm from Scala. All these components can be installed and configured easily – it’s just a matter of configuring all our needs in the relevant configuration files. We had to write code for, Producer consumer business logic in Kafka and data processing business logic in Storm. The server maintenance is easy because we have written some auto running scripts to start, stop and restart the servers and the same for deployment process too. And configuration of zookeeper and Kafka were fine tuned to clean the unused data so as to ensure optimal utilization of the servers and to avoid memory wastage.
Though the system is complex, it was relatively easy to develop and the use of open source tools required only minimal effort. The crux is to choose the ideal technologies which can deliver our needs. Some of the key parameters to keep in mind are the functional requirements, efficiency, best tested results of the technology, web support and flexibility in usage. Zookeeper, Kafka and storm combination is an ideal one which we realized and learned from our own projects.