Big Data Landscape 2016: A General Sherman

Did you watch that jungle movie? I don’t mean the new fantasy adventure “The Jungle Book” but the old, funny “George of the Jungle”. I just love Brendan Fraser in that movie; nobody else could have done justice to that role. But I also like the narrator, who keeps making the scenes so interesting and funny. Right?!

So I thought maybe I could narrate the Big Data landscape series at least once (please bear with me, as this is my childhood fantasy). So let me tell you a wonderful story. A lovely garden named Big Data Analytics had the most wonderful trees in it. Its six biggest trees made the garden so lively. Would you like to know what these trees are? Then what are you waiting for? C’mon, dig in.

• Infrastructure
• Analytics
• Applications
• Cross-Infrastructure/Analytics
• Open Source
• Data Sources/APIs

Now let us explore the branches that make the Infrastructure tree so spectacular, along with the leaves (tools) that hold the top places on each branch…

Infrastructure- Family Tree:

[Image: Infrastructure family tree]

Each leaf on these twigs adds a unique flavor and brings a special offering. These leaves hold some precious things, and I would love to list them for you.

Hadoop On-Premise:

In many organizations, Hadoop is the central engine of the analytics express. Deciding whether this engine should run on premises, and choosing the right provider for it, is as important as deciding which analytics are required. Given below are some vendors that can handle the setup and maintenance of a Hadoop platform on your premises. Each one also has a unique offering besides the standard Hadoop elements.

Tools:

  • Cloudera
  • Hortonworks
  • MapR
  • Pivotal
  • IBM InfoSphere
  • BlueData
  • Jethro
  • Splice Machine

Hadoop in the Cloud:

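Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop in the cloud means running that framework on a provider’s hosted infrastructure instead of on your own servers.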
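To get a feel for the distributed processing style Hadoop is best known for (MapReduce), here is a tiny single-machine sketch in plain Python. It does not use Hadoop at all, and the function names are made up for illustration; it only shows the map, shuffle and reduce steps on a word count.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(lines):
    # On a real cluster each line (or block) could be mapped on a different machine.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))

counts = word_count(["big data big trees", "data garden"])
print(counts)
```

On a real cluster, the map and reduce steps would run on many machines at once, which is where the scale comes from.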

Tools:

  • Amazon Web Services
  • Microsoft Azure
  • Google Cloud Platform
  • IBM InfoSphere
  • Treasure Data
  • Cazena
  • Altiscale
  • Qubole
  • Xplenty

Spark:

Apache Spark is an open-source engine developed specifically for handling large-scale data processing and analytics. Spark offers the ability to access data in a variety of sources, including Hadoop Distributed File System (HDFS), OpenStack Swift, Amazon S3 and Cassandra.

Tools:

  • Databricks
  • GridGain
  • Tachyon NEXUS

Cluster Services:

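Cluster services let a group of machines work together as a single system, handling scheduling, orchestration and high availability (HA) for the applications that run on them. Microsoft Cluster Service (MSCS), for example, provides HA for applications such as databases, messaging, and file and print services.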

Tools:

  • Amazon Web Services
  • HPCC Systems
  • Kubernetes
  • Docker
  • Mesosphere
  • CoreOS
  • Pepperdata
  • StackIQ

NoSQL Databases:

A NoSQL (originally referring to “non SQL” or “non-relational”) database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases.
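As a rough illustration of "modeled by means other than tables", here is a small sketch that contrasts fixed-column rows with a document-style collection. It uses plain Python dictionaries rather than any real NoSQL product, and the sample users and the `find` helper are invented for the example.

```python
# A relational table forces every row into the same fixed columns.
relational_rows = [
    ("u1", "Asha", "asha@example.com"),
    ("u2", "Ben", None),  # no email, so the column is simply empty
]

# A document store keeps each record as a self-describing document,
# so two records in one collection may have different shapes.
documents = {
    "u1": {"name": "Asha", "email": "asha@example.com", "tags": ["admin"]},
    "u2": {"name": "Ben", "devices": [{"type": "phone", "os": "android"}]},
}

def find(collection, predicate):
    """Return the ids of all documents for which predicate(doc) is true."""
    return [doc_id for doc_id, doc in collection.items() if predicate(doc)]

# Query on a nested field that only some documents have.
android_users = find(
    documents,
    lambda doc: any(dev.get("os") == "android" for dev in doc.get("devices", [])),
)
print(android_users)
```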

Tools:

  • Amazon DynamoDB
  • Google Cloud Platform
  • Oracle
  • Microsoft Azure
  • MarkLogic
  • MongoDB
  • DataStax
  • Aerospike
  • Couchbase
  • SequoiaDB
  • Redis Labs
  • InfluxData

NewSQL Databases:

NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance as NoSQL systems for online transaction processing (OLTP) read-write workloads, while still maintaining the ACID guarantees of a traditional database system.
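If the "A" in ACID (atomicity) feels abstract, the sketch below may help. It uses Python's built-in `sqlite3` module, which is not a NewSQL system; it is used here only because it ships with Python and shows the same all-or-nothing behaviour. The table, the account names and the `transfer` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance, so the credit is undone too.
        return False

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits bob first and then fails on alice, yet bob's balance is unchanged afterwards. That rollback is the guarantee NewSQL systems aim to keep while scaling out.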

Tools:

  • SAP
  • Clustrix
  • Pivotal
  • Paradigm4
  • MemSQL
  • MariaDB
  • VoltDB
  • Citus Data
  • DeepDB
  • Trafodion
  • Cockroach Labs
  • NuoDB

Graph Databases:

A graph database, also called a graph-oriented database, is a type of NoSQL database that uses graph theory to store, map and query relationships.
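Here is a minimal sketch of the kind of relationship query a graph model makes natural: "who can I reach within two hops?". It is plain Python over an adjacency list, not the API of any graph database, and the `follows` data and the `within_hops` function are hypothetical.

```python
from collections import deque

# Adjacency list: each key is a node, each value lists the nodes it points to.
follows = {
    "ann": ["bob", "cid"],
    "bob": ["dee"],
    "cid": ["dee", "eve"],
    "dee": [],
    "eve": ["ann"],
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable node to its hop distance from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {n: d for n, d in distance.items() if n != start}

reach = within_hops(follows, "ann", 2)
print(reach)
```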

Tools:

  • Neo4j
  • Apache Giraph
  • OrientDB
  • InfiniteGraph

MPP Databases:

MPP (massively parallel processing) is the coordinated processing of a program by multiple processors working on different parts of the program. Each processor has its own operating system and memory. MPP speeds the performance of huge databases that deal with massive amounts of data.
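The idea of "multiple processors working on different parts" can be sketched in a few lines. The example below only simulates it with threads inside one Python process, whereas real MPP nodes each have their own memory and operating system; `partition`, `local_sum` and `parallel_total` are names invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_workers):
    """Split rows into n_workers shards, round-robin."""
    return [rows[i::n_workers] for i in range(n_workers)]

def local_sum(shard):
    """The work one 'node' does independently on its own shard."""
    return sum(shard)

def parallel_total(rows, n_workers=4):
    shards = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_sum, shards))  # one partial result per shard
    return sum(partials)  # a coordinator combines the partial results

total = parallel_total(list(range(1, 101)))
print(total)
```

Each worker only ever touches its own shard, and only the small partial results travel back to be combined. That is the part that carries over to a real MPP database.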

Tools:

  • Teradata
  • Vertica
  • Netezza
  • Kognitio
  • Dremio

Cloud EDW:

A cloud EDW is an enterprise data warehouse delivered as a cloud service. Most on-premises data warehouse (DW) platforms are appliance-based, which makes them difficult to expand, and the resulting need to leave room for growth also makes them expensive to acquire. In the cloud, though, the economics are better, elasticity is realistic and logistics are streamlined. Amazon Redshift is a good example: combine those advantages with the ability to handle “big data” volumes using the familiar SQL/relational model that Redshift offers, and it’s hardly surprising that the service has been one of Amazon’s fastest growing since its launch.

Tools:

  • Amazon Web Services
  • Google Cloud Platform
  • Microsoft Azure
  • Pivotal
  • Snowflake
  • Waterline Data
  • InfoWorks

Data Transformation:

In the statistical sense, data transformation refers to the modification of every point in a data set by a mathematical function. When applying transformations, the measurement scale of the variable is modified, most often to put the data into the appropriate form for a particular statistical test or method. In big data work, the term also covers the broader job of cleaning and reshaping raw data into a form that analytics tools can consume.
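A minimal sketch of the "apply one mathematical function to every point" idea, using a base-10 logarithm from Python's standard library. The revenue figures are made up for the example.

```python
import math

def log_transform(values):
    """Apply log10 to every data point; all values must be positive."""
    return [math.log10(v) for v in values]

revenue = [10, 100, 1000, 10000]       # spans four orders of magnitude
transformed = log_transform(revenue)   # now evenly spaced on the new scale
print(transformed)
```

The values themselves are unchanged in meaning; only the measurement scale differs, which is why a skewed variable often becomes easier to analyze after a transformation like this.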

Tools:

  • Alteryx
  • Trifacta
  • Tamr
  • Paxata
  • StreamSets
  • Alation

Data Integration:

Data integration is the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. A complete data integration solution delivers trusted data from a variety of sources.
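As a small, hedged sketch of combining disparate sources, the example below merges records from two pretend systems on a shared key. The `crm` and `billing` lists, the `customer_id` key and the `integrate` helper are all assumptions for illustration; real integration tools also handle scheduling, data quality and much more.

```python
# Two sources that describe overlapping customers in different ways.
crm = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
billing = [
    {"customer_id": 1, "total_spend": 1200},
    {"customer_id": 3, "total_spend": 300},
]

def integrate(*sources, key="customer_id"):
    """Merge records that share the same key into one unified record per key."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    return [merged[k] for k in sorted(merged)]

unified = integrate(crm, billing)
for row in unified:
    print(row)
```

Customers known to only one system still show up, just with fewer fields, so gaps in the data stay visible.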

Tools:

  • Informatica
  • MuleSoft
  • SnapLogic
  • BedrockData

Management/Monitoring:

Big data management refers to the efficient handling, organization or use of large volumes of structured and unstructured data belonging to an organization.

Tools:

  • New Relic
  • AppDynamics
  • Amazon Web Services
  • Actifio
  • Numerify
  • Splunk
  • Datadog
  • Rocana
  • Anodot

Security:

So what can be done to help bring the security of traditional database management to big data? Several organizations describe and define different security controls. The list below contains several tools that we would recommend for addressing the security challenges presented by big data.

Tools:

  • Tanium
  • Illumio
  • Code42
  • DataGravity
  • CipherCloud
  • Vectra
  • Sqrrl
  • BlueTalon

Storage:

At root, the key requirements of big data storage are that it can handle very large amounts of data and keep scaling to keep up with growth and that it can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools.
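To see why IOPS matters for feeding analytics tools, a quick back-of-the-envelope calculation helps: throughput is roughly IOPS multiplied by the size of each I/O operation. The numbers below (20,000 IOPS, 64 KB blocks, a 1 TB data set) are made-up inputs, not figures for any product in the list.

```python
def throughput_mb_per_s(iops, block_size_kb):
    """Approximate throughput: operations per second times size per operation."""
    return iops * block_size_kb / 1024

def seconds_to_scan(dataset_gb, iops, block_size_kb):
    """Rough time to read a whole data set once at that throughput."""
    return dataset_gb * 1024 / throughput_mb_per_s(iops, block_size_kb)

rate = throughput_mb_per_s(iops=20_000, block_size_kb=64)
scan = seconds_to_scan(dataset_gb=1024, iops=20_000, block_size_kb=64)
print(f"{rate:.0f} MB/s, full scan in about {scan / 60:.0f} minutes")
```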

Tools:

  • Amazon Web Services
  • Google Cloud Platform
  • Microsoft Azure
  • Panasas
  • Nimble Storage
  • Qumulo

Application Development:

Application development is the development of a software product in a planned and structured process.

Tools:

  • Apigee
  • Cask
  • Keen IO
  • Typesafe
  • Concurrent

Crowd-Sourcing:

Crowdsourcing refers to a wide range of activities, providing different benefits for its organizers. Crowdsourcing in the form of idea competitions or innovation contests provides a way for organizations to learn beyond what their “base of minds” of employees provides (e.g., LEGO Ideas).

Tools:

  • Amazon Mechanical Turk
  • CrowdFlower
  • WorkFusion

I know the list is long, but so is the list of things you can learn from it. Right?! This story about the Infrastructure tree ends here. If you want to know about the rest of the trees, stay tuned. We will be back with a bang…
