What is Beyond Classic Hadoop? Is it Spark and Flink?
In this blog, we will explore two new big data companions to Hadoop: Apache Spark and Apache Flink.
If we look at improvements to Hadoop's parallel processing model, MapReduce, speed is the very first focus. MapReduce was designed and developed for big data batch processing; it is traditionally very I/O and CPU heavy, and therefore slow.
Apache Tez introduced an optimization mechanism that avoids many of the intermediate inputs and outputs MapReduce requires. By streaming data directly between processing stages, Tez brought a significant performance improvement and has slowly moved Hadoop from batch toward near real time. The community also delivered a great interface called Yet Another Resource Negotiator (YARN), so big data applications no longer need to go through the MapReduce framework alone. MapReduce did not adopt many of the ideas behind relational databases, particularly distributed RDBMSs, and two big data processing frameworks now focus on these gaps: Apache Spark, which moves processing into memory, and Apache Flink. Today both share a lot of common features, and together they take Hadoop to a next-generation processing system for real-time and streaming big data.
Apache Spark:
The most important feature is Spark's in-memory data structure, the Resilient Distributed Dataset (RDD). RDDs are resilient because the data sets can rebuild themselves if a failure or data corruption occurs: each RDD records the lineage of operations that produced it, giving us traceability across the entire data set. Spark also comes with a rich set of operators, many of them inspired by the Scala collections library. It is an efficient system, roughly 10x faster than Hadoop MapReduce on disk and up to 100x faster in memory, and its rich APIs in Scala, Java, and Python let developers and architects write two to five times less code.
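As a minimal sketch of how RDD operators and lineage look in practice (assuming a local Spark installation; the object name and input path are placeholders of our own), here is a word count in Scala:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example; "input.txt" is a placeholder path.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Each transformation builds a new RDD; the recorded lineage lets
    // Spark recompute lost partitions after a failure.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Nothing executes until an action such as collect() is called.
    counts.collect().foreach(println)
    sc.stop()
  }
}
```

Note how the operators (flatMap, map, reduceByKey) mirror the Scala collections style mentioned above.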
Apache Spark Key Components:
At the foundation we have the Spark Core engine, with multiple components on top of it. Spark SQL brings conventional relational database SQL skills to Spark. The Spark Streaming API is specially designed for data streaming and real-time big data applications, and it can handle both batch and real-time workloads. MLlib provides machine learning, with a wide variety of algorithms available as a library. GraphX covers all kinds of graph-based data processing, and SparkR brings the R language to Spark so that R programs can run in parallel over huge data sets.
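For instance, here is a minimal Spark SQL sketch (assuming the Spark 2.x SparkSession entry point; the file name and view name are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlSketch")
      .master("local[*]")
      .getOrCreate()

    // "people.json" is a placeholder; each line holds one JSON record.
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    // Conventional SQL, executed over distributed data.
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()
    spark.stop()
  }
}
```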
Data sharing in Apache Spark is very interesting: intermediate results can be kept in memory and reused across operations, which reduces I/O and brings more efficiency to big data applications.
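A minimal caching sketch (the log file path is a placeholder); both actions reuse the in-memory copy instead of rescanning the file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CachingSketch").setMaster("local[*]"))

    // cache() keeps the RDD in memory once the first action computes it.
    val logs = sc.textFile("logs.txt").cache() // "logs.txt" is a placeholder

    // The second count reads from memory rather than from disk.
    println("errors:   " + logs.filter(_.contains("ERROR")).count())
    println("warnings: " + logs.filter(_.contains("WARN")).count())
    sc.stop()
  }
}
```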
Apache Flink:
Apache Flink is a newer system in the Apache Hadoop ecosystem, and one of its interesting features is its execution model: a program is compiled into an execution plan, the plan is optimized, and only then is it executed. For those of us coming from the relational database world, this might not sound new. Apache Flink is the result of a European research project with a few goals: very high performance, a hybrid architecture that unifies batch and streaming in one runtime, and finally a simple framework for developers. Apache Spark and Apache Flink have points in common, such as heavy use of Scala, a functional programming language.
Apache Flink's internal data structure is called the DataSet, which is similar to Spark's RDD. A DataSet can be produced or recovered transparently, much like a Scala or Java collection. Also like an RDD, it is sometimes not materialized at all, and it can be updated within an iteration.
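A minimal word-count sketch with Flink's DataSet API in Scala (the input path is a placeholder; this assumes the batch DataSet API, which later Flink versions superseded):

```scala
import org.apache.flink.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // "input.txt" is a placeholder; readTextFile yields a DataSet[String].
    val counts = env.readTextFile("input.txt")
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)

    // Flink compiles this into an execution plan, optimizes it,
    // and executes it when print() is called.
    counts.print()
  }
}
```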
Apache Flink also ships with an optimizer that selects an execution plan, similar to what we have in relational databases; the system picks the optimal plan depending on properties such as the size of the input files. At runtime we can run Apache Flink standalone or on top of a Hadoop cluster, reading from the HDFS distributed data store. Nowadays many big data management tools integrate by default with Hadoop technologies, including Apache Spark and Apache Flink.
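To peek at what the optimizer produced, the batch environment can return the chosen plan as JSON before anything runs; a sketch reusing the hypothetical job above:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.api.java.io.DiscardingOutputFormat

object ShowPlan {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    env.readTextFile("input.txt") // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)
      .output(new DiscardingOutputFormat[(String, Int)]()) // plan needs a sink

    // Returns the optimized execution plan as a JSON string
    // without actually running the job.
    println(env.getExecutionPlan())
  }
}
```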
Advantages and Outlook:
Both systems share many advantages: iterative algorithms, caching, less I/O overhead than conventional Hadoop, unified batch and stream computing, and Scala as a natural, expressive language for big data, with other languages such as SQL also in the mix. And as we see more and more tools appear around big data systems, we have to beware of less mature components.
Reference: Architect and Build Big Data Apps, Ben Lorica
Interesting? Please subscribe to our blogs at www.dataottam.com to stay up to date on Big Data, Analytics, and IoT.
Let us have coffee@dataottam.com!