Channel: Kumar Chinnakali – dataottam

Apache Spark is Superstar; but it’s Supernova on Azure for Big Data Analytics Initiatives.


Dear Cloud & Data Community, Happy Christmas!

In this post I am happy to share some facts about Apache Spark, especially what makes it so special, a supernova, when it is spun up on Azure HDInsight. Big data is good, but it is very hard, from the first step of buying servers through to scaling up or down. Issues along the way include OSS installation, security handling, configuration optimization, debugging, and ultimately bringing big data analytics initiatives to success. HDInsight makes all of this easy.

Azure HDInsight is a fully managed, open source big data analytics service for enterprises. It is an Azure cloud service that makes it easy and fast to process massive amounts of data, and it supports a broad range of scenarios such as extract, load, and transform (ELT); extract, transform, and load (ETL); data warehousing; IoT; and machine learning. HDInsight clusters can be deployed on Windows or Linux, which is a great and unique capability. As of now, however, Apache Spark on HDInsight is not available on Windows; only Hadoop, HBase, and Storm are available on both operating systems, which may change in the future. And because it is a managed service, it can be up and running at enterprise level within a few hours.

Apache Spark is one of the most hyped projects in the Hadoop ecosystem, and it is making a real impact. Azure HDInsight supports Apache Spark as part of the same fully managed service. It is fully supported by Microsoft and Hortonworks with a 99.9% SLA. It is optimized for data exploration, and it has ODBC connectors for Power BI, Tableau, Qlik, SAP, and Excel.

Below is the high-level HDInsight architecture (thanks to the great Azure documentation).

[Figure: high-level HDInsight architecture]

Azure HDInsight Spark helps build big data analytics solutions such as batch processing, stream analytics, and interactive processing, and it supports storing and retrieving data in a robust manner. HDInsight is a Hadoop distribution from Hortonworks. There is a common misconception that it is a port of core Hadoop by Microsoft, but it is not. It is regular, open source Hadoop, not a special Microsoft version of Hadoop.

Apache Spark clusters on HDInsight can consume data from a wide variety of sources, such as SQL Databases, Azure Data Lake Store, Azure Blob Store, Azure Cosmos DB, SQL Data Warehouse, Couchbase, Elasticsearch, Hive Tables, MongoDB, Neo4j, Avro Files, CSV Files, JSON Files, LZO Compressed Files, Parquet Files, Redis, Riak, and Zip Files.
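
To make two of these sources a little more concrete, here is a minimal Python sketch of how the storage paths for Azure Blob Store and Azure Data Lake Store are typically written when a Spark job refers to them. It assumes the commonly used `wasb`/`wasbs` and `adl` URI schemes; the helper names and the account, container, and path values are hypothetical and only for illustration.

```python
# Minimal sketch: build storage URIs that a Spark job on HDInsight could read from.
# All account, container, and path names below are hypothetical examples.

def blob_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Return a wasb/wasbs URI for a path in an Azure Blob Store container."""
    scheme = "wasbs" if secure else "wasb"  # wasbs uses TLS, wasb does not
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def data_lake_uri(account: str, path: str) -> str:
    """Return an adl URI for a path in an Azure Data Lake Store account."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


if __name__ == "__main__":
    events = blob_uri("logs", "mystore", "/2017/12/events.parquet")
    sales = data_lake_uri("mylake", "/raw/sales.csv")
    print(events)
    print(sales)
    # On a cluster with a SparkSession named `spark`, these strings could then be
    # passed to a reader, for example spark.read.parquet(events) or spark.read.csv(sales).
```

The strings are kept separate from any Spark call so the same job can be pointed at a different container or account without touching the processing code.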

Now let’s look at the advantages of Apache Spark on Azure HDInsight:

  • Debugging Apache Spark jobs running on Azure HDInsight is very easy thanks to visualizations.
  • Provisioning is quick and hassle-free, and we can choose the right cluster size and hardware capacity.
  • HDInsight provides seamless integration with other Hadoop ecosystem components such as Hive, Pig, and HBase alongside Apache Spark, with no worries about version complexity.
  • Cluster tasks can be automated with easy and flexible PowerShell scripts or from the Azure command-line tool, and we can scale up, down, in, and out.
  • It can be used with an Azure virtual network to support isolation of cloud resources, or hybrid scenarios where we link cloud resources with our local data center.
  • Apache Spark doesn’t provide out-of-the-box support for building C# applications. However, Mobius, an open source project, enables implementing the Spark driver program and data processing operations in .NET languages such as C#, which makes C# a first-class citizen in Spark app development.
  • Storage: optimized for analytics, native WebHDFS-compatible storage, with no limits on size or scale.
  • Security: fully integrated with Azure Active Directory, with access control lists on files and folders.
  • It offers in-memory processing with Spark for interactive business intelligence, with the ability to scale in, out, up, and down on HDInsight Spark.
  • HDInsight can be deployed globally within minutes, with multi-region availability.
  • Security and compliance that enable OSS for enterprise Apache Spark applications.
  • It has a rich developer ecosystem with Eclipse, IntelliJ IDEA, RStudio, Zeppelin, Jupyter, and Visual Studio (my favorite).
  • It is recognized by top analysts such as Forrester in its Wave for big data analytics Hadoop/Spark cloud.
  • The HDInsight Application Platform is a big advantage: it helps us discover and install apps from the ecosystem with a one-click experience. Available productivity applications include Dataiku DSS, WANdisco, H2O AI, Paxata Self-Service Data Preparation, StreamSets Data Collector, AtScale Intelligence Platform, Cask CDAP, Datameer, Kyligence Analytics Platform, and KNIME Spark Job Server.
  • As of now, it offers the latest stable Spark 2.1.0 (HDI 3.6), with the option of enabling the Enterprise Security Package for the Apache Spark cluster.
  • With domain-joined HDInsight Spark clusters, we can create a Spark cluster joined to an Active Directory domain and configure a list of enterprise employees who can authenticate through Azure Active Directory to log on to the cluster, which secures Spark applications.
  • We can build a customized Apache Spark HDInsight cluster with our own size, settings, and apps, or use the prebuilt option to get started faster.
  • Overall, Apache Spark on HDInsight is a complete package for today’s real-world big data processing applications.

To conclude, using Apache Spark on Azure HDInsight helps us build clusters well, rather than merely build clusters. HDInsight Apache Spark makes this easy by being 100% open source, optimized, highly available, secure, scalable, dedicated, managed, customizable, and backed by certified ISVs. Hence Apache Spark on HDInsight makes big data easy by enabling can-do analytics on any data of any size, which is easier and more productive for all users and is enterprise-ready.

Run Apache Spark at enterprise grade and scale on HDInsight.

Please subscribe to the dataottam blog to keep yourself up to the minute on the ABC of Data (Artificial Intelligence, Big Data, Cloud Computing, and Cognitive).

Reach us via coffee@dataottam.com. Happy Azure and Happy Holidays!

