Big Data Meets Microsoft Azure!
For Big Data and cloud community members, this post, “Big Data, Meet Azure”, is all about doing big data on the Azure public cloud. We need no definitions of Big Data and Cloud Computing here, but in one line: I would call both of them supernovas of the data ecosystem.
Let’s start with a question: what was the initial name of Microsoft Azure?
- Windows Azure
- Blue Cloud
- Red Dog
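Let’s dive into the big data analytics pipeline architecture on Azure. The steps in the pipeline, each covered in its own section below, are: sources, stream ingest, batch ingest, persistent storage, stream processing, batch processing, insights and intelligence, serving storage, and serving clients, with security and governance cutting across all of them.
The Azure architecture maps a set of services onto each of these steps so that the analytics pipeline can deliver big data ROI.
Now let’s dive into each component of the big data analytics pipeline.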
Sources:
The source data can arrive in real time or in batches; in other words, the datasets can be unbounded or bounded. Either a push or a pull mechanism can be used to ingest the data into the acquisition layer.
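To make the bounded/unbounded and push/pull distinction concrete, here is a minimal Python sketch. The names (`bounded_source`, `unbounded_source`, `AcquisitionLayer`) are illustrative assumptions and not Azure APIs; a plain list stands in for the acquisition layer.

```python
import itertools
from typing import Iterable, Iterator, List


def bounded_source() -> List[dict]:
    """A bounded (batch) dataset: finite and fully known up front."""
    return [{"id": i, "value": i * 10} for i in range(3)]


def unbounded_source() -> Iterator[dict]:
    """An unbounded (streaming) dataset: a generator with no natural end."""
    for i in itertools.count():
        yield {"id": i, "value": i * 10}


class AcquisitionLayer:
    """Hypothetical ingest buffer that supports both pull and push."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def pull(self, source: Iterable[dict], limit: int) -> None:
        # Pull: the layer asks the source for up to `limit` records.
        self.records.extend(itertools.islice(source, limit))

    def push(self, record: dict) -> None:
        # Push: the source calls into the layer whenever it has a record.
        self.records.append(record)


layer = AcquisitionLayer()
layer.pull(bounded_source(), limit=100)   # batch: drains all 3 records
layer.pull(unbounded_source(), limit=2)   # stream: we must cap how much we read
layer.push({"id": 99, "value": 990})      # a device pushing a single event
print(len(layer.records))                 # prints 6
```

The point of the sketch is that a bounded source can be drained completely, while an unbounded one forces the reader to decide when to stop or to process records as they arrive.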
Stream Ingest:
Stream ingest can be achieved with the following Azure big data services. This layer is transient storage, where data is held only temporarily.
Event Hubs – Azure Event Hubs is a highly scalable data streaming platform and event ingestion service, capable of receiving and processing millions of events per second from connected IoT devices and applications. Event Hubs can process and store events, data, or telemetry produced by distributed software and devices. Data sent to an event hub can be transformed and stored using any real-time analytics provider or batching/storage adapters. With its ability to provide publish-subscribe capabilities with low latency and at massive scale, Event Hubs serves as the “on ramp” for Big Data. For more details…
Azure IoT Hub – Azure IoT Hub is a fully managed service that enables reliable and secure bidirectional communications between millions of IoT devices and a solution back end. It provides multiple device-to-cloud and cloud-to-device communication options, including one-way messaging, file transfer, and request-reply methods. It also provides built-in declarative message routing to other Azure services. For more details…
HDInsight Kafka – Apache Kafka is an open-source distributed streaming platform that can be used to build real-time streaming data pipelines and applications. Kafka also provides message broker functionality similar to a message queue, where you can publish and subscribe to named data streams. Kafka on HDInsight provides you with a managed, highly scalable, and highly available service in the Microsoft Azure cloud. For more details…
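The publish-subscribe model with named data streams described above can be sketched in a few lines of plain Python. This is a toy in-memory stand-in, not the Event Hubs or Kafka client API: `MiniBroker`, its method names, and the use of CRC32 to pick a partition are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


class MiniBroker:
    """Toy in-memory broker: named streams split into partitions."""

    def __init__(self, partitions: int = 4) -> None:
        self.partitions = partitions
        # stream name -> partition index -> ordered list of events
        self.log: Dict[str, List[List[str]]] = defaultdict(
            lambda: [[] for _ in range(partitions)]
        )

    def publish(self, stream: str, key: str, event: str) -> int:
        # Events with the same key always land in the same partition,
        # which preserves ordering per key.
        index = zlib.crc32(key.encode("utf-8")) % self.partitions
        self.log[stream][index].append(event)
        return index

    def read(self, stream: str, partition: int, offset: int) -> List[str]:
        # Consumers track their own offset and can re-read from any point.
        return self.log[stream][partition][offset:]


broker = MiniBroker(partitions=2)
p = broker.publish("telemetry", key="device-1", event="temp=21")
broker.publish("telemetry", key="device-1", event="temp=22")
print(broker.read("telemetry", p, offset=0))  # ['temp=21', 'temp=22']
print(broker.read("telemetry", p, offset=1))  # ['temp=22']
```

Because the log is append-only and readers keep their own offsets, several independent consumers can read the same stream at their own pace, which is what makes this style of ingest layer a good buffer in front of the processing engines.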
Batch Ingest:
In the batch ingest stack, we’ll look at the Azure services used for batch ingest operations:
Import/Export – The Azure Import/Export service helps to securely transfer large amounts of data to Azure Blob storage and Azure Files by shipping disk drives to an Azure data center. The service can also be used to transfer data from Azure storage to hard disk drives and ship them to our on-premises sites. Data from a single internal SATA disk drive can be imported either to Azure Blob storage or to Azure Files. For more details…
Data Factory – Azure Data Factory is a cloud data integration service that composes data storage, movement, and processing services into automated data pipelines. It helps to manage big data analytics pipelines, as well as to move and transform data for analysis. Data Factory can be driven through PowerShell, .NET, Python, the REST API, the Azure portal, or Resource Manager templates. For more details…
Persistent Storage:
There are multiple options for persistent storage, but we’ll look only at those relevant to big data. The services below help us achieve persistent storage.
Blob Storage – Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. Blob storage can be accessed through PowerShell, the CLI, .NET, Java, Python, Ruby, Node.js, or Storage Explorer. For more details…
Azure Data Lake Store – Azure Data Lake Store is a hyper-scale, Hadoop-compatible repository for analytics on data of any size, type, and ingestion speed. It is designed specifically for analytics on the stored data and is tuned for performance in data analytics scenarios. Out of the box, it includes the enterprise-grade capabilities (security, manageability, scalability, reliability, and availability) essential for real-world enterprise use cases. Data Lake Store is an Apache Hadoop file system compatible with the Hadoop Distributed File System (HDFS), and it works with the Hadoop ecosystem. It exposes a WebHDFS-compatible REST interface, so it can be accessed from Hadoop (available with an HDInsight cluster), and our existing HDInsight applications or services that use the WebHDFS API can integrate with it easily. Data stored in Data Lake Store can be analyzed with Hadoop analytic frameworks such as MapReduce or Hive, and HDInsight clusters can be provisioned and configured to access that data directly. For more details…
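As a rough illustration of what a WebHDFS-compatible REST interface looks like, the sketch below only builds request URLs and sends nothing over the network. The account name `mydatalake` and the helper `webhdfs_url` are hypothetical, and the `<account>.azuredatalakestore.net/webhdfs/v1` host pattern is an assumption based on the original Data Lake Store service; real requests would also need an Azure AD bearer token, which is omitted here.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style request URL for a Data Lake Store account.

    `account` is a hypothetical store name; the host pattern is an assumption.
    """
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1"
    query = urlencode({"op": op, **params})
    return f"{base}/{quote(path.strip('/'))}?{query}"


# List a folder (LISTSTATUS is a standard WebHDFS operation).
print(webhdfs_url("mydatalake", "/raw/events", "LISTSTATUS"))
# https://mydatalake.azuredatalakestore.net/webhdfs/v1/raw/events?op=LISTSTATUS
```

Because the interface follows WebHDFS conventions, the same `op=` style of request is what Hadoop tools already speak, which is why existing applications can integrate without a custom client.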
Stream Processing:
Once the ingest pipeline (stream and batch) and the transient and persistent storage are in place, we set up the stream processing engine using the Azure services below. It is not mandatory to use every service listed; the choice depends on the context and the use case.
Azure Stream Analytics – Azure Stream Analytics is a managed event-processing engine for setting up real-time analytic computations on streaming data. The data can come from devices, sensors, web sites, social media feeds, applications, infrastructure systems, and more. Stream Analytics helps us examine high volumes of data streaming from devices or processes, extract information from that data stream, and identify patterns, trends, and relationships. For more details…
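A typical real-time computation of this kind is a windowed aggregate. The sketch below mimics a tumbling (fixed, non-overlapping) window average in plain Python. Stream Analytics itself expresses such logic in a SQL-like query language, so the function name, the event shape, and the sample readings here are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_avg(
    events: Iterable[Tuple[float, str, float]], window_seconds: int
) -> Dict[Tuple[int, str], float]:
    """Average `value` per device over fixed, non-overlapping windows.

    Each event is (timestamp_seconds, device_id, value).
    """
    sums: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, device, value in events:
        # Every timestamp belongs to exactly one window.
        window_start = int(timestamp // window_seconds) * window_seconds
        sums[(window_start, device)] += value
        counts[(window_start, device)] += 1
    return {key: sums[key] / counts[key] for key in sums}


readings = [
    (0.5, "sensor-a", 20.0),
    (4.0, "sensor-a", 22.0),
    (11.0, "sensor-a", 30.0),
    (12.0, "sensor-b", 5.0),
]
print(tumbling_window_avg(readings, window_seconds=10))
# {(0, 'sensor-a'): 21.0, (10, 'sensor-a'): 30.0, (10, 'sensor-b'): 5.0}
```

A real streaming engine computes these results incrementally as events arrive and has to deal with late or out-of-order data; the batch-style loop above only shows how events are assigned to windows.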
HDInsight Spark / Storm – Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. Spark cluster on HDInsight is compatible with Azure Storage (WASB) as well as Azure Data Lake Store. Hence, our existing data stored in Azure can easily be processed via a Spark cluster. Apache Storm is a distributed, fault-tolerant, open-source computation system. You can use Storm to process streams of data in real time with Hadoop. Storm solutions can also provide guaranteed processing of data, with the ability to replay data that was not successfully processed the first time. For more details…
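The idea of guaranteed processing with replay can be sketched without Storm at all. The following plain-Python loop re-queues any item whose handler fails, up to a retry limit; `process_with_replay`, `flaky`, and the retry limit are hypothetical names and choices for illustration and are not Storm APIs.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def process_with_replay(
    items: List[str], handler: Callable[[str], None], max_attempts: int = 3
) -> Tuple[List[str], List[str]]:
    """Run handler on each item; re-queue failures until max_attempts."""
    queue: Deque[Tuple[str, int]] = deque((item, 1) for item in items)
    acked: List[str] = []
    dead: List[str] = []
    while queue:
        item, attempt = queue.popleft()
        try:
            handler(item)
            acked.append(item)                     # processed successfully
        except Exception:
            if attempt < max_attempts:
                queue.append((item, attempt + 1))  # replay later
            else:
                dead.append(item)                  # give up after max_attempts
    return acked, dead


seen = set()


def flaky(item: str) -> None:
    # Fails the first time it sees "b", succeeds on replay.
    if item == "b" and item not in seen:
        seen.add(item)
        raise RuntimeError("transient failure")


print(process_with_replay(["a", "b", "c"], flaky))
# (['a', 'c', 'b'], [])
```

Note that a replayed item may reach the handler more than once, so handlers in this style of system are usually written to tolerate repeated delivery.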
Batch Processing:
In this step of the pipeline, we’ll look at Azure Batch, Azure HDInsight Spark / Hive, Azure SQL Data Warehouse, and Azure Data Lake Analytics.
Azure Batch – Azure Batch is a platform service for running large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud. Azure Batch schedules compute-intensive work to run on a managed collection of virtual machines, and can automatically scale compute resources to meet the needs of our jobs. With Azure Batch, we can easily define Azure compute resources to execute our applications in parallel and at scale. We do not need to manually create, configure, and manage an HPC cluster, individual virtual machines, virtual networks, or a complex job and task scheduling infrastructure; Azure Batch automates or simplifies these tasks. For more details…
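The job-and-task model can be pictured with a local analogy. In the sketch below a thread pool from the Python standard library stands in for the managed pool of virtual machines; `render_frame` and `run_job` are hypothetical names, and nothing here calls the Azure Batch service.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def render_frame(frame: int) -> str:
    """Stand-in for one compute-intensive, independent task."""
    return f"frame-{frame:03d}.png"


def run_job(frames: List[int], pool_size: int) -> List[str]:
    # The executor plays the role of the managed pool of VMs;
    # pool_size is the knob an autoscaler would turn.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(render_frame, frames))


print(run_job([1, 2, 3], pool_size=2))
# ['frame-001.png', 'frame-002.png', 'frame-003.png']
```

The workload fits this model because each task is independent of the others, so adding workers shortens the job without changing the task code.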
Azure HDInsight Spark / Hive – Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. A Spark cluster on HDInsight is compatible with Azure Storage (WASB) as well as Azure Data Lake Store, so our existing data stored in Azure can easily be processed via a Spark cluster. When we create a Spark cluster on HDInsight, we create Azure compute resources with Spark installed and configured. It takes only about 10 minutes to create a Spark cluster in HDInsight. The data to be processed is stored in Azure Storage or Azure Data Lake Store. For more details…
Azure SQL Data Warehouse – SQL Data Warehouse is a cloud-based Enterprise Data Warehouse (EDW) that leverages Massively Parallel Processing (MPP) to quickly run complex queries across petabytes of data. We can use SQL Data Warehouse as a key component of a big data solution: import big data into SQL Data Warehouse with simple PolyBase T-SQL queries, and then use the power of MPP to run high-performance analytics. As we integrate and analyze, the data warehouse becomes the single version of truth our business can count on for insights and intelligence. For more details…
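The MPP idea (spread the rows across nodes, aggregate on each node, then combine) can be sketched in plain Python. This is a conceptual toy, not how SQL Data Warehouse is implemented: the function names, the CRC32 hash, and the node count are assumptions, and the "nodes" here run one after another rather than in parallel.

```python
import zlib
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (customer, amount)


def distribute(rows: List[Row], nodes: int) -> List[List[Row]]:
    """Hash-distribute rows across compute nodes by customer."""
    shards: List[List[Row]] = [[] for _ in range(nodes)]
    for customer, amount in rows:
        index = zlib.crc32(customer.encode("utf-8")) % nodes
        shards[index].append((customer, amount))
    return shards


def partial_totals(shard: List[Row]) -> Dict[str, float]:
    """Each node aggregates only its own shard."""
    totals: Dict[str, float] = {}
    for customer, amount in shard:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def total_by_customer(rows: List[Row], nodes: int = 4) -> Dict[str, float]:
    # Scatter the rows, aggregate per node, then gather the partial results.
    # Hashing on the grouping key puts each customer on exactly one node,
    # so the partial results never overlap and can simply be merged.
    merged: Dict[str, float] = {}
    for shard in distribute(rows, nodes):
        merged.update(partial_totals(shard))
    return merged


rows = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
print(sorted(total_by_customer(rows).items()))
# [('alice', 12.5), ('bob', 5.0)]
```

Choosing the distribution key to match the grouping key is what lets each node finish its part of the query without exchanging rows with the others.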
Azure Data Lake Analytics – Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data analytics. We can focus on writing, running, and managing jobs rather than on operating distributed infrastructure. Instead of deploying, configuring, and tuning hardware, we write queries to transform our data and extract valuable insights. The analytics service can handle jobs of any scale instantly by setting the dial for how much power we need. We pay for a job only while it is running, which makes it cost-effective. The analytics service supports Azure Active Directory, letting us manage access and roles, integrated with our on-premises identity system. It also includes U-SQL, a language that unifies the benefits of SQL with the expressive power of user code. U-SQL’s scalable distributed runtime enables us to efficiently analyze data in the store and across SQL Servers in Azure, Azure SQL Database, and Azure SQL Data Warehouse. For more details…
Insights & Intelligence:
Any big data analytics pipeline should focus on delivering insights and intelligence. On Azure, this can be achieved with services such as Cognitive Services and Azure Machine Learning. Azure has separate blades for AI and Cognitive Services. Under AI, we have Machine Learning Services, Azure Bot Service, and Azure Batch AI. Under Cognitive Services, we have Vision (Computer Vision, Content Moderator, Face API); Speech (Bing Speech Service, Translator Speech); Language (Bing Spell Check, Language Understanding (LUIS), Text Analytics, Translator Text); Knowledge (Recommendations); and Search (Bing News, Video Search, Image Search). In this discussion, we’ll look only at Machine Learning Services and Azure Bot Service.
Machine Learning Services – Azure Machine Learning is an integrated, end-to-end data science and advanced analytics solution. It enables data scientists to prepare data, develop experiments, and deploy models at cloud scale. Its components include Azure Machine Learning Workbench, Azure Machine Learning Experimentation Service, Azure Machine Learning Model Management Service, Microsoft Machine Learning Libraries for Apache Spark (MMLSpark Library), and Visual Studio Code Tools for AI. Together, these components help us significantly accelerate data science project development and deployment as we build big data analytics applications and services. For more details…
Azure Bot Service – Bot Service provides what we need to build, connect, test, deploy, monitor, and manage bots. It provides the core components for creating bots, including the Bot Builder SDK for developing bots and the Bot Framework for connecting bots to channels. Bot Service provides an integrated environment purpose-built for bot development: we can write a bot, connect, test, deploy, and manage it from our web browser with no separate editor or source control required. For simple bots, we may not need to write code at all. It is powered by the Bot Framework and offers two hosting plans. With the App Service plan, a bot is a standard Azure web app that we can set to allocate a predefined capacity, with predictable costs and scaling. With the Consumption plan, a bot is a serverless bot that runs on Azure Functions and uses the pay-per-run Azure Functions pricing. Bot Service also accelerates bot development with five bot templates we can choose from when we create a bot. For more details…
Serving Storage:
Once batch and stream processing are done, we must serve the clients from serving storage at very low latency. The Azure services that help us achieve this are Azure Cosmos DB, Azure SQL DB, Azure Redis Cache, Azure HDInsight HBase, Azure Search, Azure SQL Data Warehouse, and Azure Analysis Services.
Azure Cosmos DB – Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. With the click of a button, it enables us to elastically and independently scale throughput and storage across any number of Azure’s geographic regions. It offers throughput, latency, availability, and consistency guarantees backed by comprehensive service level agreements (SLAs), which Microsoft describes as something no other database service offers. We can also try Azure Cosmos DB for free, without an Azure subscription and without commitments. For more details…
Azure Redis Cache – Azure Redis Cache is based on the popular open source Redis cache. It gives us access to a secure, dedicated Redis cache, managed by Microsoft and accessible from any application within Azure. For more details ..
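One common way an application uses such a cache in a serving layer is the cache-aside pattern: look in the cache first and fall back to the slower store only on a miss. The source does not prescribe this pattern, so treat it as an assumption; in the sketch a plain dict stands in for Redis, and `CacheAside` and its `load` callback are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside lookup with a time-to-live, using a dict as the cache."""

    def __init__(self, load: Callable[[str], str], ttl_seconds: float) -> None:
        self.load = load                  # slow lookup, e.g. a database query
        self.ttl_seconds = ttl_seconds
        self.store: Dict[str, Tuple[float, str]] = {}
        self.misses = 0

    def get(self, key: str) -> str:
        entry = self.store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]               # cache hit
        self.misses += 1                  # miss or expired: go to the source
        value = self.load(key)
        self.store[key] = (time.monotonic() + self.ttl_seconds, value)
        return value


cache = CacheAside(load=lambda key: key.upper(), ttl_seconds=60)
cache.get("user:1")
cache.get("user:1")
print(cache.misses)  # prints 1
```

The second lookup is served from the cache, so the slow source is consulted only once until the entry expires.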
Azure Search – Azure Search is a search-as-a-service cloud solution that gives developers APIs and tools for adding a rich search experience over our content in web, mobile, and enterprise applications. Functionality is exposed through a simple REST API or .NET SDK that masks the inherent complexity of information retrieval. In addition to APIs, the Azure portal provides administration and content management support, with tools for prototyping and querying our indexes. Because the service runs in the cloud, infrastructure and availability are managed by Microsoft. For more details…
Analysis Services – Azure Analysis Services provides enterprise-grade data modeling in the cloud. It is a fully managed platform as a service (PaaS), integrated with Azure data platform services.
With Analysis Services, we can mashup and combine data from multiple sources, define metrics, and secure our data in a single, trusted semantic data model. The data model provides an easier and faster way for our users to browse massive amounts of data with client applications like Power BI, Excel, Reporting Services, third-party, and custom apps. For more details…
Power BI Embedded – This Azure service helps us serve clients. Power BI Embedded is intended to simplify how ISVs and developers use Power BI capabilities: it helps us quickly add stunning visuals, reports, and dashboards to our apps, much as apps built on Microsoft Azure use services like Machine Learning and IoT. By enabling easy-to-navigate data exploration in their apps, ISVs allow their customers to make quick, informed decisions in context. For more details…
Security and Governance:
Security and governance cut horizontally across all the services above. We can achieve them with HDInsight metadata stores, Azure Data Catalog, and Azure Active Directory.
Azure Data Catalog – Azure Data Catalog is a fully managed cloud service whose users can discover the data sources they need and understand the data sources they find. At the same time, Data Catalog helps organizations get more value from their existing investments. With Data Catalog, any user (analyst, data scientist, or developer) can discover, understand, and consume data sources. Data Catalog includes a crowdsourcing model of metadata and annotations. It is a single, central place for all of an organization’s users to contribute their knowledge and build a community and culture of data. For more details..
Azure Active Directory – Azure Active Directory (Azure AD) is Microsoft’s multi-tenant, cloud based directory and identity management service. Azure AD combines core directory services, advanced identity governance, and application access management. Azure AD also offers a rich, standards-based platform that enables developers to deliver access control to their applications, based on centralized policy and rules. Azure AD also includes a full suite of identity management capabilities including multi-factor authentication, device registration, self-service password management, self-service group management, privileged account management, role based access control, application usage monitoring, rich auditing and security monitoring and alerting. These capabilities can help secure cloud based applications, streamline IT processes, cut costs and help ensure that corporate compliance goals are met. For more details…
With that, let’s conclude. There is no one-size-fits-all solution for a big data pipeline on Azure; the above is one view, inspired by the Azure documentation, posts, and the Azure analytics book by Zoiner. The Azure ecosystem is constantly evolving, and new services and capabilities are released regularly, so we should base our decisions on the current Azure documentation, the use case, and the client context.
Let us have coffee@dataottam.com.
Please subscribe to the dataottam blog to keep yourself up-to-the-minute on the ABC of Data (Artificial Intelligence, Big Data, Cloud Computing, and Cognitive).
Happy Azure, Happy Holidays!