Tuesday, September 13, 2022

Architecture of Big Data

 

Big data architecture




 A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.

 

 


 

 

Big data solutions typically involve one or more of the following types of workload:

·         Batch processing of big data sources at rest.

·         Real-time processing of big data in motion.

·         Interactive exploration of big data.

·         Predictive analytics and machine learning.

Most big data architectures include some or all of the following components:

·         Data sources: All big data solutions start with one or more data sources. Examples include:

o    Application data stores, such as relational databases.

o    Static files produced by applications, such as web server log files.

o    Real-time data sources, such as IoT devices.

·         Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.

 

·         Batch processing: Because the data sets are so large, often a big data solution must process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually, these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster. (A minimal PySpark sketch of such a batch job appears after this list.)

 

·         Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. This might be a simple data store, where incoming messages are dropped into a folder for processing. However, many solutions need a message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.

 

 

·         Stream processing: After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually running SQL queries that operate on unbounded streams. You can also use open source Apache streaming technologies like Storm and Spark Streaming in an HDInsight cluster. (A minimal Spark Structured Streaming sketch appears after this list.)

 

·         Analytical data store: Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.

 

 

·         Analysis and reporting: The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use Microsoft R Server, either standalone or with Spark.

 

·         Orchestration: Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
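To make the batch-processing component described above more concrete, here is a minimal PySpark sketch of the kind of job that might run on an HDInsight Spark cluster: it reads raw source files, filters and aggregates them, and writes the output to new files. The paths, column names, and application name are illustrative assumptions, not part of any particular solution.

```python
# Minimal PySpark batch-processing sketch (illustrative only).
# The input/output paths and column names are hypothetical; on Azure they
# might be abfss:// or wasbs:// URIs pointing at a data lake or blob container.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-clickstream-batch").getOrCreate()

# Read the raw source files for one day of data.
raw = spark.read.json("/datalake/raw/clickstream/2022-09-13/")

# Filter and aggregate to prepare the data for analysis.
daily_counts = (
    raw.filter(F.col("status") == 200)
       .groupBy("page", F.to_date("timestamp").alias("day"))
       .agg(F.count("*").alias("views"))
)

# Write the prepared output to new files in the curated zone of the lake.
daily_counts.write.mode("overwrite").parquet("/datalake/curated/page_views/")

spark.stop()
```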
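Similarly, for the stream-processing component, a hedged sketch using Spark Structured Streaming (one of the open-source options mentioned above) could look like the following. The Kafka broker address, topic name, and message fields are assumptions made for illustration, and the job would need the Spark Kafka connector package available on the cluster.

```python
# Minimal Spark Structured Streaming sketch (illustrative only).
# Broker address, topic name, and JSON fields are hypothetical; requires the
# spark-sql-kafka connector package to be available on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("telemetry-stream").getOrCreate()

schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("eventTime", TimestampType()),
])

# Read an unbounded stream of messages from a Kafka topic.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "device-telemetry")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

# Aggregate over one-minute windows and write the results to an output sink
# (the console here; in practice a data lake path or analytical store).
windowed = (
    events.withWatermark("eventTime", "5 minutes")
          .groupBy(F.window("eventTime", "1 minute"), "deviceId")
          .agg(F.avg("temperature").alias("avg_temperature"))
)

query = windowed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```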

Azure includes many services that can be used in a big data architecture. They fall roughly into two categories:

·         Managed services, including Azure Data Lake Store, Azure Data Lake Analytics, Azure Synapse Analytics, Azure Stream Analytics, Azure Event Hubs, Azure IoT Hub, and Azure Data Factory.

 

·         Open-source technologies based on the Apache Hadoop platform, including HDFS, HBase, Hive, Pig, Spark, Storm, Oozie, Sqoop, and Kafka. These technologies are available on Azure in the Azure HDInsight service.

These options are not mutually exclusive, and many solutions combine open-source technologies with Azure services.

Today, Big Data is meeting D2D. Put simply, D2D stands for:

·         Data-to-Decisions

·         Data-to-Discovery

·         Data-to-Dollars 

 

When to use this architecture

 

Consider this architecture style when you need to:

·         Store and process data in volumes too large for a traditional database.

·         Transform unstructured data for analysis and reporting.

·         Capture, process, and analyze unbounded streams of data in real time, or with low latency.

·         Use Azure Machine Learning or Microsoft Cognitive Services.

 

Benefits

·         Technology choices. You can mix and match Azure managed services and Apache technologies in HDInsight clusters, to capitalize on existing skills or technology investments.

·         Performance through parallelism. Big data solutions take advantage of parallelism, enabling high-performance solutions that scale to large volumes of data.

·         Elastic scale. All of the components in the big data architecture support scale-out provisioning, so that you can adjust your solution to small or large workloads, and pay only for the resources that you use.

·         Interoperability with existing solutions. The components of the big data architecture are also used for IoT processing and enterprise BI solutions, enabling you to create an integrated solution across data workloads.


Challenges

·         Complexity. Big data solutions can be extremely complex, with numerous components to handle data ingestion from multiple data sources. It can be challenging to build, test, and troubleshoot big data processes. Moreover, there may be a large number of configuration settings across multiple systems that must be used in order to optimize performance.

 

·         Skillset. Many big data technologies are highly specialized, and use frameworks and languages that are not typical of more general application architectures. On the other hand, big data technologies are evolving new APIs that build on more established languages. For example, the U-SQL language in Azure Data Lake Analytics is based on a combination of Transact-SQL and C#. Similarly, SQL-based APIs are available for Hive, HBase, and Spark.

 

 

·         Technology maturity. Many of the technologies used in big data are evolving. While core Hadoop technologies such as Hive and Pig have stabilized, emerging technologies such as Spark introduce extensive changes and enhancements with each new release. Managed services such as Azure Data Lake Analytics and Azure Data Factory are relatively young, compared with other Azure services, and will likely evolve over time.

 

·         Security. Big data solutions usually rely on storing all static data in a centralized data lake. Securing access to this data can be challenging, especially when the data must be ingested and consumed by multiple applications and platforms.

 

Best practices

·         Leverage parallelism. Most big data processing technologies distribute the workload across multiple processing units. This requires that static data files are created and stored in a splittable format. Distributed file systems such as HDFS can optimize read and write performance, and the actual processing is performed by multiple cluster nodes in parallel, which reduces overall job times.

 

·         Partition data. Batch processing usually happens on a recurring schedule — for example, weekly or monthly. Partition data files, and data structures such as tables, based on temporal periods that match the processing schedule. That simplifies data ingestion and job scheduling, and makes it easier to troubleshoot failures. Also, partitioning tables that are used in Hive, U-SQL, or SQL queries can significantly improve query performance.

 


·         Apply schema-on-read semantics. Using a data lake lets you combine storage for files in multiple formats, whether structured, semi-structured, or unstructured. Use schema-on-read semantics, which project a schema onto the data when the data is processed, not when the data is stored. This builds flexibility into the solution, and prevents bottlenecks during data ingestion caused by data validation and type checking. (A minimal PySpark sketch of schema-on-read over partitioned files appears after this list.)

·         Process data in-place. Traditional BI solutions often use an extract, transform, and load (ETL) process to move data into a data warehouse. With larger volumes of data, and a greater variety of formats, big data solutions generally use variations of ETL, such as transform, extract, and load (TEL). With this approach, the data is processed within the distributed data store, transforming it to the required structure, before moving the transformed data into an analytical data store.

 

·         Balance utilization and time costs. For batch processing jobs, it's important to consider two factors: The per-unit cost of the compute nodes, and the per-minute cost of using those nodes to complete the job. For example, a batch job may take eight hours with four cluster nodes. However, it might turn out that the job uses all four nodes only during the first two hours, and after that, only two nodes are required. In that case, running the entire job on two nodes would increase the total job time, but would not double it, so the total cost would be less. In some business scenarios, a longer processing time may be preferable to the higher cost of using underutilized cluster resources.

 

·         Separate cluster resources. When deploying HDInsight clusters, you will normally achieve better performance by provisioning separate cluster resources for each type of workload. For example, although Spark clusters include Hive, if you need to perform extensive processing with both Hive and Spark, you should consider deploying separate dedicated Spark and Hadoop clusters. Similarly, if you are using HBase and Storm for low latency stream processing and Hive for batch processing, consider separate clusters for Storm, HBase, and Hadoop.

 

·         Orchestrate data ingestion. In some cases, existing business applications may write data files for batch processing directly into Azure storage blob containers, where they can be consumed by HDInsight or Azure Data Lake Analytics. However, you will often need to orchestrate the ingestion of data from on-premises or external data sources into the data lake. Use an orchestration workflow or pipeline, such as those supported by Azure Data Factory or Oozie, to achieve this in a predictable and centrally manageable fashion.

 

·         Scrub sensitive data early. The data ingestion workflow should scrub sensitive data early in the process, to avoid storing it in the data lake.
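As a hedged illustration of the partitioning and schema-on-read practices above, the following PySpark sketch projects a schema onto raw JSON files at read time and writes the curated output partitioned by a date column. The paths and fields are hypothetical.

```python
# Minimal schema-on-read and partitioning sketch (illustrative only).
# Paths, table layout, and column names are hypothetical assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, DateType)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The schema is projected onto the raw files when they are read,
# not enforced when the data is first stored in the lake.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("order_date", DateType()),
])

orders = spark.read.schema(schema).json("/datalake/raw/orders/")

# Partition the curated output by a temporal column that matches the
# batch-processing schedule; this simplifies ingestion and speeds up queries.
(orders.write.mode("overwrite")
       .partitionBy("order_date")
       .parquet("/datalake/curated/orders/"))

spark.stop()
```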

IoT architecture

The Internet of Things (IoT) is a specialized subset of big data solutions. The following diagram shows a possible logical architecture for IoT. The diagram emphasizes the event-streaming components of the architecture.




The cloud gateway ingests device events at the cloud boundary, using a reliable, low latency messaging system.

Devices might send events directly to the cloud gateway, or through a field gateway. A field gateway is a specialized device or software, usually collocated with the devices, that receives events and forwards them to the cloud gateway. The field gateway might also preprocess the raw device events, performing functions such as filtering, aggregation, or protocol transformation.

After ingestion, events go through one or more stream processors that can route the data (for example, to storage) or perform analytics and other processing.

The following are some common types of processing. (This list is certainly not exhaustive.)

·         Writing event data to cold storage, for archiving or batch analytics.

·         Hot path analytics, analyzing the event stream in (near) real time, to detect anomalies, recognize patterns over rolling time windows, or trigger alerts when a specific condition occurs in the stream. (A minimal sketch of this kind of hot-path processing appears after this list.)

·         Handling special types of non-telemetry messages from devices, such as notifications and alarms.

·         Machine learning.
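As a hedged sketch of the hot-path analytics described above, the following pure-Python example applies a simple rolling-window rule to device events and raises an alert when a reading deviates too far from the recent average. The event structure, window size, and threshold are assumptions; in a real solution the events would arrive from the cloud gateway (for example Event Hubs, IoT Hub, or Kafka) rather than an in-memory list.

```python
# Minimal hot-path analytics sketch (illustrative only).
# Event fields, window size, and threshold are hypothetical assumptions.
from collections import deque
from statistics import mean

WINDOW_SIZE = 60          # keep the last 60 readings per device
ALERT_THRESHOLD = 10.0    # alert if a reading deviates this far from the rolling mean

windows = {}  # deviceId -> deque of recent temperature readings

def process_event(event):
    """Update the rolling window for a device and raise an alert on anomalies."""
    window = windows.setdefault(event["deviceId"], deque(maxlen=WINDOW_SIZE))
    if len(window) >= 5 and abs(event["temperature"] - mean(window)) > ALERT_THRESHOLD:
        print(f"ALERT: anomalous reading from {event['deviceId']}: {event['temperature']}")
    window.append(event["temperature"])

# In a real pipeline these events would be read from the ingestion service;
# here a small in-memory sample stands in for the stream.
sample_events = [
    {"deviceId": "sensor-1", "temperature": t}
    for t in [21.0, 21.2, 20.9, 21.1, 21.0, 45.3, 21.2]
]
for evt in sample_events:
    process_event(evt)
```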

The boxes that are shaded gray show components of an IoT system that are not directly related to event streaming, but are included here for completeness.

·         The device registry is a database of the provisioned devices, including the device IDs and usually device metadata, such as location.

·         The provisioning API is a common external interface for provisioning and registering new devices.

·         Some IoT solutions allow command and control messages to be sent to devices.

What are the Big Data Architecture Layers?

 

 

 

In this architecture there are eight layers, which together ensure a secure flow of data:

·         Ingestion Layer

·         Collector Layer

·         Consumption Layer

·         Assimilation Layer

·         Processing Layer 

·         Storage Layer 

·         Query Layer 

·         Visualization Layer

Ingestion Layer

This layer is the first step in the journey of data arriving from various sources. Here the data is prioritized and categorized, so that it flows smoothly through the subsequent layers of the process.

Collector Layer

This layer focuses on transporting the data from the ingestion layer to the rest of the data pipeline. It is the layer where the data is broken into components so that analytic processing can begin.

Consumption Layer

This layer receives the data collected from the various sources. Data consumption is the method of prioritizing and dividing the data so that it can flow effectively and easily through the layers of the consumption process flow.

Assimilation Layer

In this layer, data is transported from the consumption layer to the other data pipelines for analysis. The breakdown of data into its components also occurs in this layer.

 

Processing Layer

This layer focuses on the pipeline processing system: the information collected in the previous layers is processed here. The data is routed to different destinations, the data flow is classified, and this is the first point at which analytics may occur.

 

Storage Layer

Storage becomes a challenge when the size of the data you are dealing with grows large. Several approaches, such as established data ingestion patterns, can help with this problem. This layer focuses on where to store such large volumes of data efficiently.

Query Layer

This is the layer where active analytic processing takes place. Its primary focus is to extract value from the data and make it more useful for the next layer.

Visualization Layer

The visualization, or presentation, tier is probably the most visible tier, where users of the data pipeline can finally see the value of the data. It needs to grab people's attention, pull them in, and make the findings easy to understand.

 

 

What are the Ingestion tools?

Flume 

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a straightforward and flexible architecture based on streaming data flows. Apache Flume is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications and an ingestion process flow. The functions of Apache Flume are:

·         Stream Data: Ingest streaming information from multiple sources into Hadoop for storage and analysis.

·         Insulate System: Buffer the storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination.

·         Scale Horizontally: For new Ingestion streams and additional volume as needed.

Apache NiFi

·         Apache NiFi is another of the best ingestion tools, providing an easy-to-use, powerful, and reliable system to process and distribute information. It supports robust and scalable directed graphs of routing, transformation, and system mediation logic. The functions of Apache NiFi are:

·         Track information flow from beginning to end.

·         Provide a seamless experience between design, control, feedback, and monitoring.

·         Secure the flow through SSL, SSH, HTTPS, and encrypted content.

Elastic Logstash

·         Elastic Logstash is an open-source, server-side data processing pipeline that ingests information from many sources simultaneously, transforms it, and then sends it to your "stash," i.e., Elasticsearch. The functions of Elastic Logstash are:

·         Easily ingest from your logs, metrics, web applications, and data stores, as well as multiple AWS services, in a continuous, streaming fashion.

·         Ingest data of all shapes, sizes, and sources.

 

 

What is a Data Ingestion Framework?

 

Apache Gobblin

·         Apache Gobblin is a unified framework for extracting, transforming, and loading large volumes of data from various sources. It can ingest data from different sources within the same execution framework and manages the metadata of the different sources in one place. Gobblin combines this with features such as auto-scalability, fault tolerance, quality assurance, extensibility, and the ability to handle data model evolution, making it an easy-to-use, self-service, and efficient ingestion framework. Explore Apache Gobblin.

Big data technology stack layers

 Data layer

·         The technology at the ground level stores the flood of new data coming from traditional sources such as OLTP systems, as well as newer, less structured sources such as log files, sensors, website analytics, documents, and media stores. The storage in this layer takes place in the cloud or on local and virtual resources. Data storage systems such as Amazon S3, Hadoop HDFS, and MongoDB belong to this layer; organisations use them on a large scale to store data efficiently and in an organised manner.

Data ingestion layer 

·         If the aim is to build a large data store, importing the data from its sources of origin into the data layer is essential, and a data pipeline is needed for this ingestion. A rich ecosystem of big data integration tools is available, including powerful tools that extract the data from its original locations, modify it, and load it into a destination system of one's choice, such as a data warehouse. Examples of tools used to ingest big data are Stitch, Blendo, Apache Kafka, etc. (A minimal Kafka producer sketch follows.)
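To illustrate the ingestion step, here is a minimal sketch using the kafka-python client with Apache Kafka, one of the tools named above. The broker address, topic name, and record contents are assumptions made for the example.

```python
# Minimal Kafka ingestion sketch using the kafka-python client (illustrative only).
# The broker address, topic name, and record contents are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Publish a few application events into the ingestion layer.
events = [
    {"user_id": 42, "action": "page_view", "page": "/pricing"},
    {"user_id": 7, "action": "purchase", "amount": 19.99},
]
for event in events:
    producer.send("app-events", value=event)

producer.flush()   # make sure buffered records are actually delivered
producer.close()
```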

Data warehousing layer

·         The data warehousing layer optimises the data to enable easy and smooth analysis, and a query engine helps run the queries. This layer crunches the numbers so that analysis can take place; analysts and data scientists use it to run SQL queries on massive volumes of data. Some big data workloads also need extra computing capacity to run SQL queries. If you want to process data at optimum scale, data warehouse tools are the best option, while if you want to store a large amount of raw data to accommodate several use cases, go for a data lake. If you want to process and analyse the data stored in the data lake, you will require other technologies to assist the data lake.

·         Some of the most widely used engines and cloud data warehouses in this layer are Apache Spark, Amazon Redshift, SQream, PostgreSQL, etc. (A minimal query sketch follows.)
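As a hedged sketch of how an analyst might query this layer, the following example uses psycopg2 to run an aggregate SQL query against a PostgreSQL-compatible warehouse such as Amazon Redshift. The connection details, schema, and table are hypothetical.

```python
# Minimal warehouse query sketch using psycopg2 (illustrative only).
# Connection parameters, schema, table, and columns are hypothetical assumptions.
import psycopg2

conn = psycopg2.connect(
    host="warehouse.example.com",
    dbname="analytics",
    user="analyst",
    password="secret",
)

with conn, conn.cursor() as cur:
    # Aggregate daily revenue from a curated fact table.
    cur.execute(
        """
        SELECT order_date, SUM(amount) AS revenue
        FROM curated.orders
        GROUP BY order_date
        ORDER BY order_date
        """
    )
    for order_date, revenue in cur.fetchall():
        print(order_date, revenue)

conn.close()
```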

Data analytics layer

·         After the data is collected by the data layer, it is combined by the integration layer; at the same time, it is optimised and organised, and queries are run against it by the data processing layer. Finally, the data analytics and BI layer is needed to make appropriate decisions based on the data. By working with this layer, you can run queries that answer the questions the business asks, examine the data from different viewpoints, develop dashboards, and create advanced and attractive visualisations.

·         Users can employ BI tools such as Looker, Tableau, Power BI, etc., to take the business side of their data analytics to the next level.

YouTube link : Hadoop series 1 - YouTube

 Follow 👉 syed ashraf quadri👈 for awesome stuff

 
