Cloud Stable: Big Data Tools and Techniques

Big data tools can be grouped into the four categories below, based on how they are used in practice.

1.    Massively Parallel Processing (MPP)
2.    NoSQL Databases
3.    Distributed Storage and Processing Tools
4.    Cloud Computing Tools

Massively Parallel Processing (MPP)
Massively parallel processing (MPP) is a loosely coupled, or shared-nothing, architecture. It divides a large job into discrete pieces and spreads them across many processors, which work on their pieces in parallel.

Each processor works on a separate task, runs its own copy of the operating system, and does not share memory with the others. An application may run on 200 or more processors connected by a high-speed network. Because nothing is shared, the processors coordinate through a messaging interface that carries commands between them.
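The divide, process in parallel, and merge pattern described above can be sketched in a few lines of Python. This is only an illustration on one machine: threads stand in for the separate processors, and the function names (partition, local_sum, mpp_sum) are made up for this example rather than taken from any MPP product.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, node_count):
    """Split rows round-robin so each node owns a disjoint slice."""
    return [rows[i::node_count] for i in range(node_count)]


def local_sum(rows):
    """Work done by one node, using only the slice it owns."""
    return sum(rows)


def mpp_sum(rows, node_count=4):
    """Scatter the rows, aggregate on every node in parallel, then merge."""
    slices = partition(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_sum, slices))
    # The coordinator only ever sees the small partial results.
    return sum(partials), partials


total, partials = mpp_sum(list(range(1, 101)))
print(total)  # 5050
```

The coordinator receives one partial result per node instead of the raw rows, which is what lets this style of processing scale as nodes are added.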

MPP-based databases include IBM Netezza, Oracle Exadata, Teradata, SAP HANA, and EMC Greenplum.

NoSQL Databases
Relational databases require data to fit a predefined structure (a schema) before it can be stored. NoSQL ("not only SQL") databases relax that requirement: they store unstructured and semi-structured data and can hold heterogeneous records from the same domain side by side. NoSQL databases offer flexible configuration and scale well when handling large quantities of data. Storage is distributed, so data can be made available locally or remotely.

NoSQL databases include the following categories:

1.    Key-value Pair Based
2.    Graph-based
3.    Column-oriented
4.    Document-oriented

Key-value model: Much as dictionaries, collections, and associative arrays use hash tables in programming languages, this kind of database stores each item as a unique key-value pair. The key is unique and is used to retrieve and update data; the value holds the information itself and may be a string, a char, JSON, or a BLOB. No schema is required. Redis, Dynamo, and Riak are key-value stores.
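A minimal in-memory sketch of the key-value idea follows. The KeyValueStore class and its put/get/delete methods are hypothetical names chosen for illustration; they are not the API of Redis, Dynamo, or Riak.

```python
class KeyValueStore:
    """Toy schema-less store: unique keys mapped to opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Writing to an existing key overwrites (updates) its value.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("user:1", '{"name": "Asha"}')   # JSON text as the value
store.put("avatar:1", b"\x89PNG")         # binary data (a BLOB) as the value
store.put("user:1", '{"name": "Asha K"}')  # same key, so this is an update
print(store.get("user:1"))  # {"name": "Asha K"}
```

The store never inspects the values, which is why strings, JSON, and BLOBs can sit side by side without a schema.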

Graph-based model: Graph databases store both entities and the relationships between them, and they are multi-relational. Entities are stored as nodes, and the relationships between them are stored as edges (links). Graph databases are used for mapping, transportation, social network, and spatial data applications. They may also be used to discover patterns in semi-structured and unstructured data. Neo4j, FlockDB, and OrientDB are graph databases.
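To make the node and edge vocabulary concrete, here is a small sketch of a directed, labelled graph with a breadth-first traversal, the kind of query a social network might run. The Graph class, the FOLLOWS label, and the names in the data are all invented for this example; real graph databases expose their own query languages.

```python
from collections import deque


class Graph:
    """Toy directed graph: nodes are entities, edges are labelled relationships."""

    def __init__(self):
        self._edges = {}  # node -> list of (label, neighbour)

    def add_edge(self, source, label, target):
        self._edges.setdefault(source, []).append((label, target))
        self._edges.setdefault(target, [])  # register the target as a node too

    def within_hops(self, start, max_hops):
        """Return every node reachable from start in at most max_hops edges."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for _label, neighbour in self._edges.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
        seen.discard(start)
        return seen


g = Graph()
g.add_edge("alice", "FOLLOWS", "bob")
g.add_edge("bob", "FOLLOWS", "carol")
g.add_edge("carol", "FOLLOWS", "dave")
print(sorted(g.within_hops("alice", 2)))  # ['bob', 'carol']
```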

Column-based NoSQL database: Columnar databases are organised around columns. Where a relational database stores data as tables of rows, a columnar database stores it as sets of columns. Each column is treated independently, its values are stored contiguously, and a column may have no value for a given row. Because a single column is easy to access on its own, columnar databases are efficient at summarisation jobs such as SUM, COUNT, AVG, MIN, and MAX.
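The sketch below shows why a column layout suits summarisation: once the rows are pivoted into columns, an aggregate only has to read one contiguous list. The sample rows and the helper names to_columns and aggregate are assumptions made for illustration, and the sketch assumes every row carries the same field names.

```python
rows = [
    {"id": 1, "region": "north", "amount": 120.0},
    {"id": 2, "region": "south", "amount": 80.0},
    {"id": 3, "region": "north", "amount": None},  # missing value
    {"id": 4, "region": "east", "amount": 200.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def aggregate(column):
    """Summarise one column, skipping missing (None) values."""
    present = [v for v in column if v is not None]
    if not present:
        return {"SUM": 0, "COUNT": 0, "AVG": None, "MIN": None, "MAX": None}
    return {
        "SUM": sum(present),
        "COUNT": len(present),
        "AVG": sum(present) / len(present),
        "MIN": min(present),
        "MAX": max(present),
    }


columns = to_columns(rows)
summary = aggregate(columns["amount"])  # touches only the "amount" column
print(summary["SUM"], summary["COUNT"])  # 400.0 3
```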

Column-family databases are also known as wide-column stores, columnar databases, or column stores. They are used for data warehouses, business intelligence, CRM, and library card catalogues.

Cassandra, HBase, and Hypertable are NoSQL databases that use columnar storage.

Document-Oriented NoSQL database: A document-oriented database stores data as documents rather than as rows in tables. Each document is a set of key-value pairs encoded in JSON or XML. E-commerce applications, blogging platforms, real-time analytics, and content management systems (CMS) are among the applications that benefit from these databases.
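As a rough illustration of the document model, the sketch below keeps JSON-style documents with different fields in one collection and filters them by field value. DocumentCollection, insert, and find are hypothetical names for this example, not the API of MongoDB or CouchDB.

```python
import json


class DocumentCollection:
    """Toy collection of JSON documents; documents need not share fields."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, document):
        doc_id = self._next_id
        self._next_id += 1
        # Round-trip through JSON to store a copy and reject non-JSON values.
        self._docs[doc_id] = json.loads(json.dumps(document))
        return doc_id

    def find(self, **criteria):
        """Return documents whose fields equal every given criterion."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


posts = DocumentCollection()
posts.insert({"type": "post", "title": "Hello", "tags": ["intro"]})
posts.insert({"type": "product", "name": "Lamp", "price": 25})
posts.insert({"type": "post", "title": "Second", "author": "sam"})

titles = [doc["title"] for doc in posts.find(type="post")]
print(titles)  # ['Hello', 'Second']
```

A blog post and a product record live in the same collection here, which is the flexibility that makes this model attractive for CMS and e-commerce data.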

NoSQL document databases include MongoDB, CouchDB, Amazon SimpleDB, Riak, and Lotus Notes.

Distributed Storage and Processing Tools

A distributed database is a set of data storage chunks distributed over a network of computers. Each site, such as a data centre, may have its own processing units. The sites may be physically located in the same place or dispersed over an interconnected network of computers. A distributed database is either homogeneous (the same software and hardware across all instances) or heterogeneous (a variety of software and hardware across instances).
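One common way to spread storage chunks over a network is to route each record to a node by hashing its key. The sketch below assumes that approach; the node names and the helpers shard_for and distribute are invented for illustration, and real systems typically add replication and rebalancing on top.

```python
import zlib


def shard_for(key, node_names):
    """Pick a node deterministically from the key's CRC32 checksum."""
    checksum = zlib.crc32(key.encode("utf-8"))
    return node_names[checksum % len(node_names)]


def distribute(records, node_names):
    """Place every record on exactly one node."""
    placement = {name: {} for name in node_names}
    for key, value in records.items():
        placement[shard_for(key, node_names)][key] = value
    return placement


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
records = {f"order:{n}": n * 10 for n in range(1, 11)}
placement = distribute(records, nodes)

for name, chunk in placement.items():
    print(name, sorted(chunk))
```

Because the checksum is deterministic, any client can work out which node holds a key without asking a central directory.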

Leading big data storage and processing platforms include Hadoop HDFS, Snowflake, Qubole, Apache Spark, Azure HDInsight, Azure Data Lake, Amazon EMR, Google BigQuery, Google Cloud Dataflow, and MS SQL.

Distributed storage and processing commonly relies on the following tools:

The purpose of working with big data is to process it, and Hadoop is one of the most effective solutions for this purpose. Hadoop is an open-source software project from the Apache Foundation that is used to store, process, and analyse huge amounts of data. Data loading and processing tools are needed alongside it to handle that volume.
The MapReduce programming model organises huge quantities of data into well-defined sets: a map step turns input records into key-value pairs, and a reduce step combines the values collected for each key.
Spark, another open-source project from the Apache Foundation, aims to speed up these distributed computations.
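The MapReduce model mentioned above is easiest to see with the classic word-count example. This is a single-process sketch of the three phases (map, shuffle, reduce) in plain Python; it does not use Hadoop or Spark, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data tools", "Big data techniques", "data"])
print(counts["data"], counts["big"])  # 3 2
```

In a real cluster the map calls run on the nodes that hold the input blocks and the reduce calls run in parallel per key; here they simply run one after another.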

Cloud Computing Tools

Cloud computing tools are network-based computing services delivered over the Internet. They draw on a shared pool of configurable computing resources that is available at any time and from anywhere. The service provider supplies these resources on demand, and customers pay for what they use. This model is very useful for handling large amounts of data.

Amazon Web Services (AWS) is the most popular cloud computing platform, followed by Microsoft Azure, Google Cloud, Blob Storage, and Databricks. Oracle, IBM, and Alibaba also offer popular cloud computing platforms.

YouTube link : Hadoop series 1 - YouTube

Follow 👉 syed ashraf quadri👈 for awesome stuff


Wednesday, September 14, 2022
