Monday, September 5, 2022

File format use in Big Data

 

 Advantages of using appropriate file formats:

Faster read

Faster write

Support for splittable files

Schema evolution can be supported

Advanced compression can be achieved

Some things to consider when choosing the format are:

The structure of your data: Some formats support nested data, such as JSON, Avro, or Parquet, and others do not. Even the ones that do may not be highly optimized for it. Avro is the most efficient format for nested data; I recommend avoiding Parquet nested types because they are very inefficient. Processing nested JSON is also very CPU intensive. In general, it is recommended to flatten the data when ingesting it, as in the sketch at the end of this list.

 

Performance: Some formats, such as Avro and Parquet, perform better than others, such as JSON. Even between Avro and Parquet, one will be better than the other depending on the use case. For example, since Parquet is a column-based format, it is great for querying your data lake using SQL, whereas Avro is better for row-level ETL transformations.

 

Ease of reading: Consider whether you need people to read the data or not. JSON and CSV are text formats and are human-readable, whereas more performant formats such as Parquet or Avro are binary.

 

Compression: Some formats offer higher compression rates than others.

 

Schema evolution: Adding or removing fields is far more complicated in a data lake than in a database. Some formats, like Avro or Parquet, provide some degree of schema evolution, which allows you to change the data schema and still query the data. Tools such as the Delta Lake format provide even better ways to deal with schema changes.

 

Compatibility: JSON and CSV are widely adopted and compatible with almost any tool, while more performant options have fewer integration points.
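To make the flattening advice concrete, here is a minimal PySpark sketch; the input path and the nested field names (user, address) are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("flatten-example").getOrCreate()

# Hypothetical nested input, e.g. {"id": 1, "user": {"name": "a", "address": {"city": "x"}}}
df = spark.read.json("/data/raw/events/")

# Flatten the nested struct fields into top-level columns at ingestion time
flat = df.select(
    col("id"),
    col("user.name").alias("user_name"),
    col("user.address.city").alias("user_city"),
)

flat.write.mode("overwrite").parquet("/data/curated/events/")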

Big Data file formats

 

CSV

CSV files (comma-separated values) are usually used to exchange tabular data between systems using plain text. CSV is a row-based file format, which means that each line of the file corresponds to a row in the table. Typically, a CSV file includes a header row with column names; without it, files are only partially structured. CSV files cannot natively represent hierarchical or relational data, so relationships are usually modeled with multiple CSV files: foreign keys are stored in columns of one or more files, but the connections between these files are not expressed by the format itself. In addition, the CSV format is not fully standardized, and files may use separators other than commas, such as tabs or spaces.

Another property of CSV files is that they are only splittable as raw, uncompressed files or when a splittable compression format such as bzip2 or LZO is used (note: LZO needs to be indexed to be splittable).

CSV is a good option for compatibility, spreadsheet processing, and human-readable data. The data must be flat, as CSV is not efficient and cannot handle nested data. There may also be issues with the separator, which can lead to data quality problems. Use this format for exploratory analysis, POCs, or small data sets.
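A minimal PySpark sketch of defensive CSV handling; the path, separator, and column names are assumptions. Declaring the schema and separator explicitly avoids costly inference and silent type or delimiter mistakes, and writing with bzip2 keeps the compressed output splittable:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# Explicit schema: CSV itself has no column types
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

df = (spark.read
      .option("header", "true")
      .option("sep", ";")  # be explicit about non-comma separators
      .schema(schema)
      .csv("/data/raw/customers.csv"))

# bzip2 is a splittable compression codec, so the output stays parallelizable
df.write.option("compression", "bzip2").mode("overwrite").csv("/data/raw/customers_bz2")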

 

 

 Advantages

 CSV is human-readable and easy to edit manually;

 CSV provides a simple schema;

 CSV can be processed by almost all existing applications;

 CSV is easy to implement and parse;

 CSV is compact. For XML, you start a tag and end a tag for each column in each row. In CSV, the column headers are written only once.

 

Disadvantages

 CSV allows you to work with flat data. Complex data structures have to be processed separately from the format;

 No support for column types. No difference between text and numeric columns;

 There is no standard way to present binary data;

 Problems with CSV import (for example, there is no distinction between NULL and an empty quoted string);

 Poor support for special characters;

 Lack of a universal standard.

JSON

JSON (JavaScript Object Notation) data is represented as key-value pairs in a partially structured format. JSON is often compared to XML because it can store data in a hierarchical format. Both formats are human-readable, but JSON documents are typically much smaller than XML, so they are more commonly used in network communication, especially with the rise of REST-based web services.

Since so much data is already transmitted in JSON format, most web programming languages support JSON natively. With this huge support, JSON is used to represent data structures, as an exchange format for hot data, and for cold data storage.

Many streaming packages and modules support JSON serialization and deserialization. While the data contained in JSON documents can ultimately be stored in more performance-optimized formats such as Parquet or Avro, JSON often serves as the raw data layer, which is very important for reprocessing data when necessary.

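A common pattern that follows from this is to land raw JSON and persist a query-optimized copy in Parquet; a small sketch with hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Keep the raw JSON as the immutable source of truth for reprocessing...
raw = spark.read.json("/data/raw/events/")

# ...and store a performance-optimized copy for analytics
raw.write.mode("overwrite").parquet("/data/optimized/events/")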

Advantages

JSON supports hierarchical structures, simplifying the storage of related data in a single document and the representation of complex relationships;

 Most languages provide simplified JSON serialization libraries or built-in support for JSON serialization/deserialization;

 JSON supports lists of objects, helping to avoid chaotic list conversion to a relational data model;

 JSON is a widely used file format for NoSQL databases such as MongoDB, Couchbase, and Azure Cosmos DB;

 Built-in support in most modern tools;

 

Disadvantages

JSON consumes more memory due to repeated key names;

 Poor support for special characters;

 JSON is not very splittable;

 JSON lacks indexing;

 It is less compact compared to binary formats.

Parquet

Parquet: Columnar storage. It has schema support. It works very well with Hive and Spark as a way to store columnar data in deep storage that is queried using SQL. Because it stores data in columns, query engines read only the selected columns instead of the entire data set, as opposed to Avro. Use it as a reporting layer.

Unlike CSV and JSON, Parquet files are binary files that contain metadata about their contents. Therefore, without reading/parsing the contents of the file(s), Spark can simply rely on the metadata to determine column names, compression/encoding, data types, and even some basic statistics. The column metadata for a Parquet file is stored at the end of the file, which allows for fast, single-pass writing.

Parquet is optimized for the Write Once Read Many (WORM) paradigm. It writes slowly but reads incredibly quickly, especially when you only access a subset of columns, so it is a good choice for heavy workloads that read portions of the data. For cases where you need to work with whole rows of data, you should use a format like CSV or Avro.
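To illustrate the read-optimized behavior, a minimal PySpark sketch with hypothetical paths and column names; selecting a few columns and filtering lets Spark read only those column chunks (projection pushdown) and skip row groups using file statistics (predicate pushdown):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("parquet-read").getOrCreate()

# Only user_id and amount are read from disk, and row groups whose
# statistics exclude amount > 100 are skipped entirely
df = (spark.read.parquet("/data/optimized/sales/")
      .select("user_id", "amount")
      .filter(col("amount") > 100))

df.show()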

 

 

Advantages

Parquet is a columnar format. Only the required columns are retrieved/read, which reduces disk I/O. This concept is called projection pushdown.

 

The schema travels with the data, so the data is self-describing;

 

Although it was designed for HDFS, data can be stored on other file systems such as GlusterFS or NFS;

 

Parquet files are just files, which makes them easy to work with, move, back up, and replicate;

 

Built-in support in Spark makes it easy to simply read and save files to storage;

 

Parquet provides very good compression, up to 75%, even with fast compression codecs like Snappy;

 

In practice, this format is the fastest for read-heavy processes compared to other file formats;

 

Parquet is well suited for data storage solutions where aggregation on a particular column over a huge set of data is required;

 

Parquet can be read and written using the Avro API and Avro schema (which suggests storing all raw data in Avro format but all processed data in Parquet);

It also provides predicate pushdown, which further reduces the cost of transferring data from storage to the processing engine for filtering;

 

 

 

Disadvantages

 


 

 

 

The column-based design makes you think about the schema and data types;

 

 Parquet does not always have built-in support in tools other than Spark;

 

 It does not support data modification (Parquet files are immutable) or full schema evolution. Of course, Spark knows how to merge schemas if they change over time (you must specify a special option while reading, as shown in the sketch below), but you can only change something in an existing file by overwriting it.
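The special option mentioned above is mergeSchema; a small sketch, assuming two batches written over time with different columns under hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-schema").getOrCreate()

# Two batches whose schemas drifted over time
spark.createDataFrame([(1, "a")], ["id", "name"]) \
    .write.mode("overwrite").parquet("/data/events/batch1")
spark.createDataFrame([(2, "b", 3.0)], ["id", "name", "score"]) \
    .write.mode("overwrite").parquet("/data/events/batch2")

# mergeSchema reconciles the column sets of all files at read time
df = spark.read.option("mergeSchema", "true").parquet("/data/events/*")
df.printSchema()  # id, name, score (score is null for batch1 rows)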

 

 

 

Avro

 

Avro: Great for storing row data; very efficient. It has a schema and supports evolution. Great integration with Kafka. Supports file splitting. Use it for row-level operations or in Kafka. Great for writing data, slower to read.

 

Apache Avro was released by the Hadoop working group in 2009. It is a row-based format that is highly splittable. It is also described as a data serialization system similar to Java Serialization. The schema is stored in JSON format, while the data is stored in binary format, which minimizes file size and maximizes efficiency. Avro has reliable support for schema evolution by managing added, missing, and changed fields. This allows old software to read new data and new software to read old data, which is a critical feature if your data can change.

 

Avro's ability to manage schema evolution allows components to be updated independently, at different times, with a low risk of incompatibility. This eliminates the need for applications to write if-else statements to handle different schema versions, and it spares developers from digging through old code to understand an old schema. Since all versions of the schema are stored in a human-readable JSON header, it is easy to understand all the fields available to you.

Since the schema is stored in JSON and the data is stored in binary form, Avro is a relatively compact option for both persistent storage and wire transfer. Since Avro is a row-based format, it is the preferred format for handling large numbers of records, as it is easy to append new rows.
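A minimal sketch of writing and reading Avro from Spark; note that Avro support lives in the external spark-avro module, so the package must be supplied at launch (paths and the package version below are illustrative):

# Launch with the external Avro module, e.g.:
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.0 job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-example").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Row-oriented writes are cheap, which suits ingestion workloads
df.write.format("avro").mode("overwrite").save("/data/raw/users_avro")

users = spark.read.format("avro").load("/data/raw/users_avro")
users.show()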

 

Advantages

 

Avro is a language-neutral data serialization format.

 

Avro stores the schema in a file header, so the data is self-describing;

 

Easy and fast data serialization and deserialization, which can provide very good ingestion performance.

 

As with SequenceFiles, Avro files contain synchronization markers to separate blocks, which makes the format highly splittable.

 

Files formatted in Avro are splittable and compressible and are therefore good candidates for data storage in the Hadoop ecosystem.

 

The schema used to read Avro files does not necessarily have to be the same as the one used to write them. This allows new fields to be added independently, as the sketch below shows.
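A minimal sketch of that reader/writer schema separation using the Python fastavro library (the library choice and record shape are assumptions); the reader schema adds a new field with a default, so records written with the old schema remain readable:

import io
import fastavro

writer_schema = {
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "int"}],
}

# Reader adds a field with a default value, enabling schema evolution
reader_schema = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "email", "type": "string", "default": ""},
    ],
}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"id": 1})
buf.seek(0)

record = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(record)  # {'id': 1, 'email': ''}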

 

 

Disadvantages

 

 Makes you think about the schema and data types;

 

 Its data is not human-readable;

 

 Not integrated into every programming language.

 

ORC

 

ORC, short for Optimized Row Columnar, is a free and open-source columnar storage format designed for Hadoop workloads. As the name suggests, ORC is a self-describing, optimized file format that stores data in columns, enabling users to read and decompress just the pieces they need. It is a successor to the traditional Record Columnar File (RCFile) format, designed to overcome the limitations of other Hive file formats. It takes significantly less time to access data and can reduce the size of the data by up to 75 percent. ORC provides a more efficient way to store data accessed through SQL-on-Hadoop solutions such as Hive running on Tez. ORC provides many advantages over other Hive file formats, such as high data compression, faster performance, and predicate pushdown; moreover, the stored data is organized into stripes, which enable large, efficient reads from HDFS.
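Spark also supports ORC natively; a minimal read/write sketch with hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-example").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Spark compresses ORC with Snappy by default
df.write.mode("overwrite").orc("/data/warehouse/example_orc")

# As with Parquet, columnar reads benefit from projection and predicate pushdown
back = spark.read.orc("/data/warehouse/example_orc").select("id")
back.show()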

 

 

 
