Advantages of using appropriate file formats:
Faster read
Faster write
Support for splittable files
Schema evolution can be supported
Advanced compression can be achieved
Some things to consider when choosing the
format are:
The structure of your data: Some formats accept nested data, such
as JSON, Avro or Parquet, and others do not. Even the ones that do may not be
highly optimized for it. Avro is the most efficient format for nested data; I
recommend not using Parquet nested types because they are very inefficient. Processing
nested JSON is also very CPU intensive. In general, it is recommended to flatten
the data when ingesting it (see the sketch after this list).
Performance: Some formats such as Avro and Parquet perform
better than others such as JSON. Even between Avro and Parquet, one will be better
than the other depending on the use case. For example, since Parquet is a column-based
format, it is great for querying your data lake using SQL, whereas Avro is better
for row-level ETL transformations.
Easy to read: Consider whether you need people to read the data or
not. JSON and CSV are text formats and are human readable, whereas more
performant formats such as Parquet or Avro are binary.
Compression: Some formats offer higher compression rates than
others.
Schema evolution: Adding or removing fields is far more
complicated in a data lake than in a database. Some formats like Avro or
Parquet provide some degree of schema evolution, which allows you to change the
data schema and still query the data. Tools such as the Delta Lake format
provide even better ways to deal with schema changes.
Compatibility: JSON or CSV are widely adopted and compatible
with almost any tool, while more performant options have fewer integration points.
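As a rough illustration of the flattening recommendation in the first point, here is a minimal PySpark sketch; the paths and nested column names (user, orders) are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-example").getOrCreate()

# Hypothetical nested JSON with a struct column `user` and an array column `orders`
raw = spark.read.json("s3://my-bucket/raw/events/*.json")

# Flatten: promote nested struct fields to top-level columns and explode the array
flat = (
    raw
    .withColumn("order", explode(col("orders")))
    .select(
        col("user.id").alias("user_id"),
        col("user.address.city").alias("city"),
        col("order.amount").alias("order_amount"),
    )
)

# Store the flattened data in a columnar format for analytics
flat.write.mode("overwrite").parquet("s3://my-bucket/curated/orders/")
```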
Big Data file formats
CSV
CSV files
(comma-separated values) are usually used to exchange tabular data between
systems using plain text. CSV is a row-based file format, which means that each
row of the file is a row in the table. A CSV file typically contains a header row
with column names for the data; otherwise, the files are considered only
partially structured. CSV files cannot natively represent hierarchical or
relational data. Data connections are usually established using multiple CSV
files: foreign keys are stored in columns of one or more files, but the connections
between these files are not expressed by the format itself. In addition, the
CSV format is not fully standardized, and files may use separators other than
commas, such as tabs or spaces.
Another property of CSV files is that they are only splittable when they are raw,
uncompressed files or when a splittable compression format such as bzip2 or LZO is used (note: LZO needs
to be indexed to be splittable).
CSV is a good option for compatibility, spreadsheet processing and
human-readable data. The data must be flat: CSV is not efficient and cannot
handle nested data. There may be issues with the separator, which can lead to
data quality problems. Use this format for exploratory analysis, POCs or small
data sets.
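The separator and header concerns above can be handled explicitly when reading. A minimal PySpark sketch, with an illustrative path and options:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# Read a tab-separated file with a header row; inferSchema guesses column types
df = (
    spark.read
    .option("header", "true")       # first row contains column names
    .option("sep", "\t")            # separator other than comma
    .option("inferSchema", "true")  # CSV has no column types, so infer them
    .csv("s3://my-bucket/exports/customers.tsv")
)

df.printSchema()
```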
Advantages
CSV is human-readable and
easy to edit manually;
CSV provides a simple schema;
CSV can be processed by almost all existing
applications;
CSV is easy to implement and parse;
CSV is compact. For XML, you start a tag and
end a tag for each column in each row. In CSV, the column headers are written
only once.
Disadvantages
CSV only supports flat data;
complex data structures have to be handled separately from
the format;
No support for column
types. No difference between text and numeric columns;
There is no standard way
to present binary data;
Problems with CSV import
(for example, no way to distinguish NULL from an empty quoted string);
Poor support for special
characters;
Lack of a universal
standard.
JSON
JSON (JavaScript object notation) data are
presented as key-value pairs in a partially structured format. JSON is often
compared to XML because it can store data in a hierarchical format. Both
formats are user-readable, but JSON documents are typically much smaller than
XML. They are therefore more commonly used in network communication, especially
with the rise of REST-based web services.
Since much data is
already transmitted in JSON format, most web languages support JSON natively.
With this broad support, JSON is used to represent data structures, as an exchange
format for hot data, and for cold data storage.
Many streaming
packages and modules support JSON serialization and deserialization. While the
data contained in JSON documents can ultimately be stored in more
performance-optimized formats such as Parquet or Avro, it serves as the raw data,
which is important when the data needs to be reprocessed.
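As a small sketch of that raw-to-optimized flow (paths are hypothetical), raw JSON can be ingested as-is and a performance-optimized copy written in Parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Raw JSON as delivered by an upstream service (hypothetical path)
raw = spark.read.json("s3://my-bucket/raw/clicks/2023-01-01/*.json")

# Keep the JSON as the immutable raw layer and write an optimized copy
raw.write.mode("overwrite").parquet("s3://my-bucket/optimized/clicks/date=2023-01-01/")
```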
Advantages
JSON supports hierarchical structures,
simplifying the storage of related data in a single document and presenting
complex relationships;
Most languages provide simplified JSON
serialization libraries or built-in support for JSON
serialization/deserialization;
JSON supports lists of objects, helping to
avoid chaotic list conversion to a relational data model;
JSON is a widely used file format for NoSQL
databases such as MongoDB, Couchbase and Azure Cosmos DB;
Built-in support in most modern tools;
Disadvantages
JSON consumes more memory due to repeatable
column names;
Poor support for special characters;
JSON is not very splittable;
JSON lacks indexing;
It is less compact compared to binary
formats.
Parquet
Parquet: Columnar storage with schema support. It works very well
with Hive and Spark as a way to store columnar data in deep storage that is
queried using SQL. Because it stores data in columns, query engines read only
the selected columns rather than the entire data set, as
opposed to Avro. Use it as a reporting layer.
Unlike CSV and JSON, Parquet files are binary
files that contain metadata about their contents. Therefore, without
reading/parsing the contents of the file(s), Spark can simply rely on metadata
to determine column names, compression/encoding, data types, and even some
basic statistical characteristics. Column metadata for a Parquet file is stored
at the end of the file, which allows for fast, single-pass writing.
Parquet is optimized for the Write
Once Read Many (WORM) paradigm. It writes slowly but reads incredibly quickly,
especially when you only access a subset of columns. Parquet is a good choice for
read-heavy workloads that only touch portions of the data. For cases where you need to work
with whole rows of data, you should use a format like CSV or Avro.
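A minimal PySpark sketch of that read pattern (path and column names are hypothetical): only the selected columns are scanned, and the filter can be pushed down to the Parquet metadata.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("parquet-read").getOrCreate()

# Only `customer_id` and `amount` are read from disk (projection pushdown);
# the filter on `amount` can be evaluated against Parquet footer statistics
# (predicate pushdown), so irrelevant row groups are skipped.
sales = (
    spark.read.parquet("s3://my-bucket/curated/sales/")
    .select("customer_id", "amount")
    .filter(col("amount") > 100)
)

sales.groupBy("customer_id").sum("amount").show()
```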
Advantages
Parquet is a columnar format. Only the
required columns will be retrieved/read, this reduces disk I/O. The concept is
called projection pushdown.
The schema travels with the data, so the data
is self-describing;
Although it is designed for HDFS, data can be
stored on other file systems such as GlusterFS or NFS;
Parquet files are just files, which means it is easy to
work with, move, back up and replicate them;
Built-in support in Spark makes it easy to
simply take and save a file in storage;
Parquet provides very good compression, up to
75%, even when using compression formats like Snappy;
As practice shows, this format is the fastest
for read-heavy processes compared to other file formats;
Parquet is well suited for data storage
solutions where aggregation on a particular column over a huge set of data is
required;
Parquet can be read and written using the Avro
API and Avro schema (which suggests storing all raw data in the Avro
format, but all processed data in Parquet);
Parquet also provides predicate pushdown, which further reduces
the cost of transferring data from storage to the processing
engine for filtering.
Disadvantages
The column-based design makes you think about
the schema and data types;
Parquet
does not always have built-in support in tools other than Spark;
It does
not support data modification (Parquet files are immutable) or schema
evolution. Of course, Spark knows how to merge the schema if it changes
over time (you must specify a special option while reading; see the sketch below), but you can only
change something in an existing file by overwriting it.
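A minimal sketch of that reading option (the path is hypothetical): with mergeSchema enabled, Spark reconciles Parquet files written with different schema versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-merge-schema").getOrCreate()

# Files under this path were written at different times with slightly
# different schemas (e.g. a column was added later). With mergeSchema
# enabled, Spark reconciles the footers into a single schema; columns
# missing from older files are read as null.
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("s3://my-bucket/curated/events/")
)

df.printSchema()
```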
Avro
Avro: Great for
storing row data, very efficient. It has a schema and supports schema evolution. Great
integration with Kafka. Supports file splitting. Use it for row-level
operations or in Kafka. Great for writing data, slower to read.
Apache Avro was
released by the Hadoop working group in 2009. It is a row-based format that is
highly splittable. It is also described as a data serialization system
similar to Java Serialization. The schema is stored in JSON format, while the
data is stored in binary format, which minimizes file size and maximizes
efficiency. Avro has reliable support for schema evolution, managing added,
missing, and changed fields. This allows old software to read new data and new
software to read old data, which is a critical feature if your data can change.
Avro's ability to
manage schema evolution allows components to be updated independently, at
different times, with a low risk of incompatibility. This eliminates the need
for applications to write if-else statements to handle different versions of the
schema, and removes the need for the developer to look at old code to
understand the old schema. Since the schema is stored in a
human-readable JSON header in each file, it is easy to understand all the fields available
to you.
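A minimal sketch of resolving a newer reader schema against older data, using the fastavro library; the record schema, field names and default value are illustrative:

```python
import fastavro

# Writer schema: the schema the data was originally produced with
writer_schema = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# Reader schema: adds a field with a default, so old files remain readable
reader_schema = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
})

with open("users.avro", "wb") as out:
    fastavro.writer(out, writer_schema, [{"id": 1, "name": "Ana"}])

with open("users.avro", "rb") as fo:
    for record in fastavro.reader(fo, reader_schema=reader_schema):
        print(record)  # {'id': 1, 'name': 'Ana', 'country': 'unknown'}
```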
Since the schema is
stored in JSON and the data in binary form, Avro is a relatively
compact option for both persistent storage and wire transfer. As a
row-based format, Avro is the preferred choice for handling large numbers of
records, since it is easy to append new rows.
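A minimal sketch of writing and reading Avro from Spark; this assumes the external spark-avro package is available on the classpath, and the paths and package version shown are illustrative:

```python
from pyspark.sql import SparkSession

# Requires the spark-avro package, e.g. launched with
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
spark = SparkSession.builder.appName("avro-example").getOrCreate()

events = spark.createDataFrame(
    [(1, "click"), (2, "purchase")],
    ["user_id", "event_type"],
)

# Row-based Avro is a good fit for append-heavy ingestion
events.write.format("avro").mode("append").save("s3://my-bucket/raw/events_avro/")

# Read it back; the schema travels with the files
spark.read.format("avro").load("s3://my-bucket/raw/events_avro/").show()
```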
Advantages
Avro is a language-neutral data serialization
format.
Avro stores the schema
in a file header, so the data is self-describing;
Easy and fast data serialization and
deserialization, which can provide
very good ingestion
performance.
As with Sequence files, Avro files
also contain synchronization markers to separate blocks. This makes them highly splittable.
Files formatted in Avro are splittable
and compressible and are therefore good candidates for data storage
in the Hadoop ecosystem.
The schema used to read Avro files does not
necessarily have to be the same as the one used to write the files. This allows
new fields to be added independently of each other.
Disadvantages
Makes you think about the schema and data
types;
Its data is not human-readable;
Not integrated into every programming
language.
ORC
ORC, short for
Optimized Row Columnar, is a free and open-source columnar storage format
designed for Hadoop workloads. As the name suggests, ORC is a self-describing,
optimized file format that stores data in columns which enables users to read
and decompress just the pieces they need. It is a successor to the traditional
Record Columnar File (RCFile) format designed to overcome limitations of other
Hive file formats. It takes significantly less time to access data and reduces
the size of the data by up to 75 percent. ORC provides a more efficient
way to store data that is accessed through SQL-on-Hadoop solutions
such as Hive on Tez. ORC provides many advantages over other Hive file
formats, such as high data compression, faster performance and predicate pushdown;
moreover, the stored data is organized into stripes, which enables
large, efficient reads from HDFS.
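A minimal sketch of writing and reading ORC from Spark (paths are hypothetical); Hive tables can equivalently be declared with STORED AS ORC:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 120.0), (2, "bob", 75.5)],
    ["id", "name", "amount"],
)

# Write ORC; like Parquet, only the selected columns are read back
df.write.mode("overwrite").orc("s3://my-bucket/curated/payments_orc/")

spark.read.orc("s3://my-bucket/curated/payments_orc/").select("id", "amount").show()
```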