Apache Oozie Workflow Scheduler for Hadoop
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
Oozie is a scalable,
reliable and extensible system.
What Oozie Does
Apache Oozie is a Java
Web application used to schedule Apache Hadoop jobs. Oozie combines multiple
jobs sequentially into one logical unit of work. It is integrated with the
Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs
for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie can also
schedule jobs specific to a system, like Java programs or shell scripts.
Apache Oozie is a tool
for Hadoop operations that allows cluster administrators to build complex data
transformations out of multiple component tasks. This provides greater control
over jobs and also makes it easier to repeat those jobs at predetermined intervals.
At its core, Oozie helps administrators derive more value from Hadoop.
There are two basic types of Oozie jobs:
Oozie Workflow jobs are Directed Acyclic Graphs (DAGs), specifying a sequence of actions to execute. An action in the sequence starts only once the actions it depends on have completed.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability.
In addition, an Oozie Bundle provides a way to package multiple Coordinator and Workflow jobs and to manage the lifecycle of those jobs.
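As an illustration, here is a minimal sketch of a bundle definition, assuming the uri:oozie:bundle:0.2 schema; the bundle name, coordinator names and paths are hypothetical:

<bundle-app name="demo-bundle" xmlns="uri:oozie:bundle:0.2">
  <!-- optional kick-off time for the whole bundle -->
  <controls>
    <kick-off-time>2014-01-01T00:00Z</kick-off-time>
  </controls>
  <!-- each coordinator entry points at a deployed coordinator application in HDFS -->
  <coordinator name="daily-ingest">
    <app-path>${nameNode}/user/${user.name}/coord-ingest</app-path>
  </coordinator>
  <coordinator name="daily-report">
    <app-path>${nameNode}/user/${user.name}/coord-report</app-path>
  </coordinator>
</bundle-app>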
How Oozie Works
An Oozie Workflow is a
collection of actions arranged in a Directed Acyclic Graph (DAG). Control nodes
define job chronology, setting rules for beginning and ending a workflow. In this
way, Oozie controls the workflow execution path with decision, fork and join
nodes. Action nodes trigger the execution of tasks.
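As a rough sketch of what such a definition looks like (the workflow name, action names and scripts are hypothetical, and the uri:oozie:workflow:0.4 schema is assumed), a workflow that forks two actions in parallel and then joins them might be written as:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="fork-node"/>
  <!-- control node: start two branches in parallel -->
  <fork name="fork-node">
    <path start="pig-node"/>
    <path start="shell-node"/>
  </fork>
  <!-- action node: runs a Pig script -->
  <action name="pig-node">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>transform.pig</script>
    </pig>
    <ok to="join-node"/>
    <error to="fail"/>
  </action>
  <!-- action node: runs a shell script -->
  <action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>cleanup.sh</exec>
      <file>cleanup.sh</file>
    </shell>
    <ok to="join-node"/>
    <error to="fail"/>
  </action>
  <!-- control node: wait for both branches to finish -->
  <join name="join-node" to="end"/>
  <kill name="fail">
    <message>Workflow failed at ${wf:lastErrorNode()}</message>
  </kill>
  <end name="end"/>
</workflow-app>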
Oozie triggers workflow actions, but Hadoop MapReduce executes them. This allows Oozie to leverage other capabilities within the Hadoop stack to balance loads and handle failures. Oozie detects completion of tasks through callback and polling. When Oozie starts a task, it provides a unique callback HTTP URL to the task, and the task notifies that URL when it completes. If the task fails to invoke the callback URL, Oozie can poll the task for completion. Often it is necessary to run Oozie workflows on regular time intervals, but in coordination with unpredictable levels of data availability or events. In these circumstances, Oozie Coordinator allows you to model workflow execution triggers in the form of data, time or event predicates. The workflow job is started after those predicates are satisfied. Oozie Coordinator can also manage multiple workflows that are dependent on the outcome of preceding workflows: the output of one workflow becomes the input to the next. Such a chain is called a “data application pipeline”.
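As a rough sketch (hypothetical names, dates and paths, assuming the uri:oozie:coordinator:0.4 schema), a coordinator that runs a workflow once a day, but only after that day's input data has arrived, might look like this, where ${workflowAppUri} would be supplied in the job properties file:

<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2014-01-01T00:00Z" end="2014-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- one dataset instance per day, laid out by date in HDFS -->
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2014-01-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <!-- data predicate: the current day's instance must exist before the workflow is started -->
  <input-events>
    <data-in name="input" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
    </workflow>
  </action>
</coordinator-app>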
Oozie runs as a service in the cluster, and clients submit workflow definitions for immediate or later processing. An Oozie workflow consists of action nodes and control-flow nodes. An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig or Hive job, importing data using Sqoop, or running a shell script or a program written in Java. A control-flow node controls the workflow execution between actions by allowing constructs like conditional logic, wherein different branches may be followed depending on the result of an earlier action node.
The Start Node, End Node and Error Node fall under this category of nodes.
The Start Node designates the start of the workflow job.
The End Node signals the end of the job.
The Error Node designates the occurrence of an error and the corresponding error message to be printed.
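In the workflow XML this error node is written as a kill element; a minimal sketch (the node name and message text are hypothetical):

<kill name="fail">
  <!-- the EL expression resolves the failing node's error message at runtime -->
  <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
</kill>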
At the end of a workflow's execution, an HTTP callback is used by Oozie to update the client with the workflow status. Entry to or exit from an action node may also trigger a callback.
Packaging and deploying an Oozie workflow application
A workflow application consists of the workflow definition and all the associated resources, such as MapReduce JAR files, Pig scripts, etc. Applications need to follow a simple directory structure and are deployed to HDFS so that Oozie can access them.
An example directory structure is shown below:
<name of workflow>/
├── lib/
│   └── hadoop-examples.jar
└── workflow.xml
It is necessary to keep workflow.xml (the workflow definition file) in the top-level directory (the parent directory with the workflow name). The lib directory contains the JAR files with the MapReduce classes. A workflow application conforming to this layout can be built with any build tool, e.g., Ant or Maven.
Such a build needs to be copied to HDFS using a command like the following:
$ hadoop fs -put hadoop-examples/target/<name of workflow dir> <name of workflow>
Steps for Running an Oozie workflow job
To run this, we will use the Oozie command-line tool (a client program which communicates with the Oozie server).
1. Export the OOZIE_URL environment variable, which tells the oozie command which Oozie server to use (here we're using one running locally):
$ export OOZIE_URL="http://localhost:11000/oozie"
2. Run the workflow job using:
$ oozie job -config ch05/src/main/resources/max-temp-workflow.properties -run
The -config option refers
to a local Java properties file containing definitions for the parameters in
the workflow XML file, as well as oozie.wf.application.path, which tells Oozie
the location of the workflow application in HDFS.
Example contents of the properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/<name of workflow>
3. Get the status of the workflow job:
The status of a workflow job can be seen using the 'job' subcommand with the '-info' option, specifying the job id after '-info', e.g.:
$ oozie job -info <job id>
The output shows the status, which is one of RUNNING, KILLED or SUCCEEDED.
4. The results of a successful workflow execution can be seen using a Hadoop command like:
$ hadoop fs -cat <location of result>
Why use Oozie?
The main purpose of using Oozie is to manage different types of jobs being processed in the Hadoop system.
Dependencies between jobs are specified by the user in the form of Directed Acyclic Graphs. Oozie consumes this information and takes care of their execution in the correct order as specified in the workflow. That way the user's time to manage the complete workflow is saved. In addition, Oozie has a provision to specify the frequency of execution of a particular job.
Features of Oozie
Oozie has a client API and a command-line interface which can be used to launch, control and monitor jobs from a Java application.
Using its Web Service APIs, one can control jobs from anywhere.
Oozie has a provision to execute jobs which are scheduled to run periodically.
Oozie has a provision to send email notifications upon completion of jobs.
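For example, a notification can be sent from within a workflow using Oozie's email action; a minimal sketch (the addresses and node names are hypothetical), assuming the uri:oozie:email-action:0.1 schema:

<action name="notify">
  <email xmlns="uri:oozie:email-action:0.1">
    <to>admin@example.com</to>
    <subject>Oozie workflow ${wf:id()} finished</subject>
    <body>The workflow ${wf:id()} has completed.</body>
  </email>
  <ok to="end"/>
  <error to="fail"/>
</action>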