---
name: Storm
route: /HookStorm
menu: Documentation
submenu: Hooks
---

import themen from 'theme/styles/styled-colors';
import * as theme from 'react-syntax-highlighter/dist/esm/styles/hljs';
import SyntaxHighlighter from 'react-syntax-highlighter';

# Apache Atlas Hook for Apache Storm

## Introduction

Apache Storm is a distributed real-time computation system. Storm makes it
easy to reliably process unbounded streams of data, doing for real-time
processing what Hadoop did for batch processing. A Storm process is
essentially a DAG of nodes, called a *topology*.

Apache Atlas is a metadata repository that enables end-to-end data lineage,
search, and the association of business classifications.

The goal of this integration is to push the operational topology
metadata, along with the underlying data source(s), target(s), derivation
processes, and any available business context, so Atlas can capture the
lineage for the topology.

There are two parts to this process, detailed below:
   * Data model to represent the concepts in Storm
   * Storm Atlas Hook to update metadata in Atlas


## Storm Data Model

A data model is represented as Types in Atlas. It contains the descriptions
of the various nodes in the topology graph, such as spouts and bolts, and the
corresponding producer and consumer types.

The following types are added in Atlas.

   * storm_topology - represents the coarse-grained topology. A storm_topology derives from the Atlas Process type and hence can be used to inform Atlas about lineage.
   * The following data sets are added - kafka_topic, jms_topic, hbase_table, hdfs_data_set. These all derive from the Atlas Dataset type and hence form the end points of a lineage graph.
   * storm_spout - data producer with outputs, typically Kafka or JMS
   * storm_bolt - data consumer with inputs and outputs, typically Hive, HBase, HDFS, etc.

The Storm Atlas hook auto-registers dependent models, like the Hive data model,
if it finds that these are not known to the Atlas server.

The data model for each of the types is described in
the class definition at org.apache.atlas.storm.model.StormDataModel.
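
Once the hook has registered these types, they can be inspected on a running
Atlas server. For instance, using the v2 typedef REST endpoint (a minimal
sketch assuming a default local Atlas instance on port 21000 with admin/admin
credentials):

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`# Fetch the storm_topology type definition from a local Atlas server.
curl -u admin:admin http://localhost:21000/api/atlas/v2/types/typedef/name/storm_topology`}
</SyntaxHighlighter>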

## Storm Atlas Hook

Atlas is notified when a new topology is registered successfully in
Storm. Storm provides a hook interface, backtype.storm.ISubmitterHook, which is
invoked on the Storm client used to submit a topology.

The Storm Atlas hook intercepts this callback after topology submission,
extracts the metadata from the topology, and updates Atlas using the types
defined above. Atlas implements the Storm client hook interface in
org.apache.atlas.storm.hook.StormAtlasHook.
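
For reference, the submitter hook interface looks roughly as follows (a sketch
of the backtype.storm API; the exact signature may vary across Storm versions):

<SyntaxHighlighter wrapLines={true} language="java" style={theme.dark}>
{`package backtype.storm;

import java.util.Map;
import backtype.storm.generated.StormTopology;
import backtype.storm.generated.TopologyInfo;

public interface ISubmitterHook {
    // Invoked by the Storm client after a topology is submitted successfully;
    // StormAtlasHook implements this to extract and publish topology metadata.
    void notify(TopologyInfo topologyInfo, Map stormConf, StormTopology topology)
            throws IllegalAccessException;
}`}
</SyntaxHighlighter>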


## Limitations

The following limitations apply to this first version of the integration.

   * Only new topology submissions are registered with Atlas; lifecycle changes are not reflected in Atlas.
   * The Atlas server needs to be online when a Storm topology is submitted for the metadata to be captured.
   * The Hook currently does not support capturing lineage for custom spouts and bolts.


73
## Installation

The Storm Atlas Hook needs to be manually installed in Storm on the client side, as sketched in the commands after this list:
   * Untar apache-atlas-${project.version}-storm-hook.tar.gz
   * cd apache-atlas-storm-hook-${project.version}
   * Copy the entire contents of the folder apache-atlas-storm-hook-${project.version}/hook/storm to $ATLAS_PACKAGE/hook/storm
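
A minimal sketch of the equivalent shell commands, assuming the tarball is in
the current directory and $ATLAS_PACKAGE points to the Atlas package location:

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`tar xzf apache-atlas-\${project.version}-storm-hook.tar.gz
cd apache-atlas-storm-hook-\${project.version}
cp -r hook/storm/* $ATLAS_PACKAGE/hook/storm/`}
</SyntaxHighlighter>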

The Storm Atlas hook jars in $ATLAS_PACKAGE/hook/storm need to be copied to
$STORM_HOME/extlib, where STORM_HOME is the Storm installation path.
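
For example, assuming both environment variables are set and the hook jars sit
directly under hook/storm:

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`cp $ATLAS_PACKAGE/hook/storm/*.jar $STORM_HOME/extlib/`}
</SyntaxHighlighter>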

Restart all Storm daemons after you have installed the Atlas hook into Storm.


## Configuration

### Storm Configuration

The Storm Atlas Hook needs to be configured in the Storm client configuration
in *$STORM_HOME/conf/storm.yaml* as:

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`storm.topology.submission.notifier.plugin.class: "org.apache.atlas.storm.hook.StormAtlasHook"`}
</SyntaxHighlighter>

Also set a cluster name that will be used as a namespace for objects registered in Atlas.
This name is used for namespacing the Storm topology, spouts, and bolts.

Other objects, like data sets, should ideally be identified with the cluster name of
the components that generate them. For example, Hive tables and databases should be
identified using the cluster name set in Hive. The Storm Atlas hook will pick this up
if the Hive configuration is available in the Storm topology jar that is submitted on
the client and the cluster name is defined there. This works similarly for HBase
data sets. If this configuration is not available, the cluster name set in the Storm
configuration is used.

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`atlas.cluster.name: "cluster_name"`}
</SyntaxHighlighter>

In *$STORM_HOME/conf/storm_env.ini*, set an environment variable as follows:

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`STORM_JAR_JVM_OPTS:"-Datlas.conf=$ATLAS_HOME/conf/"`}
</SyntaxHighlighter>

where ATLAS_HOME points to the Atlas installation directory.

You can also set this up programmatically in the Storm Config as:

<SyntaxHighlighter wrapLines={true} language="java" style={theme.dark}>
{`Config stormConf = new Config();
...
stormConf.put(Config.STORM_TOPOLOGY_SUBMISSION_NOTIFIER_PLUGIN,
        org.apache.atlas.storm.hook.StormAtlasHook.class.getName());`}
</SyntaxHighlighter>
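
Putting this together, a minimal sketch of a client that submits a topology
with the Atlas hook enabled (the class name, topology name, and topology wiring
are illustrative; package names follow the backtype.storm namespace used above):

<SyntaxHighlighter wrapLines={true} language="java" style={theme.dark}>
{`import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SubmitWithAtlasHook {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... wire up spouts and bolts on the builder ...

        Config stormConf = new Config();
        // Register the Atlas hook so topology metadata is pushed on submission.
        stormConf.put(Config.STORM_TOPOLOGY_SUBMISSION_NOTIFIER_PLUGIN,
                org.apache.atlas.storm.hook.StormAtlasHook.class.getName());

        StormSubmitter.submitTopology("my-topology", stormConf,
                builder.createTopology());
    }
}`}
</SyntaxHighlighter>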