External components like hive/sqoop/storm/falcon should model their taxonomy using the typesystem and register the types with Atlas. For every entity created in these external components, the corresponding entity should be registered in Atlas as well.
This is typically done in a hook, which runs in the external component and is called for every entity operation. The hook generally processes the entity asynchronously using a thread pool, to avoid adding latency to the main operation.
The hook can then build the entity and register it using the Atlas REST APIs. However, any failure in these APIs, because of network issues etc., can result in the entity not being registered in Atlas, and hence in inconsistent metadata.
Atlas also exposes a notification interface that hooks can use for reliable entity registration. The hook can send a notification message containing the list of entities to be registered. The Atlas service contains a hook consumer that listens to these messages and registers the entities.
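For illustration, the sketch below shows the hook-side send as a plain Kafka producer. The bootstrap address and the JSON payload are assumptions for this sketch, not the actual Atlas message schema; ATLAS_HOOK is the notification topic the Atlas service listens on:
<verbatim>
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HookNotificationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9027"); // assumed address of the embedded Kafka server
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Placeholder payload: the real hook serializes the entity definitions
        // into the message format the Atlas hook consumer understands.
        String message = "{\"entities\": [\"...entity JSON...\"]}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish to the ATLAS_HOOK topic for the Atlas service to consume
            producer.send(new ProducerRecord<String, String>("ATLAS_HOOK", message));
        }
    }
}
</verbatim>
Because the message stays in the notification system until Atlas consumes it, a transient Atlas outage does not lose the registration, unlike the direct REST call described above.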
Available bridges are:
* [[hive/Bridge-Hive][Hive Bridge]]
---++ Notification
Notification is used for reliable entity registration from hooks and for entity/type change notifications. Atlas, by default, provides Kafka integration, but it is possible to provide other implementations as well. The Atlas service starts an embedded Kafka server by default.
Atlas also provides NotificationHookConsumer, which runs in the Atlas service, listens to messages from the hooks and registers the entities in Atlas.
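Conceptually, the consumer side looks like the following sketch, a plain Kafka consumer loop standing in for NotificationHookConsumer. The group id and bootstrap address are assumptions, and the real consumer registers the deserialized entities instead of printing them:
<verbatim>
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class HookConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9027"); // assumed address of the embedded Kafka server
        props.put("group.id", "atlas");                   // assumed consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("ATLAS_HOOK"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The real consumer would deserialize the entities from the
                    // message and register them in the metadata repository.
                    System.out.println("entity message: " + record.value());
                }
            }
        }
    }
}
</verbatim>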
Hive metadata can be modelled in Atlas using its Type System. The default model is available in org.apache.atlas.hive.model.HiveDataModelGenerator, which defines types such as hive_db, hive_table, hive_column and hive_process.
org.apache.atlas.hive.bridge.HiveMetaStoreBridge imports the hive metadata into Atlas using the model defined in org.apache.atlas.hive.model.HiveDataModelGenerator. import-hive.sh command can be used to facilitate this.
Set-up the following configs in hive-site.xml of your hive set-up and set environment variable HIVE_CONFIG to the hive conf directory:
* Atlas endpoint - Add the following property with the Atlas endpoint for your set-up
<verbatim>
<property>
...
</property>
</verbatim>
Usage: <atlas package>/bin/hive/import-hive.sh. The logs are in <atlas package>/logs/import-hive.log
---++ Hive Hook
Hive supports listeners on hive command execution using hive hooks. This is used to add/update/remove entities in Atlas using the model defined in org.apache.atlas.hive.model.HiveDataModelGenerator.
The hook submits the request to a thread pool executor to avoid blocking the command execution. The thread submits the entities as a message to the notification server, and the Atlas server reads these messages and registers the entities.
Follow these instructions in your hive set-up to add the hive hook for Atlas:
* Set-up the Atlas hook in hive-site.xml, and add the following properties with the Atlas endpoint for your set-up:
<verbatim>
<property>
<name>atlas.rest.address</name>
...
<value>primary</value>
</property>
</verbatim>
* Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' in hive-env.sh
* Copy <atlas package>/conf/application.properties to hive conf directory <hive package>/conf
The following properties in hive-site.xml control the thread pool and notification details (see the sketch after this list):
* atlas.hook.hive.synchronous - boolean, true to run the hook synchronously. default false
* atlas.hook.hive.numRetries - number of retries for notification failure. default 3
* atlas.hook.hive.minThreads - core number of threads. default 5
* atlas.hook.hive.maxThreads - maximum number of threads. default 5
* atlas.hook.hive.keepAliveTime - keep alive time in msecs. default 10
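A minimal sketch of how these settings map onto a standard Java thread pool; the wiring is illustrative, as the actual hook reads the values from the hive configuration:
<verbatim>
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class HookExecutorSketch {
    public static void main(String[] args) {
        // Defaults from the list above: minThreads=5, maxThreads=5, keepAliveTime=10 msecs
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                5,                          // atlas.hook.hive.minThreads
                5,                          // atlas.hook.hive.maxThreads
                10, TimeUnit.MILLISECONDS,  // atlas.hook.hive.keepAliveTime
                new LinkedBlockingQueue<Runnable>());

        // The hook hands the notification work to the pool so the hive command
        // returns immediately; atlas.hook.hive.synchronous=true would instead
        // run it inline before the command completes.
        executor.submit(() -> System.out.println("send entity notification"));
        executor.shutdown();
    }
}
</verbatim>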
Refer to [[Configuration][Configuration]] for notification related configurations.
---++ Limitations
* Since database names, table names and column names are case insensitive in hive, the corresponding names in entities are lowercase. So, any search APIs should use lowercase while querying on the entity names (see the sketch after this list)
* Only the following hive operations are captured by hive hook currently - create database, create table, create view, CTAS, load, import, export, query, alter table rename and alter view rename
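For example, a client querying on names should lowercase them first. A small sketch, assuming the hive_table type name and the Atlas DSL query syntax:
<verbatim>
public class LowercaseNameSketch {
    public static void main(String[] args) {
        // Hive treats "Customer_Orders" and "customer_orders" as the same table,
        // and the hook stores the lowercase form in Atlas.
        String tableName = "Customer_Orders";
        String dslQuery = "hive_table where name = '" + tableName.toLowerCase() + "'";
        System.out.println(dslQuery); // hive_table where name = 'customer_orders'
    }
}
</verbatim>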