1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
---+ Hive Atlas Bridge
Hive metadata can be modelled in Atlas using its Type System. The default modelling is available in org.apache.atlas.hive.model.HiveDataModelGenerator. It defines the following types:
* hive_resource_type(EnumType) - [JAR, FILE, ARCHIVE]
* hive_principal_type(EnumType) - [USER, ROLE, GROUP]
* hive_function_type(EnumType) - [JAVA]
* hive_order(StructType) - [col, order]
* hive_resourceuri(StructType) - [resourceType, uri]
* hive_serde(StructType) - [name, serializationLib, parameters]
* hive_process(ClassType) - [name, startTime, endTime, userName, sourceTableNames, targetTableNames, queryText, queryPlan, queryId, queryGraph]
* hive_function(ClassType) - [functionName, dbName, className, ownerName, ownerType, createTime, functionType, resourceUris]
* hive_type(ClassType) - [name, type1, type2, fields]
* hive_partition(ClassType) - [values, dbName, tableName, createTime, lastAccessTime, sd, parameters]
* hive_storagedesc(ClassType) - [cols, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories]
* hive_index(ClassType) - [indexName, indexHandlerClass, dbName, createTime, lastAccessTime, origTableName, indexTableName, sd, parameters, deferredRebuild]
* hive_role(ClassType) - [roleName, createTime, ownerName]
* hive_column(ClassType) - [name, type, comment]
* hive_db(ClassType) - [name, description, locationUri, parameters, ownerName, ownerType]
* hive_table(ClassType) - [name, dbName, owner, createTime, lastAccessTime, retention, sd, partitionKeys, columns, parameters, viewOriginalText, viewExpandedText, tableType, temporary]
---++ Importing Hive Metadata
org.apache.atlas.hive.bridge.HiveMetaStoreBridge imports the hive metadata into Atlas using the typesystem defined in org.apache.atlas.hive.model.HiveDataModelGenerator. import-hive.sh command can be used to facilitate this.
Set-up the following configs in hive-site.xml of your hive set-up and set environment variable HIVE_CONFIG to the
hive conf directory:
* Atlas endpoint - Add the following property with the Atlas endpoint for your set-up
<verbatim>
<property>
<name>atlas.rest.address</name>
<value>http://localhost:21000/</value>
</property>
<property>
<name>atlas.cluster.name</name>
<value>primary</value>
</property>
</verbatim>
Usage: <dgi package>/bin/import-hive.sh. The logs are in <dgi package>/logs/import-hive.log
---++ Hive Hook
Hive supports listeners on hive command execution using hive hooks. This is used to add/update/remove entities in Atlas using the model defined in org.apache.atlas.hive.model.HiveDataModelGenerator.
The hook submits the request to a thread pool executor to avoid blocking the command execution. Follow the these instructions in your hive set-up to add hive hook for Atlas:
* Add org.apache.atlas.hive.hook.HiveHook as post execution hook in hive-site.xml
<verbatim>
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
</verbatim>
* Add the following properties in hive-ste.xml with the Atlas endpoint for your set-up
<verbatim>
<property>
<name>atlas.rest.address</name>
<value>http://localhost:21000/</value>
</property>
<property>
<name>atlas.cluster.name</name>
<value>primary</value>
</property>
</verbatim>
* Add 'export HIVE_AUX_JARS_PATH=<dgi package>/hook/hive' in hive-env.sh
The following properties in hive-site.xml control the thread pool details:
* atlas.hook.hive.minThreads - core number of threads. default 5
* atlas.hook.hive.maxThreads - maximum number of threads. default 5
* atlas.hook.hive.keepAliveTime - keep alive time in msecs. default 10
* atlas.hook.hive.synchronous - boolean, true to run the hook synchronously. default false
---++ Limitations
* Since database name, table name and column names are case insensitive in hive, the corresponding names in entities are lowercase. So, any search APIs should use lowercase while querying on the entity names
* Only the following hive operations are captured by hive hook currently - create database, create table, create view, CTAS, load, import, export, query, alter table rename and alter view rename