1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
---+ Hive DGI Bridge
Hive metadata can be modelled in DGI using its Type System. The default modelling is available in org.apache.hadoop.metadata.hive.model.HiveDataModelGenerator. It defines the following types:
* hive_object_type(EnumType) - [GLOBAL, DATABASE, TABLE, PARTITION, COLUMN]
* hive_resource_type(EnumType) - [JAR, FILE, ARCHIVE]
* hive_principal_type(EnumType) - [USER, ROLE, GROUP]
* hive_function_type(EnumType) - [JAVA]
* hive_order(StructType) - [col, order]
* hive_resourceuri(StructType) - [resourceType, uri]
* hive_serde(StructType) - [name, serializationLib, parameters]
* hive_process(ClassType) - [name, startTime, endTime, userName, sourceTableNames, targetTableNames, queryText, queryPlan, queryId, queryGraph]
* hive_function(ClassType) - [functionName, dbName, className, ownerName, ownerType, createTime, functionType, resourceUris]
* hive_type(ClassType) - [name, type1, type2, fields]
* hive_partition(ClassType) - [values, dbName, tableName, createTime, lastAccessTime, sd, parameters]
* hive_storagedesc(ClassType) - [cols, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories]
* hive_index(ClassType) - [indexName, indexHandlerClass, dbName, createTime, lastAccessTime, origTableName, indexTableName, sd, parameters, deferredRebuild]
* hive_role(ClassType) - [roleName, createTime, ownerName]
* hive_column(ClassType) - [name, type, comment]
* hive_db(ClassType) - [name, description, locationUri, parameters, ownerName, ownerType]
* hive_table(ClassType) - [name, dbName, owner, createTime, lastAccessTime, retention, sd, partitionKeys, columns, parameters, viewOriginalText, viewExpandedText, tableType, temporary]
---++ Importing Hive Metadata
org.apache.hadoop.metadata.hive.bridge.HiveMetaStoreBridge imports the hive metadata into DGI using the typesystem defined in org.apache.hadoop.metadata.hive.model.HiveDataModelGenerator. import-hive.sh command can be used to facilitate this.
Set-up the following configs in hive-site.xml of your hive set-up and set environment variable HIVE_CONFIG to the
hive conf directory:
* DGI endpoint - Add the following property with the DGI endpoint for your set-up
<verbatim>
<property>
<name>hive.hook.dgi.url</name>
<value>http://localhost:21000/</value>
</property>
<property>
<name>hive.cluster.name</name>
<value>primary</value>
</property>
</verbatim>
Usage: <dgi package>/bin/import-hive.sh. The logs are in <dgi package>/logs/import-hive.log
---++ Hive Hook
Hive supports listeners on hive command execution using hive hooks. This is used to add/update/remove entities in DGI using the model defined in org.apache.hadoop.metadata.hive.model.HiveDataModelGenerator.
The hook submits the request to a thread pool executor to avoid blocking the command execution. Follow the these instructions in your hive set-up to add hive hook for DGI:
* Add org.apache.hadoop.metadata.hive.hook.HiveHook as post execution hook in hive-site.xml
<verbatim>
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.hadoop.metadata.hive.hook.HiveHook</value>
</property>
</verbatim>
* Add the following properties in hive-ste.xml with the DGI endpoint for your set-up
<verbatim>
<property>
<name>hive.hook.dgi.url</name>
<value>http://localhost:21000/</value>
</property>
<property>
<name>hive.cluster.name</name>
<value>primary</value>
</property>
</verbatim>
* Add 'export HIVE_AUX_JARS_PATH=<dgi package>/hook/hive' in hive-env.sh
The following properties in hive-site.xml control the thread pool details:
* hive.hook.dgi.minThreads - core number of threads. default 5
* hive.hook.dgi.maxThreads - maximum number of threads. default 5
* hive.hook.dgi.keepAliveTime - keep alive time in msecs. default 10
* hive.hook.dgi.synchronous - boolean, true to run the hook synchronously. default false