---+ Hive Atlas Bridge

---++ Hive Model
The default hive model includes the following types:
   * Entity types:
      * hive_db
         * super-types: Referenceable
         * attributes: name, clusterName, description, locationUri, parameters, ownerName, ownerType
      * hive_storagedesc
         * super-types: Referenceable
         * attributes: cols, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories
      * hive_column
         * super-types: Referenceable
         * attributes: name, type, comment, table
      * hive_table
         * super-types: !DataSet
         * attributes: name, db, owner, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary
      * hive_process
         * super-types: Process
         * attributes: name, startTime, endTime, userName, operationType, queryText, queryPlan, queryId
      * hive_column_lineage
         * super-types: Process
         * attributes: query, dependencyType, expression

   * Enum types:
      * hive_principal_type
         * values: USER, ROLE, GROUP

   * Struct types:
      * hive_order
         * attributes: col, order
      * hive_serde
         * attributes: name, serializationLib, parameters

The entities are created and de-duplicated using their unique qualifiedName, which provides a namespace and can be used for querying/lineage as well. Note that dbName, tableName and columnName should be in lower case. clusterName is explained below. An example follows the list.
   * hive_db.qualifiedName     - <dbName>@<clusterName>
   * hive_table.qualifiedName  - <dbName>.<tableName>@<clusterName>
   * hive_column.qualifiedName - <dbName>.<tableName>.<columnName>@<clusterName>
   * hive_process.queryString  - trimmed query string in lower case
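
For example, for a database named sales with a table named orders and a column named id in a cluster named primary (hypothetical names), the qualified names would be:
<verbatim>
hive_db.qualifiedName     : sales@primary
hive_table.qualifiedName  : sales.orders@primary
hive_column.qualifiedName : sales.orders.id@primary</verbatim>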


---++ Importing Hive Metadata
org.apache.atlas.hive.bridge.HiveMetaStoreBridge imports the Hive metadata into Atlas using the model defined above. The import-hive.sh command can be used to facilitate this.
    <verbatim>
    Usage: <atlas package>/hook-bin/import-hive.sh</verbatim>

The logs are in <atlas package>/logs/import-hive.log
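
A minimal sketch of a typical run, assuming Atlas is installed under /opt/atlas (a hypothetical location):
<verbatim>
cd /opt/atlas/hook-bin
./import-hive.sh

# inspect the import log once the script completes
tail /opt/atlas/logs/import-hive.log</verbatim>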


---++ Hive Hook
Atlas Hive hook registers with Hive to listen for create/update/delete operations and, via Kafka notifications, updates the metadata in Atlas for the changes in Hive.
Follow the instructions below to setup Atlas hook in Hive:
   * Set-up Atlas hook in hive-site.xml by adding the following:
  <verbatim>
    <property>
      <name>hive.exec.post.hooks</name>
      <value>org.apache.atlas.hive.hook.HiveHook</value>
    </property></verbatim>
   * Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' in hive-env.sh of your Hive configuration (see the sketch after this list)
   * Copy <atlas-conf>/atlas-application.properties to the hive conf directory.
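
As a concrete illustration, assuming Atlas is installed under /opt/atlas (a hypothetical location), the hive-env.sh entry would be:
<verbatim>
# hive-env.sh: make the Atlas hook jars available to Hive
export HIVE_AUX_JARS_PATH=/opt/atlas/hook/hive</verbatim>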

The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details; an example snippet follows the list:
   * atlas.hook.hive.synchronous   - boolean, true to run the hook synchronously. default false. Recommended to be set to false to avoid delays in Hive query completion.
   * atlas.hook.hive.numRetries    - number of retries for notification failure. default 3
   * atlas.hook.hive.minThreads    - core number of threads. default 1
   * atlas.hook.hive.maxThreads    - maximum number of threads. default 5
   * atlas.hook.hive.keepAliveTime - keep alive time in msecs. default 10
   * atlas.hook.hive.queueSize     - queue size for the threadpool. default 10000
   * atlas.cluster.name            - clusterName used in the qualifiedName of entities. default primary
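
For example, a hypothetical set of overrides in <atlas-conf>/atlas-application.properties that keeps the hook asynchronous while allowing more retries and threads (the values are illustrative, not recommendations):
<verbatim>
atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=5
atlas.hook.hive.minThreads=1
atlas.hook.hive.maxThreads=10
atlas.hook.hive.keepAliveTime=10
atlas.hook.hive.queueSize=10000</verbatim>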

Refer to [[Configuration][Configuration]] for notification-related configurations.

---++ Column Level Lineage

Starting with the 0.8-incubating version of Atlas, column level lineage is captured in Atlas. The details are below.

---+++ Model
   * !ColumnLineageProcess type is a subtype of Process

   * This relates an output Column to a set of input Columns or the Input Table

   * The lineage also captures the kind of dependency, as listed below:
      * SIMPLE:     output column has the same value as the input
      * EXPRESSION: output column is transformed by some expression at runtime (e.g. a Hive SQL expression) on the input columns.
      * SCRIPT:     output column is transformed by a user provided script.

   * In the case of an EXPRESSION dependency, the expression attribute contains the expression in string form

   * Since Process links input and output !DataSets, Column is a subtype of !DataSet

---+++ Examples
For the simple CTAS below:
<verbatim>
create table t2 as select id, name from T1</verbatim>

The lineage is captured as

<img src="images/column_lineage_ex1.png" height="200" width="400" />
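
Both columns above are copied unchanged, so their lineage edges carry the SIMPLE dependency. A query that transforms a column, such as the illustrative one below, would instead produce an EXPRESSION dependency for upper_name, with the transform recorded in the expression attribute:
<verbatim>
create table t3 as select id, upper(name) as upper_name from t1</verbatim>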



---+++ Extracting Lineage from Hive commands
  * The !HiveHook maps the !LineageInfo in the !HookContext to Column lineage instances

  * The !LineageInfo in Hive provides column-level lineage for the final !FileSinkOperator, linking the output columns to the input columns in the Hive query

---++ NOTES
   * Column level lineage works with Hive version 1.2.1 after the patch for <a href="https://issues.apache.org/jira/browse/HIVE-13112">HIVE-13112</a> is applied to the Hive source
   * Since database names, table names and column names are case insensitive in Hive, the corresponding names in entities are lowercase. So, any search APIs should use lowercase while querying on the entity names (see the example after this list)
   * The following Hive operations are currently captured by the Hive hook:
      * create database
      * create table/view, create table as select
      * load, import, export
      * DMLs (insert)
      * alter database
      * alter table (skewed table information, stored as, and protection are not supported)
      * alter view
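
As an example of the note on lowercase names above, an illustrative Atlas DSL search for the table created in the CTAS example earlier would use the lowercase name:
<verbatim>
hive_table where name = "t2"</verbatim>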