---
name: Sqoop
route: /HookSqoop
menu: Documentation
submenu: Hooks
---

import  themen  from 'theme/styles/styled-colors';
import  * as theme  from 'react-syntax-highlighter/dist/esm/styles/hljs';
import SyntaxHighlighter from 'react-syntax-highlighter';

# Apache Atlas Hook for Apache Sqoop

## Sqoop Model
Sqoop model includes the following types:
   * Entity types:
      * sqoop_process
         * super-types: Process
         * attributes: qualifiedName, name, description, owner, inputs, outputs, operation, commandlineOpts, startTime, endTime, userName
      * sqoop_dbdatastore
         * super-types: DataSet
         * attributes: qualifiedName, name, description, owner, dbStoreType, storeUse, storeUri, source
   * Enum types:
      * sqoop_operation_type
         * values: IMPORT, EXPORT, EVAL
      * sqoop_dbstore_usage
         * values: TABLE, QUERY, PROCEDURE, OTHER

Sqoop entities are created and de-duplicated in Atlas using the unique attribute qualifiedName, whose value must be formatted as shown below.

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`sqoop_process.qualifiedName:     sqoop <operation> --connect <url> {[--table <tableName>] || [--database <databaseName>]} [--query <storeQuery>]
sqoop_dbdatastore.qualifiedName: <storeType> --url <storeUri> {[--table <tableName>] || [--database <databaseName>]} [--query <storeQuery>]  --hive-<operation> --hive-database <databaseName> [--hive-table <tableName>] --hive-cluster <clusterName>`}
</SyntaxHighlighter>
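
As an illustration of the format above, the following shell sketch assembles a sqoop_process qualifiedName for the case where a table is given and no query is used. The function name and the sample values are hypothetical. The hook builds this string internally, so the sketch only mirrors the documented pattern and the exact string emitted by a given Atlas version may differ.

```shell
# Hypothetical helper, not part of Atlas or Sqoop: builds a string of the form
#   sqoop <operation> --connect <url> --table <tableName>
sqoop_process_qualified_name() {
  operation="$1"   # e.g. import
  url="$2"         # JDBC connect URL
  table="$3"       # source table name
  printf 'sqoop %s --connect %s --table %s\n' "$operation" "$url" "$table"
}

# Sample values are placeholders.
sqoop_process_qualified_name import jdbc:mysql://db.example.com/sales customers
```

Because qualifiedName is the de-duplication key, two jobs that produce the same string map to the same entity in Atlas.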

## Sqoop Hook
Sqoop added a SqoopJobDataPublisher that publishes data to Atlas after an import job completes. Currently, only hiveImport is supported in SqoopHook.
The hook uses this publisher to add entities to Atlas using the model detailed above.

Follow the instructions below to set up the Atlas hook in Sqoop:

   * Set up the Atlas hook in `<sqoop-conf>`/sqoop-site.xml by adding the following property:

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`<property>
     <name>sqoop.job.data.publish.class</name>
     <value>org.apache.atlas.sqoop.hook.SqoopHook</value>
</property>`}
</SyntaxHighlighter>


   * Untar apache-atlas-${project.version}-sqoop-hook.tar.gz
   * cd apache-atlas-sqoop-hook-${project.version}
   * Copy the entire contents of the folder apache-atlas-sqoop-hook-${project.version}/hook/sqoop to `<atlas package>`/hook/sqoop
   * Copy `<atlas-conf>`/atlas-application.properties to the Sqoop conf directory `<sqoop-conf>`/
   * Link `<atlas package>`/hook/sqoop/*.jar in the Sqoop lib directory
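
The manual steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not an official installer: the function name and its four arguments are hypothetical, the tarball is assumed to be in the current directory, and the Sqoop conf and lib directories are assumed to be `conf/` and `lib/` under a single Sqoop home, which may not match your installation.

```shell
# Hypothetical helper mirroring the manual steps. All arguments are
# assumptions about your environment; pass absolute paths so the
# symlinks created at the end stay valid.
install_atlas_sqoop_hook() {
  version="$1"     # Atlas version (the project.version placeholder)
  atlas_home="$2"  # the Atlas package directory
  atlas_conf="$3"  # the Atlas conf directory
  sqoop_home="$4"  # Sqoop home; conf/ and lib/ are assumed to live under it

  # Untar the hook package into the current directory.
  tar -xzf "apache-atlas-${version}-sqoop-hook.tar.gz" || return 1

  # Copy the hook contents into the Atlas package.
  mkdir -p "${atlas_home}/hook/sqoop" || return 1
  cp -R "apache-atlas-sqoop-hook-${version}/hook/sqoop/." \
        "${atlas_home}/hook/sqoop/" || return 1

  # Make the Atlas client configuration visible to Sqoop.
  cp "${atlas_conf}/atlas-application.properties" "${sqoop_home}/conf/" || return 1

  # Link the hook jars into the Sqoop lib directory.
  for jar in "${atlas_home}"/hook/sqoop/*.jar; do
    if [ -e "$jar" ]; then
      ln -sf "$jar" "${sqoop_home}/lib/" || return 1
    fi
  done
}
```

A call would look like `install_atlas_sqoop_hook "$VERSION" /opt/atlas /opt/atlas/conf /opt/sqoop`, where every path is a placeholder.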



The following properties in atlas-application.properties control the thread pool and notification details:

<SyntaxHighlighter wrapLines={true} language="shell" style={theme.dark}>
{`atlas.hook.sqoop.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in Sqoop operation completion. Default: false
atlas.hook.sqoop.numRetries=3      # number of retries for notification failure. Default: 3
atlas.hook.sqoop.queueSize=10000   # queue size for the threadpool. Default: 10000
atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary
atlas.kafka.zookeeper.connect=                    # Zookeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000
atlas.kafka.zookeeper.session.timeout.ms=60000    # Zookeeper session timeout. Default: 60000
atlas.kafka.zookeeper.sync.time.ms=20             # Zookeeper sync time. Default: 20`}
</SyntaxHighlighter>

Other configurations for the Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.". For the list of configurations supported by the Kafka producer, refer to [Kafka Producer Configs](http://kafka.apache.org/documentation/#producerconfigs).
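
To make the prefixing rule concrete, the sketch below turns a Kafka producer setting into the corresponding atlas-application.properties line. The helper name is hypothetical, and the two settings (`acks`, `compression.type`) are ordinary Kafka producer configs chosen only as examples, not recommended values.

```shell
# Hypothetical helper: prefixes a Kafka producer setting with "atlas.kafka."
# so it can be placed in atlas-application.properties.
atlas_kafka_property() {
  printf 'atlas.kafka.%s=%s\n' "$1" "$2"
}

atlas_kafka_property acks all                # prints: atlas.kafka.acks=all
atlas_kafka_property compression.type gzip   # prints: atlas.kafka.compression.type=gzip
```

The output can be appended to the properties file copied into the Sqoop conf directory, for example by redirecting it with `>>`.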

## NOTES
   * Currently, the Sqoop hook captures only the following Sqoop operations:
      * hiveImport
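
For orientation, a Hive import of roughly the following shape is the kind of job the hook captures. This is a sketch: the wrapper function name is hypothetical, and the connection URL, user, table and database names are placeholders. It assumes the `sqoop` CLI is on the PATH and the hook is configured in sqoop-site.xml.

```shell
# Placeholder values throughout; adjust to your environment.
run_hive_import() {
  # -P prompts for the database password on the console.
  sqoop import \
    --connect "jdbc:mysql://db.example.com/sales" \
    --username "etl_user" -P \
    --table "customers" \
    --hive-import \
    --hive-database "default" \
    --hive-table "customers"
}
```

Import jobs that do not use `--hive-import`, as well as export jobs, fall outside the operations listed above.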