Commit 880ea4b6 by Madhan Neethiraj

ATLAS-2647: updated documentation on notification, hooks and basic-search

parent 1fc88ce3
...@@ -48,11 +48,11 @@ notification events. Events are written by the hooks and Atlas to different Kafk
Atlas supports integration with many sources of metadata out of the box. More integrations will be added in future
as well. Currently, Atlas supports ingesting and managing metadata from the following sources:
* [[Hook-HBase][HBase]]
* [[Hook-Hive][Hive]]
* [[Hook-Sqoop][Sqoop]]
* [[Hook-Storm][Storm]]
* [[Bridge-Kafka][Kafka]]

The integration implies two things:
There are metadata models that Atlas defines natively to represent objects of these components.
......
---+ HBase Atlas Bridge
---++ HBase Model
The default HBase model includes the following types:
* Entity types:
* hbase_namespace
* super-types: !Asset
* attributes: name, owner, description, type, classifications, term, clustername, parameters, createtime, modifiedtime, qualifiedName
* hbase_table
* super-types: !DataSet
* attributes: name, owner, description, type, classifications, term, uri, column_families, namespace, parameters, createtime, modifiedtime, maxfilesize,
isReadOnly, isCompactionEnabled, isNormalizationEnabled, ReplicaPerRegion, Durability, qualifiedName
* hbase_column_family
* super-types: !DataSet
* attributes: name, owner, description, type, classifications, term, columns, createtime, bloomFilterType, compressionType, CompactionCompressionType, EncryptionType,
inMemoryCompactionPolicy, keepDeletedCells, Maxversions, MinVersions, datablockEncoding, storagePolicy, Ttl, blockCachedEnabled, cacheBloomsOnWrite,
cacheDataOnWrite, EvictBlocksOnClose, PrefetchBlocksOnOpen, NewVersionsBehavior, isMobEnabled, MobCompactPartitionPolicy, qualifiedName
The entities are created and de-duped using unique qualified name. They provide namespace and can be used for querying as well:
* hbase_namespace.qualifiedName - <namespace>@<clusterName>
* hbase_table.qualifiedName - <namespace>:<tableName>@<clusterName>
* hbase_column_family.qualifiedName - <namespace>:<tableName>.<columnFamily>@<clusterName>
---++ Importing HBase Metadata
org.apache.atlas.hbase.bridge.HBaseBridge imports the HBase metadata into Atlas using the model defined above. import-hbase.sh command can be used to facilitate this.
<verbatim>
Usage 1: <atlas package>/hook-bin/import-hbase.sh
Usage 2: <atlas package>/hook-bin/import-hbase.sh [-n <namespace regex> OR --namespace <namespace regex>] [-t <table regex> OR --table <table regex>]
Usage 3: <atlas package>/hook-bin/import-hbase.sh [-f <filename>]
File Format:
namespace1:tbl1
namespace1:tbl2
namespace2:tbl1
</verbatim>
The logs are in <atlas package>/logs/import-hbase.log
---++ HBase Hook
Atlas HBase hook registers with HBase to listen for create/update/delete operations and updates the metadata in Atlas, via Kafka notifications, for the changes in HBase.
Follow the instructions below to setup Atlas hook in HBase:
* Set-up Atlas hook in hbase-site.xml by adding the following:
<verbatim>
<property>
<name>hbase.coprocessor.master.classes</name>
<value>org.apache.atlas.hbase.hook.HBaseAtlasCoprocessor</value>
</property></verbatim>
* Copy <atlas package>/hook/hbase/<All files and folders> to the HBase class path. HBase hook binary files are present in apache-atlas-<release-version>-SNAPSHOT-hbase-hook.tar.gz
* Copy <atlas-conf>/atlas-application.properties to the hbase conf directory.
The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details:
* atlas.hook.hbase.synchronous - boolean, true to run the hook synchronously. default false. Recommended to be set to false to avoid delays in HBase operations.
* atlas.hook.hbase.numRetries - number of retries for notification failure. default 3
* atlas.hook.hbase.minThreads - core number of threads. default 1
* atlas.hook.hbase.maxThreads - maximum number of threads. default 5
* atlas.hook.hbase.keepAliveTime - keep alive time in msecs. default 10
* atlas.hook.hbase.queueSize - queue size for the threadpool. default 10000
Refer [[Configuration][Configuration]] for notification related configurations
---++ NOTES
* Only the namespace, table and column-family create/update/delete operations are captured by the hook. Column changes won't be captured and propagated.
\ No newline at end of file
---+ Apache Atlas Hook for Apache Kafka

---++ Kafka Model
Kafka model includes the following types:
* Entity types:
* kafka_topic
* super-types: !DataSet
* attributes: qualifiedName, name, description, owner, topic, uri, partitionCount

Kafka entities are created and de-duped in Atlas using unique attribute qualifiedName, whose value should be formatted as detailed below.
Note that qualifiedName will have topic name in lower case.

<verbatim>
topic.qualifiedName: <topic>@<clusterName>
</verbatim>

---++ Setup
Binary files are present in apache-atlas-<release-version>-kafka-hook.tar.gz

Copy apache-atlas-kafka-hook-<release-version>/hook/kafka folder to <atlas package>/hook/ directory

Copy apache-atlas-kafka-hook-<release-version>/hook-bin folder to <atlas package>/hook-bin directory

* Copy <atlas-conf>/atlas-application.properties to the Kafka conf directory.

---++ Importing Kafka Metadata
Apache Atlas provides a command-line utility, import-kafka.sh, to import metadata of Apache Kafka topics into Apache Atlas.
This utility can be used to initialize Apache Atlas with topics present in Apache Kafka.
This utility supports importing metadata of a specific topic or all topics.

<verbatim>
Usage 1: <atlas package>/hook-bin/import-kafka.sh
Usage 2: <atlas package>/hook-bin/import-kafka.sh [-t <topic prefix> OR --topic <topic prefix>]
Usage 3: <atlas package>/hook-bin/import-kafka.sh [-f <filename>]
File Format:
topic1
topic2
topic3
</verbatim>

The logs are in <atlas package>/logs/import-kafka.log

Refer [[Configuration][Configuration]] for notification related configurations
---+ Sqoop Atlas Bridge
---++ Sqoop Model
The default Sqoop model includes the following types:
* Entity types:
* sqoop_process
* super-types: Process
* attributes: name, operation, dbStore, hiveTable, commandlineOpts, startTime, endTime, userName
* sqoop_dbdatastore
* super-types: !DataSet
* attributes: name, dbStoreType, storeUse, storeUri, source, description, ownerName
* Enum types:
* sqoop_operation_type
* values: IMPORT, EXPORT, EVAL
* sqoop_dbstore_usage
* values: TABLE, QUERY, PROCEDURE, OTHER
The entities are created and de-duped using unique qualified name. They provide namespace and can be used for querying as well:
* sqoop_process.qualifiedName - dbStoreType-storeUri-endTime
* sqoop_dbdatastore.qualifiedName - dbStoreType-storeUri-source
---++ Sqoop Hook
Sqoop added a !SqoopJobDataPublisher that publishes data to Atlas after completion of an import job. Today, only hiveImport is supported in !SqoopHook.
This is used to add entities in Atlas using the model detailed above.
Follow the instructions below to set up the Atlas hook in Sqoop:
Add the following properties to enable the Atlas hook in Sqoop:
* Set-up Atlas hook in <sqoop-conf>/sqoop-site.xml by adding the following:
<verbatim>
<property>
<name>sqoop.job.data.publish.class</name>
<value>org.apache.atlas.sqoop.hook.SqoopHook</value>
</property></verbatim>
* Copy <atlas-conf>/atlas-application.properties to the Sqoop conf directory <sqoop-conf>/
* Link <atlas-home>/hook/sqoop/*.jar into the Sqoop lib directory
Refer [[Configuration][Configuration]] for notification related configurations
---++ NOTES
* Only the following sqoop operations are captured by sqoop hook currently - hiveImport
---+ Apache Atlas Hook & Bridge for Apache HBase
---++ HBase Model
HBase model includes the following types:
* Entity types:
* hbase_namespace
* super-types: !Asset
* attributes: qualifiedName, name, description, owner, clusterName, parameters, createTime, modifiedTime
* hbase_table
* super-types: !DataSet
* attributes: qualifiedName, name, description, owner, namespace, column_families, uri, parameters, createtime, modifiedtime, maxfilesize, isReadOnly, isCompactionEnabled, isNormalizationEnabled, ReplicaPerRegion, Durability
* hbase_column_family
* super-types: !DataSet
* attributes: qualifiedName, name, description, owner, columns, createTime, bloomFilterType, compressionType, compactionCompressionType, encryptionType, inMemoryCompactionPolicy, keepDeletedCells, maxversions, minVersions, datablockEncoding, storagePolicy, ttl, blockCachedEnabled, cacheBloomsOnWrite, cacheDataOnWrite, evictBlocksOnClose, prefetchBlocksOnOpen, newVersionsBehavior, isMobEnabled, mobCompactPartitionPolicy
HBase entities are created and de-duped in Atlas using unique attribute qualifiedName, whose value should be formatted as detailed below. Note that namespaceName, tableName and columnFamilyName should be in lower case.
<verbatim>
hbase_namespace.qualifiedName: <namespaceName>@<clusterName>
hbase_table.qualifiedName: <namespaceName>:<tableName>@<clusterName>
hbase_column_family.qualifiedName: <namespaceName>:<tableName>.<columnFamilyName>@<clusterName>
</verbatim>
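The snippet below is a minimal sketch of how these qualifiedName values can be assembled; the helper class and method names are illustrative and are not part of Atlas.
<verbatim>
// Illustrative helper (not part of Atlas) that builds qualifiedName values
// following the formats documented above.
public final class HBaseQualifiedNames {
    private HBaseQualifiedNames() { }

    public static String namespace(String namespaceName, String clusterName) {
        return namespaceName.toLowerCase() + "@" + clusterName;
    }

    public static String table(String namespaceName, String tableName, String clusterName) {
        return namespaceName.toLowerCase() + ":" + tableName.toLowerCase() + "@" + clusterName;
    }

    public static String columnFamily(String namespaceName, String tableName, String columnFamilyName, String clusterName) {
        return namespaceName.toLowerCase() + ":" + tableName.toLowerCase() + "." + columnFamilyName.toLowerCase() + "@" + clusterName;
    }
}

// Example: columnFamily("default", "customers", "contact", "primary")
// evaluates to "default:customers.contact@primary"
</verbatim>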
---++ HBase Hook
Atlas HBase hook registers with HBase master as a co-processor. On detecting changes to HBase namespaces/tables/column-families, Atlas hook updates the metadata in Atlas via Kafka notifications.
Follow the instructions below to setup Atlas hook in HBase:
* Register Atlas hook in hbase-site.xml by adding the following:
<verbatim>
<property>
<name>hbase.coprocessor.master.classes</name>
<value>org.apache.atlas.hbase.hook.HBaseAtlasCoprocessor</value>
</property></verbatim>
* Copy entire contents of folder <atlas package>/hook/hbase to HBase class path.
* Copy <atlas-conf>/atlas-application.properties to the HBase conf directory.
The following properties in atlas-application.properties control the thread pool and notification details:
<verbatim>
atlas.hook.hbase.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in HBase operations. Default: false
atlas.hook.hbase.numRetries=3 # number of retries for notification failure. Default: 3
atlas.hook.hbase.queueSize=10000 # queue size for the threadpool. Default: 10000
atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary
atlas.kafka.zookeeper.connect= # Zookeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000
atlas.kafka.zookeeper.session.timeout.ms=60000 # Zookeeper session timeout. Default: 60000
atlas.kafka.zookeeper.sync.time.ms=20 # Zookeeper sync time. Default: 20
</verbatim>
Other configurations for Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.".
For list of configuration supported by Kafka producer, please refer to [[http://kafka.apache.org/documentation/#producerconfigs][Kafka Producer Configs]]
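To illustrate this prefixing convention, the sketch below shows how properties carrying the "atlas.kafka." prefix could be collected into a java.util.Properties object for a Kafka producer. This is a simplified illustration of the convention, not the hook's actual implementation.
<verbatim>
import java.util.Properties;

// Simplified illustration of the "atlas.kafka." prefixing convention:
// entries such as atlas.kafka.zookeeper.connect are passed to Kafka
// with the prefix stripped (zookeeper.connect).
public class AtlasKafkaProperties {
    private static final String PREFIX = "atlas.kafka.";

    public static Properties extractKafkaProperties(Properties atlasProperties) {
        Properties kafkaProperties = new Properties();
        for (String name : atlasProperties.stringPropertyNames()) {
            if (name.startsWith(PREFIX)) {
                kafkaProperties.setProperty(name.substring(PREFIX.length()), atlasProperties.getProperty(name));
            }
        }
        return kafkaProperties;
    }
}
</verbatim>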
---++ NOTES
* Only the namespace, table and column-family create/update/delete operations are captured by Atlas HBase hook. Changes to columns are not captured.
---++ Importing HBase Metadata
Apache Atlas provides a command-line utility, import-hbase.sh, to import metadata of Apache HBase namespaces and tables into Apache Atlas.
This utility can be used to initialize Apache Atlas with namespaces/tables present in an Apache HBase cluster.
This utility supports importing metadata of a specific table, tables in a specific namespace or all tables.
<verbatim>
Usage 1: <atlas package>/hook-bin/import-hbase.sh
Usage 2: <atlas package>/hook-bin/import-hbase.sh [-n <namespace regex> OR --namespace <namespace regex>] [-t <table regex> OR --table <table regex>]
Usage 3: <atlas package>/hook-bin/import-hbase.sh [-f <filename>]
File Format:
namespace1:tbl1
namespace1:tbl2
namespace2:tbl1
</verbatim>
---+ Apache Atlas Hook & Bridge for Apache Hive

---++ Hive Model
Hive model includes the following types:
* Entity types:
* hive_db
* super-types: !Asset
* attributes: qualifiedName, name, description, owner, clusterName, location, parameters, ownerName
* hive_table
* super-types: !DataSet
* attributes: qualifiedName, name, description, owner, db, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary
* hive_column
* super-types: !DataSet
* attributes: qualifiedName, name, description, owner, type, comment, table
* hive_storagedesc
* super-types: Referenceable
* attributes: qualifiedName, table, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories
* hive_process
* super-types: Process
* attributes: qualifiedName, name, description, owner, inputs, outputs, startTime, endTime, userName, operationType, queryText, queryPlan, queryId, clusterName
* hive_column_lineage
* super-types: Process
* attributes: qualifiedName, name, description, owner, inputs, outputs, query, depenendencyType, expression
* Enum types:
* hive_principal_type
...@@ -32,19 +32,13 @@ The default hive model includes the following types:
* hive_serde
* attributes: name, serializationLib, parameters

Hive entities are created and de-duped in Atlas using unique attribute qualifiedName, whose value should be formatted as detailed below. Note that dbName, tableName and columnName should be in lower case.
<verbatim>
hive_db.qualifiedName:     <dbName>@<clusterName>
hive_table.qualifiedName:  <dbName>.<tableName>@<clusterName>
hive_column.qualifiedName: <dbName>.<tableName>.<columnName>@<clusterName>
hive_process.queryString:  trimmed query string in lower case
</verbatim>
---++ Hive Hook
...@@ -59,15 +53,21 @@ Follow the instructions below to setup Atlas hook in Hive:
* Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' in hive-env.sh of your hive configuration
* Copy <atlas-conf>/atlas-application.properties to the hive conf directory.

The following properties in atlas-application.properties control the thread pool and notification details:
<verbatim>
atlas.hook.hive.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in Hive query completion. Default: false
atlas.hook.hive.numRetries=3      # number of retries for notification failure. Default: 3
atlas.hook.hive.queueSize=10000   # queue size for the threadpool. Default: 10000

atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary

atlas.kafka.zookeeper.connect=                    # Zookeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000
atlas.kafka.zookeeper.session.timeout.ms=60000    # Zookeeper session timeout. Default: 60000
atlas.kafka.zookeeper.sync.time.ms=20             # Zookeeper sync time. Default: 20
</verbatim>

Other configurations for Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.". For list of configuration supported by Kafka producer, please refer to [[http://kafka.apache.org/documentation/#producerconfigs][Kafka Producer Configs]]
---++ Column Level Lineage
...@@ -114,3 +114,19 @@ The lineage is captured as
* alter database
* alter table (skewed table information, stored as, protection is not supported)
* alter view
---++ Importing Hive Metadata
Apache Atlas provides a command-line utility, import-hive.sh, to import metadata of Apache Hive databases and tables into Apache Atlas.
This utility can be used to initialize Apache Atlas with databases/tables present in Apache Hive.
This utility supports importing metadata of a specific table, tables in a specific database or all databases and tables.
<verbatim>
Usage 1: <atlas package>/hook-bin/import-hive.sh
Usage 2: <atlas package>/hook-bin/import-hive.sh [-d <database regex> OR --database <database regex>] [-t <table regex> OR --table <table regex>]
Usage 3: <atlas package>/hook-bin/import-hive.sh [-f <filename>]
File Format:
database1:tbl1
database1:tbl2
database2:tbl1
</verbatim>
---+ Apache Atlas Hook for Apache Sqoop
---++ Sqoop Model
Sqoop model includes the following types:
* Entity types:
* sqoop_process
* super-types: Process
* attributes: qualifiedName, name, description, owner, inputs, outputs, operation, commandlineOpts, startTime, endTime, userName
* sqoop_dbdatastore
* super-types: !DataSet
* attributes: qualifiedName, name, description, owner, dbStoreType, storeUse, storeUri, source
* Enum types:
* sqoop_operation_type
* values: IMPORT, EXPORT, EVAL
* sqoop_dbstore_usage
* values: TABLE, QUERY, PROCEDURE, OTHER
Sqoop entities are created and de-duped in Atlas using unique attribute qualifiedName, whose value should be formatted as detailed below.
<verbatim>
sqoop_process.qualifiedName: sqoop <operation> --connect <url> {[--table <tableName>] || [--database <databaseName>]} [--query <storeQuery>]
sqoop_dbdatastore.qualifiedName: <storeType> --url <storeUri> {[--table <tableName>] || [--database <databaseName>]} [--query <storeQuery>] --hive-<operation> --hive-database <databaseName> [--hive-table <tableName>] --hive-cluster <clusterName>
</verbatim>
---++ Sqoop Hook
Sqoop added a !SqoopJobDataPublisher that publishes data to Atlas after completion of an import job. Today, only hiveImport is supported in !SqoopHook.
This is used to add entities in Atlas using the model detailed above.

Follow the instructions below to set up the Atlas hook in Sqoop:
Add the following properties to enable the Atlas hook in Sqoop:
* Set-up Atlas hook in <sqoop-conf>/sqoop-site.xml by adding the following:
<verbatim>
<property>
<name>sqoop.job.data.publish.class</name>
<value>org.apache.atlas.sqoop.hook.SqoopHook</value>
</property></verbatim>
* Copy <atlas-conf>/atlas-application.properties to the Sqoop conf directory <sqoop-conf>/
* Link <atlas-home>/hook/sqoop/*.jar into the Sqoop lib directory
The following properties in atlas-application.properties control the thread pool and notification details:
<verbatim>
atlas.hook.sqoop.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in Sqoop operation completion. Default: false
atlas.hook.sqoop.numRetries=3 # number of retries for notification failure. Default: 3
atlas.hook.sqoop.queueSize=10000 # queue size for the threadpool. Default: 10000
atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary
atlas.kafka.zookeeper.connect= # Zookeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000
atlas.kafka.zookeeper.session.timeout.ms=60000 # Zookeeper session timeout. Default: 60000
atlas.kafka.zookeeper.sync.time.ms=20 # Zookeeper sync time. Default: 20
</verbatim>
Other configurations for Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.". For list of configuration supported by Kafka producer, please refer to [[http://kafka.apache.org/documentation/#producerconfigs][Kafka Producer Configs]]
---++ NOTES
* Only the following sqoop operations are captured by sqoop hook currently
* hiveImport
---+ Apache Atlas Hook for Apache Storm

---++ Introduction
......
---+ Entity Change Notifications
To receive Atlas entity notifications, a consumer should be obtained through the notification interface. Entity change notifications are sent every time a change is made to an entity. Operations that result in an entity change notification are:
* <code>ENTITY_CREATE</code> - Create a new entity.
* <code>ENTITY_UPDATE</code> - Update an attribute of an existing entity.
* <code>TRAIT_ADD</code> - Add a trait to an entity.
* <code>TRAIT_DELETE</code> - Delete a trait from an entity.
<verbatim>
// Obtain provider through injection…
Provider<NotificationInterface> provider;
// Get the notification interface
NotificationInterface notification = provider.get();
// Create consumers
List<NotificationConsumer<EntityNotification>> consumers =
notification.createConsumers(NotificationInterface.NotificationType.ENTITIES, 1);
</verbatim>
The consumer exposes the Iterator interface that should be used to get the entity notifications as they are posted. The hasNext() method blocks until a notification is available.
<verbatim>
// Use one of the consumers created above
NotificationConsumer<EntityNotification> consumer = consumers.get(0);

while(consumer.hasNext()) {
    EntityNotification notification = consumer.next();
    IReferenceableInstance entity    = notification.getEntity();
}
</verbatim>
---+ Notifications
---++ Notifications from Apache Atlas
Apache Atlas sends notifications about metadata changes to a Kafka topic named ATLAS_ENTITIES.
Applications interested in metadata changes can monitor for these notifications.
For example, Apache Ranger processes these notifications to authorize data access based on classifications.
---+++ Notifications - V2: Apache Atlas version 1.0
Apache Atlas 1.0 sends notifications for the following operations on metadata.
<verbatim>
ENTITY_CREATE: sent when an entity instance is created
ENTITY_UPDATE: sent when an entity instance is updated
ENTITY_DELETE: sent when an entity instance is deleted
CLASSIFICATION_ADD: sent when classifications are added to an entity instance
CLASSIFICATION_UPDATE: sent when classifications of an entity instance are updated
CLASSIFICATION_DELETE: sent when classifications are removed from an entity instance
</verbatim>
Notifications include the following data.
<verbatim>
AtlasEntity entity;
OperationType operationType;
List<AtlasClassification> classifications;
</verbatim>
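For example, an application can consume these notifications directly from the ATLAS_ENTITIES Kafka topic. The sketch below uses the plain Kafka consumer API and assumes a broker at localhost:9092 and a recent kafka-clients library; each record value is the JSON-serialized notification described above.
<verbatim>
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtlasEntitiesListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local Kafka broker
        props.put("group.id", "atlas-entities-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("ATLAS_ENTITIES"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    // each record value is a JSON notification carrying entity, operationType and classifications
                    System.out.println(record.value());
                }
            }
        }
    }
}
</verbatim>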
---+++ Notifications - V1: Apache Atlas version 0.8.x and earlier
Notifications from Apache Atlas version 0.8.x and earlier have content formatted differently, as detailed below.
__Operations__
<verbatim>
ENTITY_CREATE: sent when an entity instance is created
ENTITY_UPDATE: sent when an entity instance is updated
ENTITY_DELETE: sent when an entity instance is deleted
TRAIT_ADD: sent when classifications are added to an entity instance
TRAIT_UPDATE: sent when classifications of an entity instance are updated
TRAIT_DELETE: sent when classifications are removed from an entity instance
</verbatim>
Notifications include the following data.
<verbatim>
Referenceable entity;
OperationType operationType;
List<Struct> traits;
</verbatim>
Apache Atlas 1.0 can be configured to send notifications in the older version format, instead of the latest version format.
This can be helpful in deployments that are not yet ready to process notifications in the latest version format.
To configure Apache Atlas 1.0 to send notifications in the earlier version format, set the following configuration in
atlas-application.properties:
<verbatim>
atlas.notification.entity.version=v1
</verbatim>
---++ Notifications to Apache Atlas
Apache Atlas can be notified of metadata changes and lineage via notifications sent to a Kafka topic named ATLAS_HOOK.
Atlas hooks for Apache Hive/Apache HBase/Apache Storm/Apache Sqoop use this mechanism to notify Apache Atlas of events of interest.
<verbatim>
ENTITY_CREATE : create an entity. For more details, refer to Java class HookNotificationV1.EntityCreateRequest
ENTITY_FULL_UPDATE : update an entity. For more details, refer to Java class HookNotificationV1.EntityUpdateRequest
ENTITY_PARTIAL_UPDATE : update specific attributes of an entity. For more details, refer to HookNotificationV1.EntityPartialUpdateRequest
ENTITY_DELETE : delete an entity. For more details, refer to Java class HookNotificationV1.EntityDeleteRequest
ENTITY_CREATE_V2 : create an entity. For more details, refer to Java class HookNotification.EntityCreateRequestV2
ENTITY_FULL_UPDATE_V2 : update an entity. For more details, refer to Java class HookNotification.EntityUpdateRequestV2
ENTITY_PARTIAL_UPDATE_V2 : update specific attributes of an entity. For more details, refer to HookNotification.EntityPartialUpdateRequestV2
ENTITY_DELETE_V2 : delete one or more entities. For more details, refer to Java class HookNotification.EntityDeleteRequestV2
</verbatim>
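Below is a minimal sketch of publishing a notification to the ATLAS_HOOK topic with the plain Kafka producer API, assuming a broker at localhost:9092. The JSON payload shown is illustrative only; real hooks serialize HookNotification objects such as those listed above, so treat the exact message format as an assumption to verify against your Atlas version.
<verbatim>
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AtlasHookNotifier {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local Kafka broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Illustrative payload only -- actual hook messages are serialized HookNotification objects
        String message = "{ \"type\": \"ENTITY_CREATE_V2\", \"user\": \"hive\", \"entities\": { } }";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("ATLAS_HOOK", message));
            producer.flush();
        }
    }
}
</verbatim>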
...@@ -7,114 +7,111 @@ The entire query structure can be represented using the following JSON structure
<verbatim>
{
  "typeName":               "hive_column",
  "excludeDeletedEntities": true,
  "classification":         "PII",
  "query":                  "",
  "offset":                 0,
  "limit":                  25,
  "entityFilters":          { },
  "tagFilters":             { },
  "attributes":             [ "table", "qualifiedName" ]
}
</verbatim>
__Field description__
<verbatim>
typeName:               the type of entity to look for
excludeDeletedEntities: should the search exclude deleted entities? (default: true)
classification:         only include entities with given classification
query:                  any free text occurrence that the entity should have (generic/wildcard queries might be slow)
offset:                 starting offset of the result set (useful for pagination)
limit:                  max number of results to fetch
entityFilters:          entity attribute filter(s)
tagFilters:             classification attribute filter(s)
attributes:             attributes to include in the search result
</verbatim>

<img src="images/twiki/search-basic-hive_column-PII.png" height="400" width="600"/>

Attribute based filtering can be done on multiple attributes with AND/OR conditions.
__Examples of filtering (for hive_table attributes)__
* Single attribute
<verbatim>
{
  "typeName":               "hive_table",
  "excludeDeletedEntities": true,
  "offset":                 0,
  "limit":                  25,
  "entityFilters": {
     "attributeName":  "name",
     "operator":       "contains",
     "attributeValue": "customers"
  },
  "attributes": [ "db", "qualifiedName" ]
}
</verbatim>
<img src="images/twiki/search-basic-hive_table-customers.png" height="400" width="600"/>
* Multi-attribute with OR
<verbatim>
{
  "typeName":               "hive_table",
  "excludeDeletedEntities": true,
  "offset":                 0,
  "limit":                  25,
  "entityFilters": {
     "condition": "OR",
     "criterion": [
        {
           "attributeName":  "name",
           "operator":       "contains",
           "attributeValue": "customers"
        },
        {
           "attributeName":  "name",
           "operator":       "contains",
           "attributeValue": "provider"
        }
     ]
  },
  "attributes": [ "db", "qualifiedName" ]
}
</verbatim>
<img src="images/twiki/search-basic-hive_table-customers-or-provider.png" height="400" width="600"/>
* Multi-attribute with AND
<verbatim>
{
  "typeName":               "hive_table",
  "excludeDeletedEntities": true,
  "offset":                 0,
  "limit":                  25,
  "entityFilters": {
     "condition": "AND",
     "criterion": [
        {
           "attributeName":  "name",
           "operator":       "contains",
           "attributeValue": "customers"
        },
        {
           "attributeName":  "owner",
           "operator":       "eq",
           "attributeValue": "hive"
        }
     ]
  },
  "attributes": [ "db", "qualifiedName" ]
}
</verbatim>
<img src="images/twiki/search-basic-hive_table-customers-owner_is_hive.png" height="400" width="600"/>
__Supported operators for filtering__
* LT (symbols: <, lt) works with Numeric, Date attributes

...@@ -135,29 +132,28 @@ __CURL Samples__

  -u <user>:<password>
  -X POST
  -d '{
        "typeName":               "hive_table",
        "excludeDeletedEntities": true,
        "classification":         "",
        "query":                  "",
        "offset":                 0,
        "limit":                  50,
        "entityFilters": {
           "condition": "AND",
           "criterion": [
              {
                 "attributeName":  "name",
                 "operator":       "contains",
                 "attributeValue": "customers"
              },
              {
                 "attributeName":  "owner",
                 "operator":       "eq",
                 "attributeValue": "hive"
              }
           ]
        },
        "attributes": [ "db", "qualifiedName" ]
      }'
  <protocol>://<atlas_host>:<atlas_port>/api/atlas/v2/search/basic
</verbatim>
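The same request can also be issued programmatically. The sketch below posts a basic-search JSON body with java.net.HttpURLConnection and basic authentication; the server URL http://localhost:21000 and the admin/admin credentials are assumptions for a local test setup.
<verbatim>
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicSearchClient {
    public static void main(String[] args) throws Exception {
        String body = "{ \"typeName\": \"hive_table\", \"excludeDeletedEntities\": true,"
                    + "  \"entityFilters\": { \"attributeName\": \"name\", \"operator\": \"contains\", \"attributeValue\": \"customers\" },"
                    + "  \"offset\": 0, \"limit\": 25, \"attributes\": [ \"db\", \"qualifiedName\" ] }";

        URL url = new URL("http://localhost:21000/api/atlas/v2/search/basic"); // assumption: local Atlas server
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // search result JSON
            }
        }
    }
}
</verbatim>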
...@@ -24,6 +24,7 @@ capabilities around these data assets for data scientists, analysts and the data
* Ability to dynamically create classifications - like PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE
* Classifications can include attributes - like expiry_date attribute in EXPIRES_ON classification
* Entities can be associated with multiple classifications, enabling easier discovery and security enforcement
* Propagation of classifications via lineage - automatically ensures that classifications follow the data as it goes through various processing
---+++ Lineage
* Intuitive UI to view lineage of data as it moves through various processes

...@@ -35,7 +36,8 @@ capabilities around these data assets for data scientists, analysts and the data

* SQL like query language to search entities - Domain Specific Language (DSL)

---+++ Security & Data Masking
* Fine grained security for metadata access, enabling controls on access to entity instances and operations like add/update/remove classifications
* Integration with Apache Ranger enables authorization/data-masking on data access based on classifications associated with entities in Apache Atlas. For example:
* who can access data classified as PII, SENSITIVE
* customer-service users can only see last 4 digits of columns classified as NATIONAL_ID

...@@ -50,20 +52,18 @@ capabilities around these data assets for data scientists, analysts and the data

* [[Architecture][High Level Architecture]]
* [[TypeSystem][Type System]]
* [[Search - Basic][Search: Basic]]
* [[Search - Advanced][Search: Advanced]]
* [[security][Security]]
* [[Authentication-Authorization][Authentication and Authorization]]
* [[Configuration][Configuration]]
* [[Notifications][Notifications]]
* Hooks & Bridges
* [[Hook-HBase][HBase Hook & Bridge]]
* [[Hook-Hive][Hive Hook & Bridge]]
* [[Hook-Sqoop][Sqoop Hook]]
* [[Hook-Storm][Storm Hook]]
* [[Bridge-Kafka][Kafka Bridge]]
* [[HighAvailability][Fault Tolerance And High Availability Options]]

---++ API Documentation
......