Commit 5c2f7a0c by Madhan Neethiraj, committed by kevalbhatt

ATLAS-2365: updated README for 1.0.0-alpha release

Signed-off-by: kevalbhatt <kbhatt@apache.org>
parent 39be2ccf
...@@ -15,6 +15,7 @@
# limitations under the License.
Apache Atlas Overview
=====================
Apache Atlas framework is an extensible set of core
foundational governance services – enabling enterprises to effectively and
...@@ -31,6 +32,16 @@ The metadata veracity is maintained by leveraging Apache Ranger to prevent
non-authorized access paths to data at runtime.
Security is both role based (RBAC) and attribute based (ABAC).
Apache Atlas 1.0.0-alpha release
================================
Please note that this is an alpha/technical-preview release and is not
recommended for production use. There is no support for migration of data
from earlier versions of Apache Atlas. Also, data generated using this
alpha release may not migrate to the Apache Atlas 1.0 GA release.
Build Process
=============
...@@ -51,14 +62,6 @@ Build Process
$ export MAVEN_OPTS="-Xms2g -Xmx2g"
$ mvn clean install
# currently a few tests might fail in some environments
# (timing issue?); the community is reviewing and updating
# such tests.
#
# if you see test failures, please run the following command:
$ mvn clean -DskipTests install
$ mvn clean package -Pdist
3. After the above build commands complete successfully, you should see the following files:
...@@ -68,3 +71,5 @@ Build Process
addons/hive-bridge/target/hive-bridge-<version>.jar
addons/sqoop-bridge/target/sqoop-bridge-<version>.jar
addons/storm-bridge/target/storm-bridge-<version>.jar
4. For more details on building and running Apache Atlas, please refer to http://atlas.apache.org/InstallationSteps.html
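5. After a -Pdist build, the server distribution can be unpacked and started locally for a quick smoke test. The commands below are a sketch, not part of the official steps; the exact tarball name and unpacked directory (shown here as apache-atlas-<version>-server.tar.gz) can differ between releases, so adjust them to what your build produces.

       $ tar xzf distro/target/apache-atlas-<version>-server.tar.gz
       $ cd apache-atlas-<version>
       $ bin/atlas_start.py     # Atlas web/REST endpoint is typically http://localhost:21000
       $ bin/atlas_stop.py      # stop the server when done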
...@@ -77,6 +77,9 @@
<version>1.6</version>
</dependency>
</dependencies>
<configuration>
<port>8080</port>
</configuration>
<executions>
<execution>
<goals>
......
...@@ -8,8 +8,7 @@
The components of Atlas can be grouped under the following major categories:
---+++ Core
Atlas core includes the following components:
*Type System*: Atlas allows users to define a model for the metadata objects they want to manage. The model is composed
of definitions called ‘types’. Instances of ‘types’ called ‘entities’ represent the actual metadata objects that are
...@@ -21,25 +20,18 @@ One key point to note is that the generic nature of the modelling in Atlas allow
define both technical metadata and business metadata. It is also possible to define rich relationships between the
two using features of Atlas.
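As an illustration of working with the type system, a new entity type can be registered through the REST API. The sketch below assumes an Atlas server on localhost:21000 with the default admin/admin credentials; the type name sample_dataset and its attribute are invented for the example:
<verbatim>
curl -u admin:admin -H 'Content-Type: application/json' \
     -X POST http://localhost:21000/api/atlas/v2/types/typedefs \
     -d '{
           "entityDefs": [ {
             "name":          "sample_dataset",
             "superTypes":    [ "DataSet" ],
             "attributeDefs": [ { "name": "owner", "typeName": "string" } ]
           } ]
         }'
</verbatim>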
*Graph Engine*: Internally, Atlas persists metadata objects it manages using a Graph model. This approach provides great
flexibility and enables efficient handling of rich relationships between the metadata objects. The graph engine component is
responsible for translating between types and entities of the Atlas type system, and the underlying graph persistence model.
In addition to managing the graph objects, the graph engine also creates the appropriate indices for the metadata
objects so that they can be searched efficiently. Atlas uses JanusGraph to store the metadata objects.
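For reference, the storage and index backends used by the graph engine are selected in <atlas-conf>/atlas-application.properties. The snippet below is a sketch with illustrative host names; the property names follow the standard atlas.graph.* configuration:
<verbatim>
atlas.graph.storage.backend=hbase
atlas.graph.storage.hostname=zk-host1:2181,zk-host2:2181
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=zk-host1:2181,zk-host2:2181
</verbatim>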
*Ingest / Export*: The Ingest component allows metadata to be added to Atlas. Similarly, the Export component exposes
metadata changes detected by Atlas to be raised as events. Consumers can consume these change events to react to
metadata changes in real time.
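For example, a single entity can be ingested through the REST API. The request below is a sketch; the server URL, credentials and attribute values are illustrative:
<verbatim>
curl -u admin:admin -H 'Content-Type: application/json' \
     -X POST http://localhost:21000/api/atlas/v2/entity \
     -d '{
           "entity": {
             "typeName": "hdfs_path",
             "attributes": {
               "qualifiedName": "hdfs://namenode:8020/data/raw@primary",
               "name": "/data/raw",
               "path": "hdfs://namenode:8020/data/raw"
             }
           }
         }'
</verbatim>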
---+++ Integration
Users can manage metadata in Atlas using two methods:
*API*: All functionality of Atlas is exposed to end users via a REST API that allows types and entities to be created,
...@@ -53,7 +45,6 @@ uses Apache Kafka as a notification server for communication between hooks and d
notification events. Events are written by the hooks and Atlas to different Kafka topics.
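The default topic names are ATLAS_HOOK (messages from hooks to Atlas) and ATLAS_ENTITIES (entity change notifications from Atlas to consumers). A consumer can be attached with the standard Kafka console tools; the broker address below is illustrative:
<verbatim>
kafka-console-consumer.sh --bootstrap-server broker-host:9092 --topic ATLAS_ENTITIES
</verbatim>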
---+++ Metadata sources
Atlas supports integration with many sources of metadata out of the box. More integrations will be added in the future
as well. Currently, Atlas supports ingesting and managing metadata from the following sources:
...@@ -61,6 +52,7 @@ as well. Currently, Atlas supports ingesting and managing metadata from the foll
* [[Bridge-Sqoop][Sqoop]]
* [[Bridge-Falcon][Falcon]]
* [[StormAtlasHook][Storm]]
* HBase - _documentation work-in-progress_
The integration implies two things:
There are metadata models that Atlas defines natively to represent objects of these components.
...@@ -80,12 +72,6 @@ for the Hadoop ecosystem having wide integration with a variety of Hadoop compon
Ranger allows security administrators to define metadata driven security policies for effective governance.
Ranger is a consumer of the metadata change events notified by Atlas.
*Business Taxonomy*: The metadata objects ingested into Atlas from the Metadata sources are primarily a form
of technical metadata. To enhance discoverability and governance capabilities, Atlas comes with a Business
Taxonomy interface that allows users to define a hierarchical set of business terms that represent their
business domain and to associate them with the metadata entities Atlas manages. Business Taxonomy is a web application that
is currently part of the Atlas Admin UI and integrates with Atlas using the REST API.
......
---+ Falcon Atlas Bridge
---++ Falcon Model
The default Falcon model includes the following types:
   * Entity types:
      * falcon_cluster
         * super-types: Infrastructure
         * attributes: timestamp, colo, owner, tags
      * falcon_feed
         * super-types: !DataSet
         * attributes: timestamp, stored-in, owner, groups, tags
      * falcon_feed_creation
         * super-types: Process
         * attributes: timestamp, stored-in, owner
      * falcon_feed_replication
         * super-types: Process
         * attributes: timestamp, owner
      * falcon_process
         * super-types: Process
         * attributes: timestamp, runs-on, owner, tags, pipelines, workflow-properties
One falcon_process entity is created for every cluster that the falcon process is defined for.
The entities are created and de-duped using the unique qualifiedName attribute. They provide namespace and can be used for querying/lineage as well. The unique attributes are:
* falcon_process.qualifiedName - <process name>@<cluster name>
* falcon_cluster.qualifiedName - <cluster name>
* falcon_feed.qualifiedName - <feed name>@<cluster name>
* falcon_feed_creation.qualifiedName - <feed name>
* falcon_feed_replication.qualifiedName - <feed name>
---++ Falcon Hook
Falcon supports listeners on falcon entity submission. This is used to add entities in Atlas using the model detailed above.
Follow the instructions below to set up the Atlas hook in Falcon:
* Add 'org.apache.atlas.falcon.service.AtlasService' to application.services in <falcon-conf>/startup.properties
* Link Atlas hook jars in Falcon classpath - 'ln -s <atlas-home>/hook/falcon/* <falcon-home>/server/webapp/falcon/WEB-INF/lib/'
* In <falcon_conf>/falcon-env.sh, set an environment variable as follows:
<verbatim>
export FALCON_SERVER_OPTS="<atlas_home>/hook/falcon/*:$FALCON_SERVER_OPTS"
</verbatim>
The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details:
* atlas.hook.falcon.synchronous - boolean, true to run the hook synchronously. default false
...@@ -40,5 +48,5 @@ The following properties in <atlas-conf>/atlas-application.properties control th
Refer to [[Configuration][Configuration]] for notification related configurations
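A sample atlas-application.properties snippet for the Falcon hook is shown below. Only atlas.hook.falcon.synchronous appears in the list above; the remaining property names are assumed to parallel the Hive hook properties documented later in this page, so treat this as a sketch:
<verbatim>
atlas.hook.falcon.synchronous=false
atlas.hook.falcon.numRetries=3
atlas.hook.falcon.minThreads=1
atlas.hook.falcon.maxThreads=5
atlas.hook.falcon.keepAliveTime=10
atlas.hook.falcon.queueSize=10000
</verbatim>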
---++ NOTES
* In the falcon cluster entity, the cluster name used should be uniform across components like hive, falcon, sqoop etc. If used with Ambari, the Ambari cluster name should be used for the cluster entity.
---+ Hive Atlas Bridge
---++ Hive Model
The default hive model includes the following types:
   * Entity types:
      * hive_db
         * super-types: Referenceable
         * attributes: name, clusterName, description, locationUri, parameters, ownerName, ownerType
      * hive_storagedesc
         * super-types: Referenceable
         * attributes: cols, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories
      * hive_column
         * super-types: Referenceable
         * attributes: name, type, comment, table
      * hive_table
         * super-types: !DataSet
         * attributes: name, db, owner, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary
      * hive_process
         * super-types: Process
         * attributes: name, startTime, endTime, userName, operationType, queryText, queryPlan, queryId
      * hive_column_lineage
         * super-types: Process
         * attributes: query, depenendencyType, expression
   * Enum types:
      * hive_principal_type
         * values: USER, ROLE, GROUP
   * Struct types:
      * hive_order
         * attributes: col, order
      * hive_serde
         * attributes: name, serializationLib, parameters
The entities are created and de-duped using a unique qualified name. They provide namespace and can be used for querying/lineage as well. Note that dbName, tableName and columnName should be in lower case. clusterName is explained below.
* hive_db.qualifiedName - <dbName>@<clusterName>
* hive_table.qualifiedName - <dbName>.<tableName>@<clusterName>
* hive_column.qualifiedName - <dbName>.<tableName>.<columnName>@<clusterName>
* hive_process.queryString - trimmed query string in lower case
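For instance, an entity can be looked up by its qualified name using the DSL search API. The request below is a sketch; the table name, cluster name and server URL are illustrative:
<verbatim>
curl -u admin:admin -G http://localhost:21000/api/atlas/v2/search/dsl \
     --data-urlencode 'query=hive_table where qualifiedName = "default.customers@primary"'
</verbatim>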
---++ Importing Hive Metadata
org.apache.atlas.hive.bridge.HiveMetaStoreBridge imports the Hive metadata into Atlas using the model defined above. The import-hive.sh command can be used to facilitate this; a sample invocation is shown below.
* For Hadoop jars, please make sure that the environment variable HADOOP_CLASSPATH is set. Another way is to set HADOOP_HOME to point to root directory of your Hadoop installation
* Similarly, for Hive jars, set HIVE_HOME to the root of Hive installation
* Set environment variable HIVE_CONF_DIR to Hive configuration directory
* Copy <atlas-conf>/atlas-application.properties to the hive conf directory
<verbatim>
Usage: <atlas package>/hook-bin/import-hive.sh
</verbatim>
The logs are in <atlas package>/logs/import-hive.log
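A typical non-kerberized invocation, with the environment prepared as described in the bullets above, might look like the following sketch; all paths are illustrative:
<verbatim>
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export HIVE_CONF_DIR=/usr/local/hive/conf
cp <atlas-conf>/atlas-application.properties $HIVE_CONF_DIR/
<atlas package>/hook-bin/import-hive.sh
</verbatim>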
If you are importing metadata in a kerberized cluster you need to run the command like this:
<verbatim>
<atlas package>/hook-bin/import-hive.sh -Dsun.security.jgss.debug=true -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.krb5.conf=[krb5.conf location] -Djava.security.auth.login.config=[jaas.conf location]
</verbatim>
* krb5.conf is typically found at /etc/krb5.conf
* for details about jaas.conf and a suggested location see the [[security][atlas security documentation]]
---++ Hive Hook
Atlas Hive hook registers with Hive to listen for create/update/delete operations and updates the metadata in Atlas, via Kafka notifications, for the changes in Hive.
Follow the instructions below to set up the Atlas hook in Hive:
* Set-up Atlas hook in hive-site.xml by adding the following:
<verbatim>
  <property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
  </property>
</verbatim>
* Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' in hive-env.sh of your hive configuration
* Copy <atlas-conf>/atlas-application.properties to the hive conf directory.
The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details (an example snippet follows the list):
* atlas.hook.hive.synchronous - boolean, true to run the hook synchronously. default false. Recommended to be set to false to avoid delays in hive query completion.
* atlas.hook.hive.numRetries - number of retries for notification failure. default 3
* atlas.hook.hive.minThreads - core number of threads. default 1
* atlas.hook.hive.maxThreads - maximum number of threads. default 5
* atlas.hook.hive.keepAliveTime - keep alive time in msecs. default 10
* atlas.hook.hive.queueSize - queue size for the threadpool. default 10000
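Put together, a snippet with the defaults listed above would look like this; the values are shown only for illustration, and any property can be omitted to keep its default:
<verbatim>
atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=3
atlas.hook.hive.minThreads=1
atlas.hook.hive.maxThreads=5
atlas.hook.hive.keepAliveTime=10
atlas.hook.hive.queueSize=10000
</verbatim>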
...@@ -76,24 +74,23 @@ Refer [[Configuration][Configuration]] for notification related configurations
Starting from the 0.8-incubating version of Atlas, column level lineage is captured in Atlas. Below are the details.
---+++ Model
* !ColumnLineageProcess type is a subtype of Process
* This relates an output Column to a set of input Columns or the Input Table
* The lineage also captures the kind of dependency, as listed below:
   * SIMPLE: output column has the same value as the input
   * EXPRESSION: output column is transformed by some expression at runtime (e.g. a Hive SQL expression) on the Input Columns.
   * SCRIPT: output column is transformed by a user provided script.
* In case of EXPRESSION dependency, the expression attribute contains the expression in string form
* Since Process links input and output !DataSets, Column is a subtype of !DataSet
---+++ Examples
For a simple CTAS below:
<verbatim>
create table t2 as select id, name from T1
</verbatim>
The lineage is captured as
...@@ -106,10 +103,8 @@ The lineage is captured as
* The !LineageInfo in Hive provides column-level lineage for the final !FileSinkOperator, linking them to the input columns in the Hive Query
---++ NOTES
* Column level lineage works with Hive version 1.2.1 after the patch for <a href="https://issues.apache.org/jira/browse/HIVE-13112">HIVE-13112</a> is applied to Hive source
* Since database name, table name and column names are case insensitive in hive, the corresponding names in entities are lowercase. So, any search APIs should use lowercase while querying on the entity names
* The following hive operations are captured by hive hook currently
   * create database
......
---+ Sqoop Atlas Bridge
---++ Sqoop Model
The default Sqoop model includes the following types:
   * Entity types:
      * sqoop_process
         * super-types: Process
         * attributes: name, operation, dbStore, hiveTable, commandlineOpts, startTime, endTime, userName
      * sqoop_dbdatastore
         * super-types: !DataSet
         * attributes: name, dbStoreType, storeUse, storeUri, source, description, ownerName
   * Enum types:
      * sqoop_operation_type
         * values: IMPORT, EXPORT, EVAL
      * sqoop_dbstore_usage
         * values: TABLE, QUERY, PROCEDURE, OTHER
The entities are created and de-duped using a unique qualified name. They provide namespace and can be used for querying as well:
* sqoop_process.qualifiedName - dbStoreType-storeUri-endTime
* sqoop_dbdatastore.qualifiedName - dbStoreType-storeUri-source
---++ Sqoop Hook
Sqoop added a !SqoopJobDataPublisher that publishes data to Atlas after completion of import Job. Today, only hiveImport is supported in !SqoopHook.
This is used to add entities in Atlas using the model detailed above.
Follow the instructions below to set up the Atlas hook in Sqoop:
* Set-up Atlas hook in <sqoop-conf>/sqoop-site.xml by adding the following:
<verbatim>
  <property>
    <name>sqoop.job.data.publish.class</name>
    <value>org.apache.atlas.sqoop.hook.SqoopHook</value>
  </property>
</verbatim>
* Copy <atlas-conf>/atlas-application.properties to the sqoop conf directory <sqoop-conf>/
* Link <atlas-home>/hook/sqoop/*.jar in sqoop lib
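The last two steps might be performed as follows; this is a sketch, and <sqoop-home>/lib is assumed to be the Sqoop library directory of your installation:
<verbatim>
cp <atlas-conf>/atlas-application.properties <sqoop-conf>/
ln -s <atlas-home>/hook/sqoop/*.jar <sqoop-home>/lib/
</verbatim>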
Refer to [[Configuration][Configuration]] for notification related configurations
---++ NOTES
* Only the following sqoop operations are captured by sqoop hook currently - hiveImport
...@@ -157,9 +157,9 @@ At a high level the following points can be called out:
---++ Metadata Store
As described above, Atlas uses JanusGraph to store the metadata it manages. By default, Atlas uses a standalone HBase
instance as the backing store for JanusGraph. In order to provide HA for the metadata store, we recommend that Atlas be
configured to use distributed HBase as the backing store for JanusGraph. Doing this implies that you could benefit from the
HA guarantees HBase provides. In order to configure Atlas to use HBase in HA mode, do the following:
* Choose an existing HBase cluster that is set up in HA mode to configure in Atlas (OR) Set up a new HBase cluster in [[http://hbase.apache.org/book.html#quickstart_fully_distributed][HA mode]].
...@@ -169,8 +169,8 @@ HA guarantees HBase provides. In order to configure Atlas to use HBase in HA mod
---++ Index Store
As described above, Atlas indexes metadata through JanusGraph to support full text search queries. In order to provide HA
for the index store, we recommend that Atlas be configured to use Solr as the backing index store for JanusGraph. In order
to configure Atlas to use Solr in HA mode, do the following:
* Choose an existing !SolrCloud cluster setup in HA mode to configure in Atlas (OR) Set up a new [[https://cwiki.apache.org/confluence/display/solr/SolrCloud][SolrCloud cluster]].
...@@ -208,4 +208,4 @@ to configure Atlas to use Kafka in HA mode, do the following:
---++ Known Issues
* If the HBase region servers hosting the Atlas table are down, Atlas would not be able to store or retrieve metadata from HBase until they are brought back online.
\ No newline at end of file
---+ Quick Start
---++ Introduction
Quick start is a simple client that adds a few sample type definitions modeled after the example shown below.
It also adds sample entities along with traits as shown in the instance graph below.
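The client is typically run from the unpacked server distribution once the Atlas server is up. The commands below are a sketch; the script name matches the one shipped in the distribution's bin directory, and the URL shown is the usual default endpoint:
<verbatim>
cd <atlas package>
bin/quick_start.py http://localhost:21000
</verbatim>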
---+++ Example Type Definitions
......
---+ Repository
---++ Introduction
...@@ -7,39 +7,49 @@ Atlas is a scalable and extensible set of core foundational governance services
enterprises to effectively and efficiently meet their compliance requirements within Hadoop and
allows integration with the whole enterprise data ecosystem.
Apache Atlas provides open metadata management and governance capabilities for organizations
to build a catalog of their data assets, classify and govern these assets and provide collaboration
capabilities around these data assets for data scientists, analysts and the data governance team.
---++ Features
---+++ Metadata types & instances
* Pre-defined types for various Hadoop and non-Hadoop metadata
* Ability to define new types for the metadata to be managed
* Types can have primitive attributes, complex attributes, object references; can inherit from other types
* Instances of types, called entities, capture metadata object details and their relationships
* REST APIs to work with types and instances allow easier integration
---+++ Classification
* Ability to dynamically create classifications - like PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE
* Classifications can include attributes - like expiry_date attribute in EXPIRES_ON classification
* Entities can be associated with multiple classifications, enabling easier discovery and security enforcement
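For example, a classification such as PII can be attached to an existing entity through the REST API. The request below is a sketch; the GUID and server details are placeholders:
<verbatim>
curl -u admin:admin -H 'Content-Type: application/json' \
     -X POST http://localhost:21000/api/atlas/v2/entity/guid/<entity-guid>/classifications \
     -d '[ { "typeName": "PII" } ]'
</verbatim>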
---+++ Lineage
* Intuitive UI to view lineage of data as it moves through various processes
* REST APIs to access and update lineage
---+++ Search/Discovery
* Intuitive UI to search entities by type, classification, attribute value or free-text
* Rich REST APIs to search by complex criteria
* SQL like query language to search entities - Domain Specific Language (DSL)
---+++ Security & Data Masking
* Integration with Apache Ranger enables authorization/data-masking based on classifications associated with entities in Apache Atlas. For example:
   * who can access data classified as PII, SENSITIVE
   * customer-service users can only see last 4 digits of columns classified as NATIONAL_ID
---++ Getting Started
* [[InstallationSteps][Build & Install]]
* [[QuickStart][Quick Start]]
---++ Documentation
* [[Architecture][High Level Architecture]]
* [[TypeSystem][Type System]]
* [[Repository][Metadata Repository]]
* [[Search][Search]]
* [[security][Security]]
* [[Authentication-Authorization][Authentication and Authorization]]
......
...@@ -43,7 +43,7 @@ The properties for configuring service authentication are:
* <code>atlas.authentication.keytab</code> - the path to the keytab file.
* <code>atlas.authentication.principal</code> - the principal to use for authenticating to the KDC. The principal is generally of the form "user/host@realm". You may use the '_HOST' token for the hostname and the local hostname will be substituted in by the runtime (e.g. "Atlas/_HOST@EXAMPLE.COM").
Note that when Atlas is configured with HBase as the storage backend in a secure cluster, the graph db (JanusGraph) needs sufficient user permissions to be able to create and access an HBase table. To grant the appropriate permissions see [[Configuration][Graph persistence engine - Hbase]].
---+++ JAAS configuration
......