Commit 5c2f7a0c by Madhan Neethiraj, committed by kevalbhatt

ATLAS-2365: updated README for 1.0.0-alpha release

Signed-off-by: kevalbhatt <kbhatt@apache.org>
parent 39be2ccf
@@ -15,6 +15,7 @@
# limitations under the License.
Apache Atlas Overview
=====================
Apache Atlas framework is an extensible set of core
foundational governance services – enabling enterprises to effectively and
@@ -31,6 +32,16 @@ The metadata veracity is maintained by leveraging Apache Ranger to prevent
non-authorized access paths to data at runtime.
Security is both role based (RBAC) and attribute based (ABAC).
Apache Atlas 1.0.0-alpha release
================================
Please note that this is an alpha/technical-preview release and is not
recommended for production use. There is no support for migration of data
from earlier versions of Apache Atlas. Also, data generated using this
alpha release may not migrate to the Apache Atlas 1.0 GA release.
Build Process
=============
@@ -51,14 +62,6 @@ Build Process
$ export MAVEN_OPTS="-Xms2g -Xmx2g"
$ mvn clean install
# currently a few tests might fail in some environments
# (likely a timing issue); the community is reviewing and
# updating such tests.
#
# if you see test failures, please run the following command:
$ mvn clean -DskipTests install
$ mvn clean package -Pdist
3. After the above build commands complete successfully, you should see the following files:
@@ -68,3 +71,5 @@ Build Process
addons/hive-bridge/target/hive-bridge-<version>.jar
addons/sqoop-bridge/target/sqoop-bridge-<version>.jar
addons/storm-bridge/target/storm-bridge-<version>.jar
4. For more details on building and running Apache Atlas, please refer to http://atlas.apache.org/InstallationSteps.html
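A minimal sketch for unpacking and starting the built server (the tarball
name and paths below are illustrative and depend on the version and build
profiles used):

  $ tar xzf distro/target/apache-atlas-<version>-server.tar.gz -C <install-dir>
  $ cd <install-dir>/apache-atlas-<version>
  $ bin/atlas_start.py     # stop later with bin/atlas_stop.py

For production-like setups, refer to the installation page above for configuring the backing stores.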
@@ -77,6 +77,9 @@
<version>1.6</version>
</dependency>
</dependencies>
<configuration>
<port>8080</port>
</configuration>
<executions>
<execution>
<goals>
......
@@ -8,8 +8,7 @@
The components of Atlas can be grouped under the following major categories:
---+++ Core
This category contains the components that implement the core of Atlas functionality, including:
Atlas core includes the following components:
*Type System*: Atlas allows users to define a model for the metadata objects they want to manage. The model is composed
of definitions called ‘types’. Instances of ‘types’, called ‘entities’, represent the actual metadata objects that are
@@ -21,25 +20,18 @@ One key point to note is that the generic nature of the modelling in Atlas allow
define both technical metadata and business metadata. It is also possible to define rich relationships between the
two using features of Atlas.
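For illustration, here is a sketch of registering a new entity type through the REST API (the type and attribute names are made up, and the admin/admin credentials and port 21000 assume a default local setup):
<verbatim>
curl -u admin:admin -H 'Content-Type: application/json' \
     -X POST 'http://localhost:21000/api/atlas/v2/types/typedefs' \
     -d '{
           "entityDefs": [ {
             "name":          "sample_table",
             "superTypes":    [ "DataSet" ],
             "attributeDefs": [ { "name": "owner", "typeName": "string", "isOptional": true } ]
           } ]
         }'
</verbatim>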
*Graph Engine*: Internally, Atlas persists metadata objects it manages using a Graph model. This approach provides great
flexibility and enables efficient handling of rich relationships between the metadata objects. The graph engine component is
responsible for translating between types and entities of the Atlas type system, and the underlying graph persistence model.
In addition to managing the graph objects, the graph engine also creates the appropriate indices for the metadata
objects so that they can be searched efficiently. Atlas uses JanusGraph to store the metadata objects.
*Ingest / Export*: The Ingest component allows metadata to be added to Atlas. Similarly, the Export component exposes
metadata changes detected by Atlas as events. Consumers can subscribe to these change events to react to
metadata changes in real time.
*Graph Engine*: Internally, Atlas represents metadata objects it manages using a Graph model. It does this to
achieve great flexibility and rich relations between the metadata objects. The Graph Engine is a component that is
responsible for translating between types and entities of the Type System, and the underlying Graph model.
In addition to managing the Graph objects, The Graph Engine also creates the appropriate indices for the metadata
objects so that they can be searched for efficiently.
*Titan*: Currently, Atlas uses the Titan Graph Database to store the metadata objects. Titan is used as a library
within Atlas. Titan uses two stores: The Metadata store is configured to !HBase by default and the Index store
is configured to Solr. It is also possible to use the Metadata store as BerkeleyDB and Index store as !ElasticSearch
by building with corresponding profiles. The Metadata store is used for storing the metadata objects proper, and the
Index store is used for storing indices of the Metadata properties, which allows efficient search.
---+++ Integration
Users can manage metadata in Atlas using two methods:
*API*: All functionality of Atlas is exposed to end users via a REST API that allows types and entities to be created,
@@ -53,7 +45,6 @@ uses Apache Kafka as a notification server for communication between hooks and d
notification events. Events are written by the hooks and Atlas to different Kafka topics.
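For example, the change events published by Atlas can be inspected with the standard Kafka console consumer (a sketch assuming a local broker at localhost:9092; by default hooks write to the ATLAS_HOOK topic and Atlas publishes entity change events to the ATLAS_ENTITIES topic):
<verbatim>
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic ATLAS_ENTITIES --from-beginning
</verbatim>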
---+++ Metadata sources
Atlas supports integration with many sources of metadata out of the box. More integrations will be added in the future
as well. Currently, Atlas supports ingesting and managing metadata from the following sources:
@@ -61,6 +52,7 @@ as well. Currently, Atlas supports ingesting and managing metadata from the foll
* [[Bridge-Sqoop][Sqoop]]
* [[Bridge-Falcon][Falcon]]
* [[StormAtlasHook][Storm]]
* HBase - _documentation work-in-progress_
The integration implies two things:
There are metadata models that Atlas defines natively to represent objects of these components.
@@ -80,12 +72,6 @@ for the Hadoop ecosystem having wide integration with a variety of Hadoop compon
Ranger allows security administrators to define metadata-driven security policies for effective governance.
Ranger is a consumer to the metadata change events notified by Atlas.
*Business Taxonomy*: The metadata objects ingested into Atlas from the Metadata sources are primarily a form
of technical metadata. To enhance the discoverability and governance capabilities, Atlas comes with a Business
Taxonomy interface that allows users to first define a hierarchical set of business terms that represent their
business domain and then associate them with the metadata entities Atlas manages. Business Taxonomy is a web application that
is part of the Atlas Admin UI currently and integrates with Atlas using the REST API.
......
---+ Falcon Atlas Bridge
---++ Falcon Model
The default falcon modelling is available in org.apache.atlas.falcon.model.FalconDataModelGenerator. It defines the following types:
<verbatim>
falcon_cluster(ClassType) - super types [Infrastructure] - attributes [timestamp, colo, owner, tags]
falcon_feed(ClassType) - super types [DataSet] - attributes [timestamp, stored-in, owner, groups, tags]
falcon_feed_creation(ClassType) - super types [Process] - attributes [timestamp, stored-in, owner]
falcon_feed_replication(ClassType) - super types [Process] - attributes [timestamp, owner]
falcon_process(ClassType) - super types [Process] - attributes [timestamp, runs-on, owner, tags, pipelines, workflow-properties]
</verbatim>
The default falcon model includes the following types:
* Entity types:
* falcon_cluster
* super-types: Infrastructure
* attributes: timestamp, colo, owner, tags
* falcon_feed
* super-types: !DataSet
* attributes: timestamp, stored-in, owner, groups, tags
* falcon_feed_creation
* super-types: Process
* attributes: timestamp, stored-in, owner
* falcon_feed_replication
* super-types: Process
* attributes: timestamp, owner
* falcon_process
* super-types: Process
* attributes: timestamp, runs-on, owner, tags, pipelines, workflow-properties
One falcon_process entity is created for every cluster that the falcon process is defined for.
The entities are created and de-duped using the unique qualifiedName attribute. They provide a namespace and can be used for querying/lineage as well. The unique attributes are:
* falcon_process - <process name>@<cluster name>
* falcon_cluster - <cluster name>
* falcon_feed - <feed name>@<cluster name>
* falcon_feed_creation - <feed name>
* falcon_feed_replication - <feed name>
* falcon_process.qualifiedName - <process name>@<cluster name>
* falcon_cluster.qualifiedName - <cluster name>
* falcon_feed.qualifiedName - <feed name>@<cluster name>
* falcon_feed_creation.qualifiedName - <feed name>
* falcon_feed_replication.qualifiedName - <feed name>
---++ Falcon Hook
Falcon supports listeners on falcon entity submission. This is used to add entities in Atlas using the model defined in org.apache.atlas.falcon.model.FalconDataModelGenerator.
The hook submits the request to a thread pool executor to avoid blocking the command execution. The thread submits the entities as a message to the notification server, and the Atlas server reads these messages and registers the entities.
Falcon supports listeners on falcon entity submission. This is used to add entities in Atlas using the model detailed above.
Follow the instructions below to set up the Atlas hook in Falcon:
* Add 'org.apache.atlas.falcon.service.AtlasService' to application.services in <falcon-conf>/startup.properties
* Link falcon hook jars in falcon classpath - 'ln -s <atlas-home>/hook/falcon/* <falcon-home>/server/webapp/falcon/WEB-INF/lib/'
* Link Atlas hook jars in Falcon classpath - 'ln -s <atlas-home>/hook/falcon/* <falcon-home>/server/webapp/falcon/WEB-INF/lib/'
* In <falcon_conf>/falcon-env.sh, set an environment variable as follows:
<verbatim>
export FALCON_SERVER_OPTS="<atlas_home>/hook/falcon/*:$FALCON_SERVER_OPTS"
</verbatim>
export FALCON_SERVER_OPTS="<atlas_home>/hook/falcon/*:$FALCON_SERVER_OPTS"</verbatim>
The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details:
* atlas.hook.falcon.synchronous - boolean, true to run the hook synchronously. default false
* atlas.hook.falcon.numRetries - number of retries for notification failure. default 3
* atlas.hook.falcon.minThreads - core number of threads. default 5
* atlas.hook.falcon.maxThreads - maximum number of threads. default 5
* atlas.hook.falcon.synchronous - boolean, true to run the hook synchronously. default false
* atlas.hook.falcon.numRetries - number of retries for notification failure. default 3
* atlas.hook.falcon.minThreads - core number of threads. default 5
* atlas.hook.falcon.maxThreads - maximum number of threads. default 5
* atlas.hook.falcon.keepAliveTime - keep alive time in msecs. default 10
* atlas.hook.falcon.queueSize - queue size for the threadpool. default 10000
* atlas.hook.falcon.queueSize - queue size for the threadpool. default 10000
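For example, these properties can be set as follows (a sketch using the default values listed above; adjust to your environment):
<verbatim>
cat >> <atlas-conf>/atlas-application.properties <<'EOF'
atlas.hook.falcon.synchronous=false
atlas.hook.falcon.numRetries=3
atlas.hook.falcon.minThreads=5
atlas.hook.falcon.maxThreads=5
atlas.hook.falcon.keepAliveTime=10
atlas.hook.falcon.queueSize=10000
EOF
</verbatim>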
Refer to [[Configuration][Configuration]] for notification-related configuration
---++ Limitations
---++ NOTES
* In the falcon cluster entity, the cluster name used should be uniform across components like hive, falcon, sqoop etc. If used with Ambari, the Ambari cluster name should be used for the cluster entity
---+ Sqoop Atlas Bridge
---++ Sqoop Model
The default Sqoop modelling is available in org.apache.atlas.sqoop.model.SqoopDataModelGenerator. It defines the following types:
<verbatim>
sqoop_operation_type(EnumType) - values [IMPORT, EXPORT, EVAL]
sqoop_dbstore_usage(EnumType) - values [TABLE, QUERY, PROCEDURE, OTHER]
sqoop_process(ClassType) - super types [Process] - attributes [name, operation, dbStore, hiveTable, commandlineOpts, startTime, endTime, userName]
sqoop_dbdatastore(ClassType) - super types [DataSet] - attributes [name, dbStoreType, storeUse, storeUri, source, description, ownerName]
</verbatim>
The default sqoop model includes the following types:
* Entity types:
* sqoop_process
* super-types: Process
* attributes: name, operation, dbStore, hiveTable, commandlineOpts, startTime, endTime, userName
* sqoop_dbdatastore
* super-types: !DataSet
* attributes: name, dbStoreType, storeUse, storeUri, source, description, ownerName
* Enum types:
* sqoop_operation_type
* values: IMPORT, EXPORT, EVAL
* sqoop_dbstore_usage
* values: TABLE, QUERY, PROCEDURE, OTHER
The entities are created and de-duped using the unique qualifiedName attribute. They provide a namespace and can be used for querying as well:
sqoop_process - attribute name - sqoop-dbStoreType-storeUri-endTime
sqoop_dbdatastore - attribute name - dbStoreType-connectorUrl-source
* sqoop_process.qualifiedName - dbStoreType-storeUri-endTime
* sqoop_dbdatastore.qualifiedName - dbStoreType-storeUri-source
---++ Sqoop Hook
Sqoop added a !SqoopJobDataPublisher that publishes data to Atlas after completion of import Job. Today, only hiveImport is supported in sqoopHook.
This is used to add entities in Atlas using the model defined in org.apache.atlas.sqoop.model.SqoopDataModelGenerator.
Follow these instructions in your sqoop set-up to add sqoop hook for Atlas in <sqoop-conf>/sqoop-site.xml:
Sqoop added a !SqoopJobDataPublisher that publishes data to Atlas after completion of an import job. Today, only hiveImport is supported in !SqoopHook.
This is used to add entities in Atlas using the model detailed above.
Follow the instructions below to set up the Atlas hook in Sqoop:
* Sqoop Job publisher class. Currently only one publishing class is supported
Add the following properties to enable the Atlas hook in Sqoop:
* Set up the Atlas hook in <sqoop-conf>/sqoop-site.xml by adding the following:
<verbatim>
<property>
<name>sqoop.job.data.publish.class</name>
<value>org.apache.atlas.sqoop.hook.SqoopHook</value>
</property>
* Atlas cluster name
<property>
<name>atlas.cluster.name</name>
<value><clustername></value>
</property>
</property></verbatim>
* Copy <atlas-conf>/atlas-application.properties to the sqoop conf directory <sqoop-conf>/
* Link <atlas-home>/hook/sqoop/*.jar into the sqoop lib directory (see the sketch below)
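A sketch of these two steps (the sqoop lib location is an assumption; adjust to your installation):
<verbatim>
cp <atlas-conf>/atlas-application.properties <sqoop-conf>/
ln -s <atlas-home>/hook/sqoop/*.jar <sqoop-home>/lib/
</verbatim>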
Refer to [[Configuration][Configuration]] for notification-related configuration
---++ Limitations
---++ NOTES
* Only the following sqoop operations are captured by sqoop hook currently - hiveImport
@@ -157,9 +157,9 @@ At a high level the following points can be called out:
---++ Metadata Store
As described above, Atlas uses Titan to store the metadata it manages. By default, Atlas uses a standalone HBase
instance as the backing store for Titan. In order to provide HA for the metadata store, we recommend that Atlas be
configured to use distributed HBase as the backing store for Titan. Doing this implies that you could benefit from the
As described above, Atlas uses JanusGraph to store the metadata it manages. By default, Atlas uses a standalone HBase
instance as the backing store for JanusGraph. In order to provide HA for the metadata store, we recommend that Atlas be
configured to use distributed HBase as the backing store for JanusGraph. Doing this implies that you could benefit from the
HA guarantees HBase provides. In order to configure Atlas to use HBase in HA mode, do the following:
* Choose an existing HBase cluster that is set up in HA mode to configure in Atlas (OR) Set up a new HBase cluster in [[http://hbase.apache.org/book.html#quickstart_fully_distributed][HA mode]].
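For example, a sketch of pointing Atlas at the ZooKeeper quorum of such an HBase cluster in <atlas-conf>/atlas-application.properties (host names are illustrative, and the exact backend value may vary by Atlas/JanusGraph version):
<verbatim>
cat >> <atlas-conf>/atlas-application.properties <<'EOF'
atlas.graph.storage.backend=hbase
atlas.graph.storage.hostname=zk-host1:2181,zk-host2:2181,zk-host3:2181
EOF
</verbatim>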
@@ -169,8 +169,8 @@ HA guarantees HBase provides. In order to configure Atlas to use HBase in HA mod
---++ Index Store
As described above, Atlas indexes metadata through Titan to support full text search queries. In order to provide HA
for the index store, we recommend that Atlas be configured to use Solr as the backing index store for Titan. In order
As described above, Atlas indexes metadata through JanusGraph to support full text search queries. In order to provide HA
for the index store, we recommend that Atlas be configured to use Solr as the backing index store for JanusGraph. In order
to configure Atlas to use Solr in HA mode, do the following:
* Choose an existing !SolrCloud cluster setup in HA mode to configure in Atlas (OR) Set up a new [[https://cwiki.apache.org/confluence/display/solr/SolrCloud][SolrCloud cluster]].
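For example, a sketch of the SolrCloud-related settings in <atlas-conf>/atlas-application.properties (the ZooKeeper addresses are illustrative, and the exact backend value may vary by Atlas version):
<verbatim>
cat >> <atlas-conf>/atlas-application.properties <<'EOF'
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=zk-host1:2181,zk-host2:2181,zk-host3:2181
EOF
</verbatim>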
@@ -208,4 +208,4 @@ to configure Atlas to use Kafka in HA mode, do the following:
---++ Known Issues
* If the HBase region servers hosting the Atlas ‘titan’ HTable are down, Atlas would not be able to store or retrieve metadata from HBase until they are brought back online.
\ No newline at end of file
* If the HBase region servers hosting the Atlas table are down, Atlas would not be able to store or retrieve metadata from HBase until they are brought back online.
\ No newline at end of file
---+ Quick Start Guide
---+ Quick Start
---++ Introduction
This quick start user guide is a simple client that adds a few sample type definitions modeled
after the example as shown below. It also adds example entities along with traits as shown in the
instance graph below.
Quick start is a simple client that adds a few sample type definitions modeled after the example shown below.
It also adds sample entities along with traits as shown in the instance graph below.
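Assuming a running Atlas server, the quick start client can typically be invoked from the Atlas installation as shown below (the server URL is illustrative; the script may also prompt for credentials depending on the setup):
<verbatim>
cd <atlas-home>
bin/quick_start.py http://localhost:21000
</verbatim>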
---+++ Example Type Definitions
......
---+ Repository
---++ Introduction
@@ -7,39 +7,49 @@ Atlas is a scalable and extensible set of core foundational governance services
enterprises to effectively and efficiently meet their compliance requirements within Hadoop and
allows integration with the whole enterprise data ecosystem.
Apache Atlas provides open metadata management and governance capabilities for organizations
to build a catalog of their data assets, classify and govern these assets and provide collaboration
capabilities around these data assets for data scientists, analysts and the data governance team.
---++ Features
---+++ Data Classification
* Import or define business-oriented taxonomy annotations for data
* Define, annotate, and automate capture of relationships between data sets and underlying elements including source, target, and derivation processes
* Export metadata to third-party systems
---+++ Metadata types & instances
* Pre-defined types for various Hadoop and non-Hadoop metadata
* Ability to define new types for the metadata to be managed
* Types can have primitive attributes, complex attributes, object references; can inherit from other types
* Instances of types, called entities, capture metadata object details and their relationships
* REST APIs to work with types and instances allow easier integration
---+++ Classification
* Ability to dynamically create classifications - like PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE
* Classifications can include attributes - like expiry_date attribute in EXPIRES_ON classification
* Entities can be associated with multiple classifications, enabling easier discovery and security enforcement
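For example, a sketch of creating such a classification with an attribute via the REST API (mirroring the EXPIRES_ON example above; admin/admin credentials and port 21000 assume a default local setup):
<verbatim>
curl -u admin:admin -H 'Content-Type: application/json' \
     -X POST 'http://localhost:21000/api/atlas/v2/types/typedefs' \
     -d '{ "classificationDefs": [ {
             "name": "EXPIRES_ON",
             "attributeDefs": [ { "name": "expiry_date", "typeName": "date", "isOptional": false } ]
         } ] }'
</verbatim>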
---+++ Centralized Auditing
* Capture security access information for every application, process, and interaction with data
* Capture the operational information for execution, steps, and activities
---+++ Lineage
* Intuitive UI to view lineage of data as it moves through various processes
* REST APIs to access and update lineage
---+++ Search & Lineage (Browse)
* Pre-defined navigation paths to explore the data classification and audit information
* Text-based search features locate relevant data and audit events across the Data Lake quickly and accurately
* Browse visualization of data set lineage allowing users to drill down into operational, security, and provenance-related information
---+++ Search/Discovery
* Intuitive UI to search entities by type, classification, attribute value or free-text
* Rich REST APIs to search by complex criteria
* SQL-like query language to search entities - Domain Specific Language (DSL); see the example below
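For example, a sketch of a DSL search via the REST API (the type and attribute in the query are illustrative; admin/admin credentials and port 21000 assume a default local setup):
<verbatim>
curl -u admin:admin -G 'http://localhost:21000/api/atlas/v2/search/dsl' \
     --data-urlencode 'query=hive_table where name = "sales_fact"'
</verbatim>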
---+++ Security & Policy Engine
* Rationalize compliance policy at runtime based on data classification schemes, attributes and roles.
* Advanced definition of policies for preventing data derivation based on classification (i.e. re-identification) – Prohibitions
* Column and row level masking based on cell values and attributes.
---+++ Security & Data Masking
* Integration with Apache Ranger enables authorization/data-masking based on classifications associated with entities in Apache Atlas. For example:
* who can access data classified as PII, SENSITIVE
* customer-service users can only see last 4 digits of columns classified as NATIONAL_ID
---++ Getting Started
* [[InstallationSteps][Install Steps]]
* [[QuickStart][Quick Start Guide]]
* [[InstallationSteps][Build & Install]]
* [[QuickStart][Quick Start]]
---++ Documentation
* [[Architecture][High Level Architecture]]
* [[TypeSystem][Type System]]
* [[Repository][Metadata Repository]]
* [[Search][Search]]
* [[security][Security]]
* [[Authentication-Authorization][Authentication and Authorization]]
......
@@ -43,7 +43,7 @@ The properties for configuring service authentication are:
* <code>atlas.authentication.keytab</code> - the path to the keytab file.
* <code>atlas.authentication.principal</code> - the principal to use for authenticating to the KDC. The principal is generally of the form "user/host@realm". You may use the '_HOST' token for the hostname and the local hostname will be substituted in by the runtime (e.g. "Atlas/_HOST@EXAMPLE.COM").
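For example, a sketch of these settings in <atlas-conf>/atlas-application.properties (the keytab path and realm are illustrative):
<verbatim>
cat >> <atlas-conf>/atlas-application.properties <<'EOF'
atlas.authentication.keytab=/etc/security/keytabs/atlas.service.keytab
atlas.authentication.principal=atlas/_HOST@EXAMPLE.COM
EOF
</verbatim>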
Note that when Atlas is configured with HBase as the storage backend in a secure cluster, the graph db (titan) needs sufficient user permissions to be able to create and access an HBase table. To grant the appropriate permissions see [[Configuration][Graph persistence engine - Hbase]].
Note that when Atlas is configured with HBase as the storage backend in a secure cluster, the graph db (JanusGraph) needs sufficient user permissions to be able to create and access an HBase table. To grant the appropriate permissions see [[Configuration][Graph persistence engine - Hbase]].
---+++ JAAS configuration
......