---++ Building & Installing Apache Atlas

---+++ Building Apache Atlas
Download the Apache Atlas 2.0.0 release sources, apache-atlas-2.0.0-sources.tar.gz, from the [[http://atlas.apache.org/Downloads.html][downloads]] page.
Then follow the instructions below to build Apache Atlas.
<verbatim>
tar xvfz apache-atlas-2.0.0-sources.tar.gz
cd apache-atlas-sources-2.0.0/
export MAVEN_OPTS="-Xms2g -Xmx2g"
mvn clean -DskipTests install</verbatim>


---+++ Packaging Apache Atlas
To create an Apache Atlas package for deployment in an environment with functional Apache HBase and Apache Solr instances, build with the following command:

<verbatim>
mvn clean -DskipTests package -Pdist</verbatim>

   * NOTES:
      * Remove the '-DskipTests' option to run unit and integration tests.
      * To build a distribution without minified js and css files, build with the _skipMinify_ profile. By default, js and css files are minified.


The above will build Apache Atlas for an environment with functional HBase and Solr instances. Apache Atlas needs to be set up with the following to run in this environment (a configuration sketch follows the list):
   * Configure atlas.graph.storage.hostname (see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section).
   * Configure atlas.graph.index.search.solr.zookeeper-url (see "Graph Search Index - Solr" in the [[Configuration][Configuration]] section).
   * Set HBASE_CONF_DIR to point to a valid Apache HBase config directory (see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section).
   * Create indices in Apache Solr (see "Graph Search Index - Solr" in the [[Configuration][Configuration]] section).
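
For example (a sketch only; the hostnames and paths below are illustrative, not defaults), the first two settings go in ATLAS_HOME/conf/atlas-application.properties:
<verbatim>
# illustrative ZooKeeper hosts for the HBase cluster
atlas.graph.storage.hostname=zk1.example.com,zk2.example.com
# illustrative ZooKeeper quorum used by SolrCloud
atlas.graph.index.search.solr.zookeeper-url=zk1.example.com:2181,zk2.example.com:2181</verbatim>

and HBASE_CONF_DIR is exported in the environment, for example in conf/atlas-env.sh:
<verbatim>
# illustrative path to the Apache HBase config directory
export HBASE_CONF_DIR=/etc/hbase/conf</verbatim>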


---+++ Packaging Apache Atlas with embedded Apache HBase & Apache Solr
To create an Apache Atlas package that includes Apache HBase and Apache Solr, build with the embedded-hbase-solr profile as shown below:

<verbatim>
mvn clean -DskipTests package -Pdist,embedded-hbase-solr</verbatim>

Using the embedded-hbase-solr profile will configure Apache Atlas so that an Apache HBase instance and an Apache Solr instance will be started and stopped along with the Apache Atlas server.

NOTE: This distribution profile is only intended to be used for single-node development, not in production.

---+++ Packaging Apache Atlas with embedded Apache Cassandra & Apache Solr
To create an Apache Atlas package that includes Apache Cassandra and Apache Solr, build with the embedded-cassandra-solr profile as shown below:

<verbatim>
mvn clean package -Pdist,embedded-cassandra-solr</verbatim>

Using the embedded-cassandra-solr profile will configure Apache Atlas so that an Apache Cassandra instance and an Apache Solr instance will be started and stopped along with the Atlas server.

NOTE: This distribution profile is only intended to be used for single-node development, not in production.

---+++ Apache Atlas Package
The build will create the following files, which are used to install Apache Atlas.

<verbatim>
distro/target/apache-atlas-${project.version}-bin.tar.gz
distro/target/apache-atlas-${project.version}-hbase-hook.tar.gz
distro/target/apache-atlas-${project.version}-hive-hook.tar.gz
distro/target/apache-atlas-${project.version}-kafka-hook.tar.gz
distro/target/apache-atlas-${project.version}-sources.tar.gz
distro/target/apache-atlas-${project.version}-sqoop-hook.tar.gz
distro/target/apache-atlas-${project.version}-storm-hook.tar.gz</verbatim>

---+++ Installing & Running Apache Atlas

---++++ Installing Apache Atlas
From the directory where you would like Apache Atlas to be installed, run the following commands:
<verbatim>
tar -xzvf apache-atlas-${project.version}-server.tar.gz
cd atlas-${project.version}</verbatim>

---++++ Running Apache Atlas with Local Apache HBase & Apache Solr
To run Apache Atlas with local Apache HBase & Apache Solr instances that are started/stopped along with Atlas start/stop, run the following commands:
<verbatim>
export MANAGE_LOCAL_HBASE=true
export MANAGE_LOCAL_SOLR=true

bin/atlas_start.py</verbatim>

---++++ Using Apache Atlas
   * To verify that the Apache Atlas server is up and running, run the curl command as shown below:
<verbatim>
  curl -u username:password http://localhost:21000/api/atlas/admin/version

  {"Description":"Metadata Management and Data Governance Platform over Hadoop","Version":"2.0.0","Name":"apache-atlas"}</verbatim>

   * Run quick start to load the sample model and data
<verbatim>
  bin/quick_start.py
  Enter username for atlas :-
  Enter password for atlas :-
</verbatim>

   * Access Apache Atlas UI using a browser: http://localhost:21000

---++++ Stopping Apache Atlas Server
To stop Apache Atlas, run the following command:
<verbatim>
bin/atlas_stop.py</verbatim>


---+++ Configuring Apache Atlas
By default, the config directory used by Apache Atlas is _{package dir}/conf_. To override this, set the environment variable ATLAS_CONF to the path of the conf dir.
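
For example, assuming a custom conf directory at /etc/atlas/conf (an illustrative path):
<verbatim>
# point Apache Atlas at a custom conf directory
export ATLAS_CONF=/etc/atlas/conf</verbatim>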

Environment variables needed to run Apache Atlas can be set in the _atlas-env.sh_ file in the conf directory. This file is sourced by Apache Atlas scripts before any commands are executed. The following environment variables are available to set.

<verbatim>
# The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path
#export JAVA_HOME=

# any additional java opts you want to set. This will apply to both client and server operations
#export ATLAS_OPTS=

# any additional java opts that you want to set for client only
#export ATLAS_CLIENT_OPTS=

# java heap size we want to set for the client. Default is 1024MB
#export ATLAS_CLIENT_HEAP=

# any additional opts you want to set for atlas service.
#export ATLAS_SERVER_OPTS=

# java heap size we want to set for the atlas server. Default is 1024MB
#export ATLAS_SERVER_HEAP=

# What is considered the atlas home dir. Default is the base location of the installed software
#export ATLAS_HOME_DIR=

# Where log files are stored. Default is logs directory under the base install location
#export ATLAS_LOG_DIR=

# Where pid files are stored. Default is logs directory under the base install location
#export ATLAS_PID_DIR=

# Where do you want to expand the war file. By default it is in /server/webapp dir under the base install dir.
#export ATLAS_EXPANDED_WEBAPP_DIR=</verbatim>

*Settings to support a large number of metadata objects*

If you plan to store a large number of metadata objects, it is recommended that you use values tuned for better GC performance of the JVM.

The following values are common server side options:
<verbatim>
export ATLAS_SERVER_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps"</verbatim>

The =-XX:SoftRefLRUPolicyMSPerMB= option was found to be particularly helpful to regulate GC performance for query heavy workloads with many concurrent users.

The following values are recommended for JDK 8:
<verbatim>
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"</verbatim>

*NOTE for Mac OS users*
If you are using Mac OS, you will need to configure ATLAS_SERVER_OPTS (explained above).

In _{package dir}/conf/atlas-env.sh_ uncomment the following line:
<verbatim>
#export ATLAS_SERVER_OPTS=</verbatim>

and change it to look like the following:
<verbatim>
export ATLAS_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="</verbatim>

*Configuring Apache HBase as the storage backend for the Graph Repository*

By default, Apache Atlas uses JanusGraph as the graph repository; currently, this is the only graph repository implementation available. The Apache HBase versions currently supported are 1.1.x. For configuring Apache Atlas graph persistence on Apache HBase, please see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section for more details.

The Apache HBase tables used by Apache Atlas can be set using the following configurations:
<verbatim>
atlas.graph.storage.hbase.table=atlas
atlas.audit.hbase.tablename=apache_atlas_entity_audit</verbatim>
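
For example, once Apache Atlas has started against Apache HBase, a quick sanity check (assuming the hbase shell is on the PATH) is to list the tables and confirm the configured names exist:
<verbatim>
# list HBase tables; expect the table names configured above
echo "list" | hbase shell</verbatim>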

*Configuring Apache Solr as the indexing backend for the Graph Repository*

By default, Apache Atlas uses JanusGraph as the graph repository; currently, this is the only graph repository implementation available. For configuring JanusGraph to work with Apache Solr, please follow the instructions below.

   * Install Apache Solr if not already running. The version of Apache Solr supported is 5.5.1. It can be installed from http://archive.apache.org/dist/lucene/solr/5.5.1/solr-5.5.1.tgz

   * Start Apache Solr in cloud mode.
  !SolrCloud mode uses a !ZooKeeper Service as a highly available, central location for cluster management.
  For a small cluster, running with an existing !ZooKeeper quorum should be fine. For larger clusters, you would want to run a separate !ZooKeeper quorum with at least 3 servers.
  Note: Apache Atlas currently supports Apache Solr in "cloud" mode only. "http" mode is not supported. For more information, refer to the Apache Solr documentation - https://cwiki.apache.org/confluence/display/solr/SolrCloud

   * For example, to bring up an Apache Solr node listening on port 8983 on a machine, you can use the command:
      <verbatim>
      $SOLR_HOME/bin/solr start -c -z <zookeeper_host:port> -p 8983</verbatim>

   * Run the following commands from the SOLR_BIN directory (e.g. $SOLR_HOME/bin) to create collections in Apache Solr corresponding to the indexes that Apache Atlas uses. If the Apache Atlas and Apache Solr instances are on two different hosts, first copy the required configuration files from ATLAS_HOME/conf/solr on the Apache Atlas host to the Apache Solr host. SOLR_CONF in the commands below refers to the directory where the Apache Solr configuration files have been copied to on the Apache Solr host:

<verbatim>
  $SOLR_BIN/solr create -c vertex_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
  $SOLR_BIN/solr create -c edge_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
  $SOLR_BIN/solr create -c fulltext_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor</verbatim>

  Note: If numShards and replicationFactor are not specified, they default to 1, which suffices if you are trying out Solr with Atlas on a single-node instance.
  Otherwise, specify numShards according to the number of hosts in the Solr cluster and the maxShardsPerNode configuration.
  The number of shards cannot exceed the total number of Solr nodes in your !SolrCloud cluster.

  The number of replicas (replicationFactor) can be set according to the redundancy required.
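
  For example, on a two-node !SolrCloud cluster with redundancy, vertex_index might be created as follows (the shard and replica counts here are illustrative):
<verbatim>
  # 2 shards x 2 replicas; adjust to your cluster size
  $SOLR_BIN/solr create -c vertex_index -d SOLR_CONF -shards 2 -replicationFactor 2</verbatim>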

  Also note that Apache Solr will automatically be called to create the indexes when the Apache Atlas server is started, if the
  SOLR_BIN and SOLR_CONF environment variables are set and the search indexing backend is set to 'solr5'.
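
  For example (both paths are illustrative):
<verbatim>
  # directory containing the solr launcher script
  export SOLR_BIN=/opt/solr/bin
  # directory the Apache Atlas solr config files were copied to
  export SOLR_CONF=/opt/atlas/conf/solr</verbatim>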

   * Change the Apache Atlas configuration to point to the Apache Solr instance setup. Please make sure the following configurations are set to the below values in ATLAS_HOME/conf/atlas-application.properties:
<verbatim>
 atlas.graph.index.search.backend=solr
 atlas.graph.index.search.solr.mode=cloud
 atlas.graph.index.search.solr.zookeeper-url=<the ZK quorum setup for solr as comma separated value> eg: 10.1.6.4:2181,10.1.6.5:2181
 atlas.graph.index.search.solr.zookeeper-connect-timeout=<SolrCloud Zookeeper Connection Timeout>. Default value is 60000 ms
 atlas.graph.index.search.solr.zookeeper-session-timeout=<SolrCloud Zookeeper Session Timeout>. Default value is 60000 ms</verbatim>
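
For example, a filled-in configuration might look like the following (the ZooKeeper addresses are illustrative):
<verbatim>
 atlas.graph.index.search.backend=solr
 atlas.graph.index.search.solr.mode=cloud
 atlas.graph.index.search.solr.zookeeper-url=10.1.6.4:2181,10.1.6.5:2181
 atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
 atlas.graph.index.search.solr.zookeeper-session-timeout=60000</verbatim>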

For more information on JanusGraph Solr configuration, please refer to http://docs.janusgraph.org/0.2.0/solr.html

*Pre-requisites for running Apache Solr in cloud mode*
   * Memory - Apache Solr is both memory and CPU intensive. Make sure the server running Apache Solr has adequate memory, CPU and disk.
     Apache Solr works well with 32GB RAM. Plan to provide as much memory as possible to the Apache Solr process.
   * Disk - If the number of entities to be stored is large, plan to have at least 500 GB of free space in the volume where Apache Solr is going to store the index data.
   * !SolrCloud has support for replication and sharding. It is highly recommended to use !SolrCloud with at least two Apache Solr nodes running on different servers with replication enabled.
     If using !SolrCloud, then you also need !ZooKeeper installed and configured with 3 or 5 !ZooKeeper nodes.

*Configuring Elasticsearch as the indexing backend for the Graph Repository (Tech Preview)*

By default, Apache Atlas uses JanusGraph as the graph repository; currently, this is the only graph repository implementation available. For configuring JanusGraph to work with Elasticsearch, please follow the instructions below.

   * Install an Elasticsearch cluster. The version currently supported is 5.6.4, and can be acquired from: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.4.tar.gz

   * For simple testing, a single Elasticsearch node can be started using the 'elasticsearch' command in the bin directory of the Elasticsearch distribution.

   * Change the Apache Atlas configuration to point to the Elasticsearch instance setup. Please make sure the following configurations are set to the below values in ATLAS_HOME/conf/atlas-application.properties:
<verbatim>
 atlas.graph.index.search.backend=elasticsearch
 atlas.graph.index.search.hostname=<the hostname(s) of the Elasticsearch master nodes comma separated>
 atlas.graph.index.search.elasticsearch.client-only=true</verbatim>
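
For example, a filled-in configuration might look like the following (the hostnames are illustrative):
<verbatim>
 atlas.graph.index.search.backend=elasticsearch
 atlas.graph.index.search.hostname=es-master-1,es-master-2
 atlas.graph.index.search.elasticsearch.client-only=true</verbatim>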

For more information on JanusGraph configuration for Elasticsearch, please refer to http://docs.janusgraph.org/0.2.0/elasticsearch.html

*Configuring Kafka Topics*

Apache Atlas uses Apache Kafka to ingest metadata from other components at runtime. This is described in the [[Architecture][Architecture page]]
in more detail. Depending on the configuration of Apache Kafka, sometimes you might need to set up the topics explicitly before
using Apache Atlas. To do so, Apache Atlas provides a script =bin/atlas_kafka_setup.py= which can be run from the Apache Atlas server. In some
environments, the hooks might start getting used before the Apache Atlas server itself is set up. In such cases, the topics
can be created on the hosts where hooks are installed using a similar script =hook-bin/atlas_kafka_setup_hook.py=. Both these scripts
use the configuration in =atlas-application.properties= for setting up the topics. Please refer to the [[Configuration][Configuration page]]
for these details.
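
For example, to create the topics from the Apache Atlas server host (the script reads its settings from =atlas-application.properties=):
<verbatim>
bin/atlas_kafka_setup.py</verbatim>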

---++++ Setting up Apache Atlas
There are a few steps that set up dependencies of Apache Atlas. One such example is setting up the JanusGraph schema in the storage backend of choice. In a simple single-server setup, these are automatically set up with default configuration when the server first accesses these dependencies.

However, there are scenarios when we may want to run setup steps explicitly as one time operations. For example, in a multiple server scenario using [[HighAvailability][High Availability]], it is preferable to run setup steps from one of the server instances the first time, and then start the services.

To run these steps one time, execute the command =bin/atlas_start.py -setup= from a single Apache Atlas server instance.

However, the Apache Atlas server does take care of parallel executions of the setup steps. Also, running the setup steps multiple times is idempotent. Therefore, if one chooses to run the setup steps as part of server startup, for convenience, they should enable the configuration option =atlas.server.run.setup.on.start= by defining it with the value =true= in the =atlas-application.properties= file.
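
For example, in =atlas-application.properties=:
<verbatim>
atlas.server.run.setup.on.start=true</verbatim>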

---+++ Examples: calling Apache Atlas REST APIs
Here are a few examples of calling Apache Atlas REST APIs via the curl command.
   * List the types in the repository
<verbatim>
  curl -u username:password http://localhost:21000/api/atlas/v2/types/typedefs/headers
  [ {"guid":"fa421be8-c21b-4cf8-a226-fdde559ad598","name":"Referenceable","category":"ENTITY"},
    {"guid":"7f3f5712-521d-450d-9bb2-ba996b6f2a4e","name":"Asset","category":"ENTITY"},
    {"guid":"84b02fa0-e2f4-4cc4-8b24-d2371cd00375","name":"DataSet","category":"ENTITY"},
    {"guid":"f93975d5-5a5c-41da-ad9d-eb7c4f91a093","name":"Process","category":"ENTITY"},
    {"guid":"79dcd1f9-f350-4f7b-b706-5bab416f8206","name":"Infrastructure","category":"ENTITY"}
  ]</verbatim>

   * List the instances for a given type
<verbatim>
  curl -u username:password http://localhost:21000/api/atlas/v2/search/basic?typeName=hive_db
  {
    "queryType":"BASIC",
    "searchParameters":{
      "typeName":"hive_db",
      "excludeDeletedEntities":false,
      "includeClassificationAttributes":false,
      "includeSubTypes":true,
      "includeSubClassifications":true,
      "limit":100,
      "offset":0
    },
    "entities":[
      {
        "typeName":"hive_db",
        "guid":"5d900c19-094d-4681-8a86-4eb1d6ffbe89",
        "status":"ACTIVE",
        "displayText":"default",
        "classificationNames":[],
        "attributes":{
          "owner":"public",
          "createTime":null,
          "qualifiedName":"default@cl1",
          "name":"default",
          "description":"Default Hive database"
        }
      },
      {
        "typeName":"hive_db",
        "guid":"3a0b14b0-ab85-4b65-89f2-e418f3f7f77c",
        "status":"ACTIVE",
        "displayText":"finance",
        "classificationNames":[],
        "attributes":{
          "owner":"hive",
          "createTime":null,
          "qualifiedName":"finance@cl1",
          "name":"finance",
          "description":null
        }
      }
    ]
  }</verbatim>

   * Search for entities
<verbatim>
  curl -u username:password http://localhost:21000/api/atlas/v2/search/dsl?query=hive_db%20where%20name='default'
    {
      "queryType":"DSL",
      "queryText":"hive_db where name='default'",
      "entities":[
        {
          "typeName":"hive_db",
          "guid":"5d900c19-094d-4681-8a86-4eb1d6ffbe89",
          "status":"ACTIVE",
          "displayText":"default",
          "classificationNames":[],
          "attributes":{
            "owner":"public",
            "createTime":null,
            "qualifiedName":"default@cl1",
            "name":"default",
            "description":
            "Default Hive database"
          }
        }
      ]
    }</verbatim>


---+++ Troubleshooting

---++++ Setup issues
If the setup of the Apache Atlas service fails for any reason, the next run of setup (either by an explicit invocation of
=atlas_start.py -setup= or by enabling the configuration option =atlas.server.run.setup.on.start=) will fail with
a message such as =A previous setup run may not have completed cleanly.=. In such cases, you would need to manually
ensure the setup can run, and delete the !ZooKeeper node at =/apache_atlas/setup_in_progress= before attempting to
run setup again.
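
For example, the stale node can be removed with the !ZooKeeper CLI (a sketch; the ZooKeeper address and the zkCli.sh location are illustrative):
<verbatim>
# connect to the ZooKeeper ensemble used by Apache Atlas
$ZOOKEEPER_HOME/bin/zkCli.sh -server localhost:2181
# then, at the zkCli prompt:
delete /apache_atlas/setup_in_progress</verbatim>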

If the setup failed due to Apache HBase schema setup errors, it may be necessary to repair the Apache HBase schema. If no
data has been stored, one can also disable and drop the Apache HBase tables used by Apache Atlas and run setup again.
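
For example, assuming the table names configured earlier on this page and that no data needs to be preserved, the tables can be disabled and dropped from the hbase shell:
<verbatim>
disable 'atlas'
drop 'atlas'
disable 'apache_atlas_entity_audit'
drop 'apache_atlas_entity_audit'</verbatim>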