---++ Building & Installing Apache Atlas

---+++ Building Atlas
<verbatim>
git clone https://git-wip-us.apache.org/repos/asf/atlas.git atlas
cd atlas
export MAVEN_OPTS="-Xms2g -Xmx2g"
mvn clean -DskipTests install</verbatim>

---+++ Packaging Atlas
To create an Apache Atlas package for deployment in an environment with functional HBase and Solr instances, build with the following command:

<verbatim>
mvn clean -DskipTests package -Pdist</verbatim>

   * NOTES:
      * Remove option '-DskipTests' to run unit and integration tests
      * To build a distribution without minified js and css files, build with the skipMinify profile. By default, js and css files are minified.


The above will build Atlas for an environment with functional HBase and Solr instances. To run in this environment, Atlas needs to be set up with the following (a sample configuration is sketched after this list):
   * Configure atlas.graph.storage.hostname (see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section).
   * Configure atlas.graph.index.search.solr.zookeeper-url (see "Graph Search Index - Solr" in the [[Configuration][Configuration]] section).
   * Set HBASE_CONF_DIR to point to a valid HBase config directory (see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section).
   * Create the SOLR indices (see "Graph Search Index - Solr" in the [[Configuration][Configuration]] section).
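As an illustration, the corresponding entries could look like the following; the hostnames, ports and paths below are placeholders for your environment, not values shipped with Atlas:
<verbatim>
# conf/atlas-application.properties (placeholder hosts and ports)
atlas.graph.storage.hostname=zk-hbase1.example.com,zk-hbase2.example.com
atlas.graph.index.search.solr.zookeeper-url=zk-solr1.example.com:2181,zk-solr2.example.com:2181

# conf/atlas-env.sh (placeholder path)
export HBASE_CONF_DIR=/etc/hbase/conf</verbatim>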


---+++ Packaging Atlas with Embedded HBase & Solr
To create an Apache Atlas package that includes HBase and Solr, build with the embedded-hbase-solr profile as shown below:

<verbatim>
mvn clean -DskipTests package -Pdist,embedded-hbase-solr</verbatim>

Using the embedded-hbase-solr profile will configure Atlas so that an HBase instance and a Solr instance will be started and stopped along with the Atlas server by default.

---+++ Packaging Atlas with Embedded Cassandra & Solr
To create an Apache Atlas package that includes Cassandra and Solr, build with the embedded-cassandra-solr profile as shown below:

<verbatim>
mvn clean package -Pdist,embedded-cassandra-solr</verbatim>

Using the embedded-cassandra-solr profile will configure Atlas so that an embedded Cassandra instance and a Solr instance will be started and stopped along with the Atlas server by default.

NOTE: This distribution profile is intended only for single-node development, not for production use.

---+++ Apache Atlas Package
The build will create the following files, which are used to install Apache Atlas.

<verbatim>
distro/target/apache-atlas-${project.version}-bin.tar.gz
distro/target/apache-atlas-${project.version}-hive-hook.gz
distro/target/apache-atlas-${project.version}-hbase-hook.tar.gz
distro/target/apache-atlas-${project.version}-sqoop-hook.tar.gz
distro/target/apache-atlas-${project.version}-storm-hook.tar.gz
distro/target/apache-atlas-${project.version}-falcon-hook.tar.gz
distro/target/apache-atlas-${project.version}-sources.tar.gz</verbatim>

---+++ Installing & Running Atlas

---++++ Installing Atlas
<verbatim>
tar -xzvf apache-atlas-${project.version}-bin.tar.gz

cd atlas-${project.version}</verbatim>

---++++ Configuring Atlas
By default, the config directory used by Atlas is {package dir}/conf. To override this, set the environment variable ATLAS_CONF to the path of the conf dir.
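For example, with an illustrative config location:
<verbatim>
export ATLAS_CONF=/etc/atlas/conf</verbatim>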

Environment variables needed to run Atlas can be set in the atlas-env.sh file in the conf directory. This file is sourced by Atlas scripts before any commands are executed. The following environment variables are available to set.

<verbatim>
# The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path
#export JAVA_HOME=

# any additional java opts you want to set. This will apply to both client and server operations
#export ATLAS_OPTS=

# any additional java opts that you want to set for client only
#export ATLAS_CLIENT_OPTS=

# java heap size we want to set for the client. Default is 1024MB
#export ATLAS_CLIENT_HEAP=

# any additional opts you want to set for atlas service.
#export ATLAS_SERVER_OPTS=

# java heap size we want to set for the atlas server. Default is 1024MB
#export ATLAS_SERVER_HEAP=

# What is considered as atlas home dir. Default is the base location of the installed software
#export ATLAS_HOME_DIR=

# Where log files are stored. Default is logs directory under the base install location
#export ATLAS_LOG_DIR=

# Where pid files are stored. Default is logs directory under the base install location
#export ATLAS_PID_DIR=

# Where do you want to expand the war file. By default it is in /server/webapp dir under the base install dir.
#export ATLAS_EXPANDED_WEBAPP_DIR=</verbatim>

*Settings to support a large number of metadata objects*

If you plan to store a large number of metadata objects, it is recommended that you use values tuned for better GC performance of the JVM.

The following values are common server side options:
<verbatim>
export ATLAS_SERVER_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps"</verbatim>

The =-XX:SoftRefLRUPolicyMSPerMB= option was found to be particularly helpful to regulate GC performance for query heavy workloads with many concurrent users.

The following values are recommended for JDK 8:
<verbatim>
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"</verbatim>

*NOTE for Mac OS users*
If you are using Mac OS, you will need to configure ATLAS_SERVER_OPTS (explained above).

In {package dir}/conf/atlas-env.sh, uncomment the following line
<verbatim>
#export ATLAS_SERVER_OPTS=</verbatim>

and change it to look as below
<verbatim>
export ATLAS_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="</verbatim>

*HBase as the Storage Backend for the Graph Repository*

By default, Atlas uses JanusGraph as the graph repository; it is currently the only graph repository implementation available. The HBase versions currently supported are 1.1.x. For configuring Atlas graph persistence on HBase, please see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section for more details.

HBase tables used by Atlas can be set using the following configurations:
<verbatim>
atlas.graph.storage.hbase.table=atlas
atlas.audit.hbase.tablename=apache_atlas_entity_audit</verbatim>

*Configuring SOLR as the Indexing Backend for the Graph Repository*

By default, Atlas uses JanusGraph as the graph repository; it is currently the only graph repository implementation available. For configuring JanusGraph to work with Solr, please follow the instructions below.

   * Install Solr if not already running. The version of Solr supported is 5.5.1. It can be installed (for example, as shown below) from http://archive.apache.org/dist/lucene/solr/5.5.1/solr-5.5.1.tgz
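  A minimal sketch of downloading and extracting that release; the install location below is an illustrative choice, not a requirement:
      <verbatim>
      wget http://archive.apache.org/dist/lucene/solr/5.5.1/solr-5.5.1.tgz
      tar -xzf solr-5.5.1.tgz -C /opt
      export SOLR_HOME=/opt/solr-5.5.1
      </verbatim>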

   * Start solr in cloud mode.
  !SolrCloud mode uses a !ZooKeeper Service as a highly available, central location for cluster management.
  For a small cluster, running with an existing !ZooKeeper quorum should be fine. For larger clusters, you would want to run a separate !ZooKeeper quorum with at least 3 servers.
  Note: Atlas currently supports Solr in "cloud" mode only. "http" mode is not supported. For more information, refer to the Solr documentation - https://cwiki.apache.org/confluence/display/solr/SolrCloud

   * For example, to bring up a Solr node listening on port 8983 on a machine, you can use the command:
      <verbatim>
      $SOLR_HOME/bin/solr start -c -z <zookeeper_host:port> -p 8983
      </verbatim>

   * Run the following commands from the SOLR_BIN (e.g. $SOLR_HOME/bin) directory to create collections in Solr corresponding to the indexes that Atlas uses. If the Atlas and Solr instances are on two different hosts, first copy the required configuration files from ATLAS_HOME/conf/solr on the Atlas host to the Solr host. SOLR_CONF in the commands below refers to the directory where the Solr configuration files have been copied to on the Solr host:

<verbatim>
  $SOLR_BIN/solr create -c vertex_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
  $SOLR_BIN/solr create -c edge_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
  $SOLR_BIN/solr create -c fulltext_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor</verbatim>

  Note: If numShards and replicationFactor are not specified, they default to 1, which suffices if you are trying out Solr with Atlas on a single node instance.
  Otherwise specify numShards according to the number of hosts that are in the Solr cluster and the maxShardsPerNode configuration.
  The number of shards cannot exceed the total number of Solr nodes in your !SolrCloud cluster.

  The number of replicas (replicationFactor) can be set according to the redundancy required.

  Also note that solr will automatically be called to create the indexes when the Atlas server is started if the
  SOLR_BIN and SOLR_CONF environment variables are set and the search indexing backend is set to 'solr5'.
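  As an illustrative sketch, these variables could be exported in conf/atlas-env.sh (which is sourced by the Atlas scripts); the paths below are placeholders:
<verbatim>
  export SOLR_BIN=/opt/solr-5.5.1/bin
  export SOLR_CONF=/opt/atlas/conf/solr</verbatim>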

   * Change the Atlas configuration to point to the Solr instance setup. Please make sure the following configurations are set to the values below in ATLAS_HOME/conf/atlas-application.properties
<verbatim>
 atlas.graph.index.search.backend=solr5
 atlas.graph.index.search.solr.mode=cloud
 atlas.graph.index.search.solr.zookeeper-url=<the ZK quorum setup for solr as comma separated value> eg: 10.1.6.4:2181,10.1.6.5:2181
 atlas.graph.index.search.solr.zookeeper-connect-timeout=<SolrCloud Zookeeper Connection Timeout>. Default value is 60000 ms
 atlas.graph.index.search.solr.zookeeper-session-timeout=<SolrCloud Zookeeper Session Timeout>. Default value is 60000 ms</verbatim>

   * Restart Atlas
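  For example, from the Atlas installation directory:
<verbatim>
  bin/atlas_stop.py
  bin/atlas_start.py</verbatim>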

For more information on JanusGraph Solr configuration, please refer to http://docs.janusgraph.org/0.2.0/solr.html

*Pre-requisites for running Solr in cloud mode*
  * Memory - Solr is both memory and CPU intensive. Make sure the server running Solr has adequate memory, CPU and disk.
    Solr works well with 32GB RAM. Plan to provide as much memory as possible to the Solr process
   * Disk - If the number of entities that need to be stored is large, plan to have at least 500 GB of free space in the volume where Solr is going to store the index data
  * !SolrCloud has support for replication and sharding. It is highly recommended to use !SolrCloud with at least two Solr nodes running on different servers with replication enabled.
    If using !SolrCloud, then you also need !ZooKeeper installed and configured with 3 or 5 !ZooKeeper nodes

*Configuring Kafka Topics*

Atlas uses Kafka to ingest metadata from other components at runtime. This is described in the [[Architecture][Architecture page]]
in more detail. Depending on the configuration of Kafka, sometimes you might need to set up the topics explicitly before
using Atlas. To do so, Atlas provides a script =bin/atlas_kafka_setup.py= which can be run from the Atlas server. In some
environments, the hooks might start getting used before the Atlas server itself is set up. In such cases, the topics
can be created on the hosts where the hooks are installed using a similar script =hook-bin/atlas_kafka_setup_hook.py=. Both of these
use the configuration in =atlas-application.properties= for setting up the topics. Please refer to the [[Configuration][Configuration page]]
for these details.
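For example, assuming the scripts take no arguments and read the Kafka connection details from =atlas-application.properties= as described above:
<verbatim>
# on the Atlas server host
bin/atlas_kafka_setup.py

# on a host where only the hooks are installed
hook-bin/atlas_kafka_setup_hook.py</verbatim>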

---++++ Setting up Atlas
There are a few steps that set up dependencies of Atlas. One such example is setting up the JanusGraph schema in the storage backend of choice. In a simple single server setup, these are automatically set up with default configuration when the server first accesses these dependencies.

However, there are scenarios when we may want to run setup steps explicitly as one time operations. For example, in a multiple server scenario using [[HighAvailability][High Availability]], it is preferable to run setup steps from one of the server instances the first time, and then start the services.

To run these steps one time, execute the command =bin/atlas_start.py -setup= from a single Atlas server instance.

However, the Atlas server does take care of parallel executions of the setup steps. Also, running the setup steps multiple times is idempotent. Therefore, if one chooses to run the setup steps as part of server startup, for convenience, this can be done by enabling the configuration option =atlas.server.run.setup.on.start=, defining it with the value =true= in the =atlas-application.properties= file, as shown below.
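For example, the entry in =atlas-application.properties= would be:
<verbatim>
atlas.server.run.setup.on.start=true</verbatim>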

---++++ Starting Atlas Server
<verbatim>
bin/atlas_start.py [-port <port>]</verbatim>

---+++ Using Atlas
   * Verify that the server is up and running
<verbatim>
  curl -v -u username:password http://localhost:21000/api/atlas/admin/version
  {"Version":"v0.1"}</verbatim>

   * Access Atlas UI using a browser: http://localhost:21000

   * Run quick start to load sample model and data
<verbatim>
  bin/quick_start.py [<atlas endpoint>]</verbatim>

   * List the types in the repository
<verbatim>
  curl -v -u username:password http://localhost:21000/api/atlas/v2/types/typedefs/headers
  [ {"guid":"fa421be8-c21b-4cf8-a226-fdde559ad598","name":"Referenceable","category":"ENTITY"},
    {"guid":"7f3f5712-521d-450d-9bb2-ba996b6f2a4e","name":"Asset","category":"ENTITY"},
    {"guid":"84b02fa0-e2f4-4cc4-8b24-d2371cd00375","name":"DataSet","category":"ENTITY"},
    {"guid":"f93975d5-5a5c-41da-ad9d-eb7c4f91a093","name":"Process","category":"ENTITY"},
    {"guid":"79dcd1f9-f350-4f7b-b706-5bab416f8206","name":"Infrastructure","category":"ENTITY"}
  ]</verbatim>

   * List the instances for a given type
<verbatim>
  curl -v -u username:password http://localhost:21000/api/atlas/v2/search/basic?typeName=hive_db
  {
    "queryType":"BASIC",
    "searchParameters":{
      "typeName":"hive_db",
      "excludeDeletedEntities":false,
      "includeClassificationAttributes":false,
      "includeSubTypes":true,
      "includeSubClassifications":true,
      "limit":100,
      "offset":0
    },
    "entities":[
      {
        "typeName":"hive_db",
        "guid":"5d900c19-094d-4681-8a86-4eb1d6ffbe89",
        "status":"ACTIVE",
        "displayText":"default",
        "classificationNames":[],
        "attributes":{
          "owner":"public",
          "createTime":null,
          "qualifiedName":"default@cl1",
          "name":"default",
          "description":"Default Hive database"
        }
      },
      {
        "typeName":"hive_db",
        "guid":"3a0b14b0-ab85-4b65-89f2-e418f3f7f77c",
        "status":"ACTIVE",
        "displayText":"finance",
        "classificationNames":[],
        "attributes":{
          "owner":"hive",
          "createTime":null,
          "qualifiedName":"finance@cl1",
          "name":"finance",
          "description":null
        }
      }
    ]
  }</verbatim>

   * Search for entities
<verbatim>
  curl -v -u username:password http://localhost:21000/api/atlas/v2/search/dsl?query=hive_db%20where%20name='default'
    {
      "queryType":"DSL",
      "queryText":"hive_db where name='default'",
      "entities":[
        {
          "typeName":"hive_db",
          "guid":"5d900c19-094d-4681-8a86-4eb1d6ffbe89",
          "status":"ACTIVE",
          "displayText":"default",
          "classificationNames":[],
          "attributes":{
            "owner":"public",
            "createTime":null,
            "qualifiedName":"default@cl1",
            "name":"default",
            "description":
            "Default Hive database"
          }
        }
      ]
    }</verbatim>


---+++ Stopping Atlas Server
<verbatim>
bin/atlas_stop.py</verbatim>

---+++ Troubleshooting

---++++ Setup issues
If the setup of Atlas service fails due to any reason, the next run of setup (either by an explicit invocation of
=atlas_start.py -setup= or by enabling the configuration option =atlas.server.run.setup.on.start=) will fail with
a message such as =A previous setup run may not have completed cleanly.=. In such cases, you would need to manually
ensure the setup can run and delete the Zookeeper node at =/apache_atlas/setup_in_progress= before attempting to
run setup again.
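As an illustration, the stale node could be removed with the !ZooKeeper CLI; the host, port and installation path below are placeholders:
<verbatim>
# connect to the ZooKeeper ensemble used by Atlas
$ZOOKEEPER_HOME/bin/zkCli.sh -server zk-host.example.com:2181

# then, at the zk prompt, remove the marker node
delete /apache_atlas/setup_in_progress</verbatim>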

If the setup failed due to HBase JanusGraph schema setup errors, it may be necessary to repair the HBase schema. If no
data has been stored, one can also disable and drop the HBase tables used by Atlas and run setup again.
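For example, assuming the default table names shown earlier (=atlas= and =apache_atlas_entity_audit=), the tables could be dropped from the HBase shell as below; do this only if no Atlas data needs to be preserved:
<verbatim>
# from the HBase shell (table names as configured in atlas-application.properties)
disable 'atlas'
drop 'atlas'
disable 'apache_atlas_entity_audit'
drop 'apache_atlas_entity_audit'</verbatim>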