---++ Building & Installing Apache Atlas

---+++ Building Atlas

<verbatim>
git clone https://git-wip-us.apache.org/repos/asf/incubator-atlas.git atlas

cd atlas

export MAVEN_OPTS="-Xmx1536m -XX:MaxPermSize=512m" && mvn clean install
</verbatim>

Once the build successfully completes, artifacts can be packaged for deployment.

<verbatim>

mvn clean package -Pdist

</verbatim>

NOTE:
1. Use option '-DskipTests' to skip running unit and integration tests
2. Use option '-P perf' to instrument atlas to collect performance metrics
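
These options can be combined with the packaging command above; for example, to package the distribution while skipping tests:
<verbatim>
mvn clean package -Pdist -DskipTests
</verbatim>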

To build a distribution that configures Atlas for external HBase and Solr, build with the external-hbase-solr profile.

<verbatim>

mvn clean package -Pdist,external-hbase-solr

</verbatim>

Note that when the external-hbase-solr profile is used, the following steps need to be completed to make Atlas functional (a configuration sketch follows this list).
   * Configure atlas.graph.storage.hostname (see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section).
   * Configure atlas.graph.index.search.solr.zookeeper-url (see "Graph Search Index - Solr" in the [[Configuration][Configuration]] section).
   * Set HBASE_CONF_DIR to point to a valid HBase config directory (see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section).
   * Create the SOLR indices (see "Graph Search Index - Solr" in the [[Configuration][Configuration]] section).
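
A minimal sketch of these settings, with placeholder host names and paths (adjust to your environment):
<verbatim>
# in conf/atlas-application.properties
atlas.graph.storage.hostname=<comma separated list of HBase ZooKeeper hosts>
atlas.graph.index.search.solr.zookeeper-url=<Solr ZooKeeper quorum, e.g. 10.1.6.4:2181,10.1.6.5:2181>

# in the environment or conf/atlas-env.sh (path is a placeholder)
export HBASE_CONF_DIR=/etc/hbase/conf
</verbatim>
The Solr indices themselves are created with the =solr create= commands described in the "Configuring SOLR" section below.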

To build a distribution that packages HBase and Solr, build with the embedded-hbase-solr profile.

<verbatim>

mvn clean package -Pdist,embedded-hbase-solr

</verbatim>

Using the embedded-hbase-solr profile will configure Atlas so that an HBase instance and a Solr instance will be started
and stopped along with the Atlas server by default.

Atlas also supports building a distribution that can use BerkeleyDB and Elasticsearch as the graph and index backends.
To build a distribution that is configured for these backends, build with the berkeley-elasticsearch profile.

<verbatim>

mvn clean package -Pdist,berkeley-elasticsearch

</verbatim>

An additional step is required for the binary built using this profile to be used along with the Atlas distribution.
Due to licensing requirements, Atlas does not bundle the BerkeleyDB Java Edition in the tarball.

You can download the Berkeley DB jar file from the URL: <verbatim>http://download.oracle.com/otn/berkeley-db/je-5.0.73.zip</verbatim>
and copy the je-5.0.73.jar to the ${atlas_home}/libext directory.
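
A sketch of this manual step, assuming the downloaded archive unpacks to a je-5.0.73 directory with the jar under lib/ (the layout may differ depending on how you obtain the jar):
<verbatim>
unzip je-5.0.73.zip
# create libext if it does not already exist
mkdir -p ${atlas_home}/libext
cp je-5.0.73/lib/je-5.0.73.jar ${atlas_home}/libext/
</verbatim>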

The tarball can be found in atlas/distro/target/apache-atlas-${project.version}-bin.tar.gz

The tarball is structured as follows

<verbatim>

|- bin
   |- atlas_start.py
   |- atlas_stop.py
   |- atlas_config.py
   |- quick_start.py
   |- cputil.py
|- conf
   |- atlas-application.properties
   |- atlas-env.sh
   |- hbase
      |- hbase-site.xml.template
   |- log4j.xml
   |- solr
      |- currency.xml
      |- lang
         |- stopwords_en.txt
      |- protwords.txt
      |- schema.xml
      |- solrconfig.xml
      |- stopwords.txt
      |- synonyms.txt
|- docs
|- hbase
   |- bin
   |- conf
   ...
|- server
   |- webapp
      |- atlas.war
|- solr
   |- bin
   ...
|- README
|- NOTICE
|- LICENSE
|- DISCLAIMER.txt
|- CHANGES.txt

</verbatim>

Note that if the embedded-hbase-solr profile is specified for the build, then HBase and Solr are included in the
distribution.

In this case, a standalone instance of HBase can be started as the default storage backend for the graph repository.
During Atlas installation, conf/hbase/hbase-site.xml.template gets expanded and moved to hbase/conf/hbase-site.xml
for the initial standalone HBase configuration. To configure Atlas graph persistence for a different HBase instance,
please see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section.

Also, a standalone instance of Solr can be started as the default search indexing backend. To configure Atlas search
indexing for a different Solr instance, please see "Graph Search Index - Solr" in the
[[Configuration][Configuration]] section.

---+++ Installing & Running Atlas

---++++ Installing Atlas
<verbatim>
tar -xzvf apache-atlas-${project.version}-bin.tar.gz

cd atlas-${project.version}
</verbatim>

---++++ Configuring Atlas

By default, the config directory used by Atlas is {package dir}/conf. To override this, set the environment
variable ATLAS_CONF to the path of the conf dir.
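
For example (the path is a placeholder for wherever you keep the Atlas configuration):
<verbatim>
export ATLAS_CONF=/etc/atlas/conf
</verbatim>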

atlas-env.sh has been added to the Atlas conf. This file can be used to set various environment
variables that you need for your services. In addition, you can set any other environment
variables you might need. This file will be sourced by the Atlas scripts before any commands are
executed. The following environment variables are available to set.

<verbatim>
# The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path
#export JAVA_HOME=

# any additional java opts you want to set. This will apply to both client and server operations
#export ATLAS_OPTS=

# any additional java opts that you want to set for client only
#export ATLAS_CLIENT_OPTS=

# java heap size we want to set for the client. Default is 1024MB
#export ATLAS_CLIENT_HEAP=

# any additional opts you want to set for atlas service.
#export ATLAS_SERVER_OPTS=

# java heap size we want to set for the atlas server. Default is 1024MB
#export ATLAS_SERVER_HEAP=

# What is considered as atlas home dir. Default is the base location of the installed software
#export ATLAS_HOME_DIR=

# Where log files are stored. Default is logs directory under the base install location
#export ATLAS_LOG_DIR=

# Where pid files are stored. Default is logs directory under the base install location
#export ATLAS_PID_DIR=

# Where the atlas titan db data is stored. Default is logs/data directory under the base install location
#export ATLAS_DATA_DIR=

# Where do you want to expand the war file. By default it is in /server/webapp dir under the base install dir.
#export ATLAS_EXPANDED_WEBAPP_DIR=
</verbatim>

*Settings to support a large number of metadata objects*

If you plan to store several tens of thousands of metadata objects, it is recommended that you use values
tuned for better GC performance of the JVM.

The following values are common server side options:
<verbatim>
export ATLAS_SERVER_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps"
</verbatim>

The =-XX:SoftRefLRUPolicyMSPerMB= option was found to be particularly helpful to regulate GC performance for
query heavy workloads with many concurrent users.

The following values are recommended for JDK 7:
<verbatim>
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=3072m -XX:PermSize=100M -XX:MaxPermSize=512m"
</verbatim>

The following values are recommended for JDK 8:
<verbatim>
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"
</verbatim>

*NOTE for Mac OS users*
If you are using Mac OS, you will need to configure ATLAS_SERVER_OPTS (explained above).

In {package dir}/conf/atlas-env.sh uncomment the following line
<verbatim>
#export ATLAS_SERVER_OPTS=
</verbatim>

and change it to look as below
<verbatim>
export ATLAS_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
</verbatim>

*Hbase as the Storage Backend for the Graph Repository*

By default, Atlas uses Titan as the graph repository; it is currently the only graph repository implementation available.
The HBase versions currently supported are 1.1.x. For configuring Atlas graph persistence on HBase, please see "Graph persistence engine - HBase" in the [[Configuration][Configuration]] section
for more details.

Pre-requisites for running HBase as a distributed cluster
   * 3 or 5 !ZooKeeper nodes
   * At least 3 !RegionServer nodes. It would be ideal to run the !DataNodes on the same hosts as the Region servers for data locality.

The HBase table names used by the Titan graph store and the entity audit store can be set using the following configurations in ATLAS_HOME/conf/atlas-application.properties:
<verbatim>
atlas.graph.storage.hbase.table=apache_atlas_titan
atlas.audit.hbase.tablename=apache_atlas_entity_audit
</verbatim>

*Configuring SOLR as the Indexing Backend for the Graph Repository*

By default, Atlas uses Titan as the graph repository; it is currently the only graph repository implementation available.
For configuring Titan to work with Solr, please follow the instructions below.

   * Install Solr if it is not already running. The version of Solr supported is 5.2.1. It can be installed from http://archive.apache.org/dist/lucene/solr/5.2.1/solr-5.2.1.tgz

   * Start Solr in cloud mode.
  !SolrCloud mode uses a !ZooKeeper service as a highly available, central location for cluster management.
  For a small cluster, running with an existing !ZooKeeper quorum should be fine. For larger clusters, you would want to run a separate !ZooKeeper quorum with at least 3 servers.
  Note: Atlas currently supports Solr in "cloud" mode only. "http" mode is not supported. For more information, refer to the Solr documentation - https://cwiki.apache.org/confluence/display/solr/SolrCloud

   * For example, to bring up a Solr node listening on port 8983 on a machine, you can use the command:
      <verbatim>
      $SOLR_HOME/bin/solr start -c -z <zookeeper_host:port> -p 8983
      </verbatim>

   * Run the following commands from the SOLR_BIN (e.g. $SOLR_HOME/bin) directory to create collections in Solr corresponding to the indexes that Atlas uses. If the Atlas and Solr instances are on two different hosts,
  first copy the required configuration files from ATLAS_HOME/conf/solr on the Atlas host to the Solr host. SOLR_CONF in the commands below refers to the directory where the Solr configuration files
  have been copied to on the Solr host:

<verbatim>
  $SOLR_BIN/solr create -c vertex_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
  $SOLR_BIN/solr create -c edge_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
  $SOLR_BIN/solr create -c fulltext_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
</verbatim>

  Note: If numShards and replicationFactor are not specified, they default to 1, which suffices if you are trying out Solr with Atlas on a single node instance.
  Otherwise, specify numShards according to the number of hosts that are in the Solr cluster and the maxShardsPerNode configuration.
  The number of shards cannot exceed the total number of Solr nodes in your !SolrCloud cluster.

  The number of replicas (replicationFactor) can be set according to the redundancy required.
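
  As an illustration only, on a two-node !SolrCloud cluster you might create each of the three collections with values like these (the counts are placeholders; choose them to match your cluster):
<verbatim>
  $SOLR_BIN/solr create -c vertex_index -d SOLR_CONF -shards 2 -replicationFactor 2
</verbatim>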

  Also note that solr will automatically be called to create the indexes when the Atlas server is started if the
  SOLR_BIN and SOLR_CONF environment variables are set and the search indexing backend is set to 'solr5'.

   * Change the Atlas configuration to point to the Solr instance setup. Please make sure the following configurations are set to the below values in ATLAS_HOME/conf/atlas-application.properties
<verbatim>
 atlas.graph.index.search.backend=solr5
 atlas.graph.index.search.solr.mode=cloud
 atlas.graph.index.search.solr.zookeeper-url=<the ZK quorum setup for solr as comma separated value> eg: 10.1.6.4:2181,10.1.6.5:2181
 atlas.graph.index.search.solr.zookeeper-connect-timeout=<SolrCloud Zookeeper Connection Timeout>. Default value is 60000 ms
 atlas.graph.index.search.solr.zookeeper-session-timeout=<SolrCloud Zookeeper Session Timeout>. Default value is 60000 ms
</verbatim>

   * Restart Atlas

For more information on Titan Solr configuration, please refer to http://s3.thinkaurelius.com/docs/titan/0.5.4/solr.htm

Pre-requisites for running Solr in cloud mode
  * Memory - Solr is both memory and CPU intensive. Make sure the server running Solr has adequate memory, CPU and disk.
    Solr works well with 32GB RAM. Plan to provide as much memory as possible to the Solr process.
  * Disk - If the number of entities that need to be stored is large, plan to have at least 500 GB of free space in the volume where Solr is going to store the index data.
  * !SolrCloud has support for replication and sharding. It is highly recommended to use !SolrCloud with at least two Solr nodes running on different servers with replication enabled.
    If using !SolrCloud, then you also need !ZooKeeper installed and configured with 3 or 5 !ZooKeeper nodes.

*Configuring Kafka Topics*

Atlas uses Kafka to ingest metadata from other components at runtime. This is described in more detail on the
[[Architecture][Architecture page]]. Depending on the configuration of Kafka, sometimes you might need to set up the topics explicitly before
using Atlas. To do so, Atlas provides a script =bin/atlas_kafka_setup.py= which can be run from the Atlas server. In some
environments, the hooks might start getting used before the Atlas server itself is set up. In such cases, the topics
can be created on the hosts where hooks are installed using a similar script =hook-bin/atlas_kafka_setup_hook.py=. Both these
scripts use the configuration in =atlas-application.properties= for setting up the topics. Please refer to the [[Configuration][Configuration page]]
for these details.
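
A usage sketch, assuming the connection details are picked up from =atlas-application.properties= as described above:
<verbatim>
# on the Atlas server host
bin/atlas_kafka_setup.py

# on hosts where only the hooks are installed
hook-bin/atlas_kafka_setup_hook.py
</verbatim>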

---++++ Setting up Atlas

There are a few steps that set up dependencies of Atlas. One such example is setting up the Titan schema
in the storage backend of choice. In a simple single server setup, these are automatically set up with default
configuration when the server first accesses these dependencies.

However, there are scenarios when we may want to run setup steps explicitly as one time operations. For example, in a
multiple server scenario using [[HighAvailability][High Availability]], it is preferable to run setup steps from one
of the server instances the first time, and then start the services.

To run these steps one time, execute the command =bin/atlas_start.py -setup= from a single Atlas server instance.

However, the Atlas server does take care of parallel executions of the setup steps, and running the setup steps multiple
times is idempotent. Therefore, if one chooses to run the setup steps as part of server startup for convenience,
they should enable the configuration option =atlas.server.run.setup.on.start= by defining it with the value =true=
in the =atlas-application.properties= file.
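
For example, either run the setup once explicitly, or enable it at server startup (a sketch using the options described above):
<verbatim>
# one-time setup, run from a single Atlas server instance
bin/atlas_start.py -setup

# or, in conf/atlas-application.properties, to run setup on every start
atlas.server.run.setup.on.start=true
</verbatim>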

---++++ Starting Atlas Server

<verbatim>
bin/atlas_start.py [-port <port>]
</verbatim>

   * By default, the Atlas server listens on port 21000. To change the port, use the -port option.
   * By default, the Atlas server starts with conf from {package dir}/conf. To override this (to use the same conf with multiple Atlas upgrades), set the environment variable ATLAS_CONF to the path of the conf dir.

---+++ Using Atlas

   * Quick start model - sample model and data
<verbatim>
  bin/quick_start.py [<atlas endpoint>]
</verbatim>

   * Verify if the server is up and running
<verbatim>
  curl -v http://localhost:21000/api/atlas/admin/version
  {"Version":"v0.1"}
</verbatim>

   * List the types in the repository
<verbatim>
  curl -v http://localhost:21000/api/atlas/types
  {"results":["Process","Infrastructure","DataSet"],"count":3,"requestId":"1867493731@qtp-262860041-0 - 82d43a27-7c34-4573-85d1-a01525705091"}
</verbatim>

   * List the instances for a given type
<verbatim>
  curl -v http://localhost:21000/api/atlas/entities?type=hive_table
  {"requestId":"788558007@qtp-44808654-5","list":["cb9b5513-c672-42cb-8477-b8f3e537a162","ec985719-a794-4c98-b98f-0509bd23aac0","48998f81-f1d3-45a2-989a-223af5c1ed6e","a54b386e-c759-4651-8779-a099294244c4"]}

  curl -v http://localhost:21000/api/atlas/entities/list/hive_db
</verbatim>

   * Search for entities (instances) in the repository
<verbatim>
  curl -v http://localhost:21000/api/atlas/discovery/search/dsl?query="from hive_table"
</verbatim>


*Dashboard*

Once Atlas is started, you can view the status of Atlas entities using the web-based dashboard. Open your browser at the corresponding port (e.g. http://localhost:21000/ for a default local install) to use the web UI.


---+++ Stopping Atlas Server

<verbatim>
bin/atlas_stop.py
</verbatim>

---+++ Known Issues

---++++ Setup

If the setup of the Atlas service fails for any reason, the next run of setup (either by an explicit invocation of
=atlas_start.py -setup= or by enabling the configuration option =atlas.server.run.setup.on.start=) will fail with
a message such as =A previous setup run may not have completed cleanly.=. In such cases, you would need to manually
ensure the setup can run, and delete the !ZooKeeper node at =/apache_atlas/setup_in_progress= before attempting to
run setup again.
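
For instance, the stale node could be removed with the ZooKeeper CLI (the connect string is a placeholder for your quorum):
<verbatim>
zkCli.sh -server <zookeeper_host:port>
# then, at the zk prompt:
delete /apache_atlas/setup_in_progress
</verbatim>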

If the setup failed due to HBase Titan schema setup errors, it may be necessary to repair the HBase schema. If no
data has been stored, one can also disable and drop the 'titan' schema in HBase to let setup run again.
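
If you choose to drop the table (only when no data has been stored), a sketch of the corresponding HBase shell commands would be:
<verbatim>
hbase shell
disable 'titan'
drop 'titan'
</verbatim>
Note that the table name will differ if =atlas.graph.storage.hbase.table= was overridden as described earlier in this page.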