---+ Configuring Apache Atlas - Application Properties

All configuration in Atlas uses Java properties style configuration. The main configuration file is atlas-application.properties, which is in the *conf* directory at the deployed location. It consists of the following sections:


---++ Graph Configs

---+++ Graph persistence engine

This section sets up the graph db - Titan - to use a persistence engine. Please refer to
http://s3.thinkaurelius.com/docs/titan/0.5.4/titan-config-ref.html for more
details. The example below uses BerkeleyDB JE.

<verbatim>
atlas.graph.storage.backend=berkeleyje
atlas.graph.storage.directory=data/berkeley
</verbatim>

---++++ Graph persistence engine - HBase

Basic configuration

<verbatim>
atlas.graph.storage.backend=hbase
# For standalone mode, specify localhost
# For distributed mode, specify the ZooKeeper quorum here - for more information refer to http://s3.thinkaurelius.com/docs/titan/current/hbase.html#_remote_server_mode_2
atlas.graph.storage.hostname=<ZooKeeper Quorum>
</verbatim>
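
For example, in distributed mode the hostname carries the ZooKeeper quorum as a comma separated list (the host names below are illustrative):
<verbatim>
atlas.graph.storage.hostname=zk1.company.com,zk2.company.com,zk3.company.com
</verbatim>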

The HBASE_CONF_DIR environment variable needs to be set to point to the HBase client configuration directory, which is added to the classpath when Atlas starts up.
hbase-site.xml needs to have the following property set according to the cluster setup:
<verbatim>
# Set the below to /hbase-secure if the HBase server is setup in secure mode
zookeeper.znode.parent=/hbase-unsecure
</verbatim>
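
For example, the variable can be exported in the environment that starts Atlas, such as conf/atlas-env.sh (the path below is an assumed location of the HBase client configuration):
<verbatim>
export HBASE_CONF_DIR=/etc/hbase/conf
</verbatim>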

Advanced configuration

# If you are planning to use any of the configs mentioned below, they need to be prefixed with "atlas.graph." to take effect in Atlas
Refer to http://s3.thinkaurelius.com/docs/titan/0.5.4/titan-config-ref.html#_storage_hbase
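
As an illustration, a Titan option such as storage.hbase.table from that page would be specified as in this sketch (the table name below is an assumed example, not a default):
<verbatim>
# Titan's storage.hbase.table option, prefixed with "atlas.graph."
atlas.graph.storage.hbase.table=atlas_titan
</verbatim>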

Permissions

When Atlas is configured with HBase as the storage backend, the graph db (Titan) needs sufficient user permissions to be able to create and access an HBase table. In a secure cluster it may be necessary to grant permissions to the 'atlas' user for the 'titan' table.

With Ranger, a policy can be configured for 'titan'.

Without Ranger, HBase shell can be used to set the permissions.

<verbatim>
   su hbase
   kinit -k -t <hbase keytab> <hbase principal>
   echo "grant 'atlas', 'RWXCA', 'titan'" | hbase shell
</verbatim>

Note that if the embedded-hbase-solr profile is used then HBase is included in the distribution so that a standalone
instance of HBase can be started as the default storage backend for the graph repository. Using the embedded-hbase-solr
profile will configure Atlas so that an HBase instance will be started and stopped along with the Atlas server by default.
To use the embedded-hbase-solr profile please see "Building Atlas" in the [[InstallationSteps][Installation Steps]]
section.
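
For reference, a sketch of the corresponding build invocation (see "Building Atlas" for the authoritative command and options):
<verbatim>
mvn clean package -Pdist,embedded-hbase-solr
</verbatim>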

---+++ Graph Search Index
This section sets up the graph db - Titan - to use a search indexing system. The example
configuration below sets up to use an embedded Elasticsearch indexing system.

<verbatim>
atlas.graph.index.search.backend=elasticsearch
atlas.graph.index.search.directory=data/es
atlas.graph.index.search.elasticsearch.client-only=false
atlas.graph.index.search.elasticsearch.local-mode=true
atlas.graph.index.search.elasticsearch.create.sleep=2000
</verbatim>

---++++ Graph Search Index - Solr
Please note that a Solr installation in Cloud mode is a prerequisite before configuring Solr as the search indexing backend. Refer to the [[InstallationSteps][Installation Steps]] section for Solr installation/configuration.

<verbatim>
 atlas.graph.index.search.backend=solr5
 atlas.graph.index.search.solr.mode=cloud
 atlas.graph.index.search.solr.zookeeper-url=<the ZK quorum setup for solr as comma separated value> eg: 10.1.6.4:2181,10.1.6.5:2181
 atlas.graph.index.search.solr.zookeeper-connect-timeout=<SolrCloud Zookeeper Connection Timeout>. Default value is 60000 ms
 atlas.graph.index.search.solr.zookeeper-session-timeout=<SolrCloud Zookeeper Session Timeout>. Default value is 60000 ms
</verbatim>

Also note that if the embedded-hbase-solr profile is used then Solr is included in the distribution so that a standalone
instance of Solr can be started as the default search indexing backend. Using the embedded-hbase-solr profile will
configure Atlas so that the standalone Solr instance will be started and stopped along with the Atlas server by default.
To use the embedded-hbase-solr profile please see "Building Atlas" in the [[InstallationSteps][Installation Steps]]
section.

---+++ Choosing between Persistence Backends

Refer to http://s3.thinkaurelius.com/docs/titan/0.5.4/bdb.html and http://s3.thinkaurelius.com/docs/titan/0.5.4/hbase.html for choosing between the persistence backends.
BerkeleyDB is suitable for smaller data sets, in the range of up to 10 million vertices, with ACID guarantees.
HBase, on the other hand, does not provide ACID guarantees but is able to scale for larger graphs. HBase also provides HA inherently.

---+++ Choosing between Indexing Backends

Refer to http://s3.thinkaurelius.com/docs/titan/0.5.4/elasticsearch.html and http://s3.thinkaurelius.com/docs/titan/0.5.4/solr.html for choosing between !ElasticSearch and Solr.
Solr in cloud mode is the recommended setup.

---+++ Switching Persistence Backend

For switching the storage backend from BerkeleyDB to HBase and vice versa, refer to the "Graph persistence engine" documentation above and restart Atlas.
The data in the indexing backend needs to be cleared as well, else there will be discrepancies between the storage and indexing backends, which could result in errors during search.
!ElasticSearch runs by default in embedded mode, and the data can easily be cleared by deleting the ATLAS_HOME/data/es directory.
For Solr, the collections which were created during Atlas installation - vertex_index, edge_index, fulltext_index - can be deleted, which will clean up the indexes.
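
As a sketch, assuming a SolrCloud installation whose solr script is available, the collections can be removed with:
<verbatim>
bin/solr delete -c vertex_index
bin/solr delete -c edge_index
bin/solr delete -c fulltext_index
</verbatim>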

---+++ Switching Index Backend

Switching the index backend requires clearing the persistence backend data. Otherwise there will be discrepancies between the persistence and index backends, since switching the indexing backend means index data will be lost.
This leads to "Fulltext" queries not working on the existing data.
For clearing the data for BerkeleyDB, delete the ATLAS_HOME/data/berkeley directory.
For clearing the data for HBase, in the HBase shell, run 'disable titan' and 'drop titan'.
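
A minimal sketch of the corresponding HBase shell commands:
<verbatim>
# run inside the HBase shell
disable 'titan'
drop 'titan'
</verbatim>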


---++ Lineage Configs

The higher layer services like lineage, schema, etc. are driven by the type system, and this section encodes the specific types for the Hive data model.

# This model reflects the base super types for Data and Process
<verbatim>
atlas.lineage.hive.table.type.name=DataSet
atlas.lineage.hive.process.type.name=Process
atlas.lineage.hive.process.inputs.name=inputs
atlas.lineage.hive.process.outputs.name=outputs

## Schema
atlas.lineage.hive.table.schema.query=hive_table where name=?, columns
</verbatim>

---++ Search Configs
Search APIs (DSL and full text search) support pagination and have optional limit and offset arguments. The following configs are related to search pagination:

<verbatim>
# Default limit used when limit is not specified in API
atlas.search.defaultlimit=100

# Maximum limit allowed in API. Limits maximum results that can be fetched to make sure the atlas server doesn't run out of memory
atlas.search.maxlimit=10000
</verbatim>
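
For illustration, a DSL search request could pass these arguments explicitly (the host, credentials, and endpoint path below are assumptions for a default local setup):
<verbatim>
# limit may not exceed atlas.search.maxlimit; defaults apply when the arguments are omitted
curl -u admin:admin "http://localhost:21000/api/atlas/discovery/search/dsl?query=hive_table&limit=100&offset=0"
</verbatim>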


---++ Notification Configs
Refer to http://kafka.apache.org/documentation.html#configuration for Kafka configuration. All Kafka configs should be prefixed with 'atlas.kafka.'

<verbatim>
atlas.notification.embedded=true
atlas.kafka.data=${sys:atlas.home}/data/kafka
atlas.kafka.zookeeper.connect=localhost:9026
atlas.kafka.bootstrap.servers=localhost:9027
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas
</verbatim>

Note that Kafka group ids are specified for a specific topic. The Kafka group id configuration for entity notifications is 'atlas.kafka.entities.group.id'

<verbatim>
atlas.kafka.entities.group.id=<consumer id>
</verbatim>

These configuration parameters are useful for setting up Kafka topics via Atlas provided scripts, described in the
[[InstallationSteps][Installation Steps]] page.

<verbatim>
# Whether to create the topics automatically, default is true
atlas.notification.create.topics=true
# Comma separated list of topics to be created, default is "ATLAS_HOOK,ATLAS_ENTITIES"
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
# Number of replicas for the Atlas topics, default is 1. Increase for higher resilience to Kafka failures.
atlas.notification.replicas=1
# Enable the below two properties if Kafka is running in Kerberized mode.
# Set this to the service principal representing the Kafka service
atlas.notification.kafka.service.principal=kafka/_HOST@EXAMPLE.COM
# Set this to the location of the keytab file for Kafka
#atlas.notification.kafka.keytab.location=/etc/security/keytabs/kafka.service.keytab
</verbatim>

These configuration parameters are useful for saving messages in case there are issues in reaching Kafka for
sending messages.

<verbatim>
# Whether to save messages that failed to be sent to Kafka, default is true
atlas.notification.log.failed.messages=true
# If saving messages is enabled, the file name to save them to. This file will be created under the log directory of the hook's host component - like HiveServer2
atlas.notification.failed.messages.filename=atlas_hook_failed_messages.log
</verbatim>

---++ Client Configs
<verbatim>
atlas.client.readTimeoutMSecs=60000
atlas.client.connectTimeoutMSecs=60000
atlas.rest.address=<http/https>://<atlas-fqdn>:<atlas port> - default http://localhost:21000
</verbatim>


---++ Security Properties

---+++ SSL config
The following property is used to toggle the SSL feature.

<verbatim>
atlas.enableTLS=false
</verbatim>
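
When TLS is enabled, keystore and truststore properties are typically configured as well; the sketch below uses assumed file paths (refer to the Atlas security documentation for the complete list of SSL properties):
<verbatim>
# assumed example paths
keystore.file=/path/to/keystore.jks
truststore.file=/path/to/truststore.jks
cert.stores.credential.provider.path=jceks://file/path/to/credentialstore.jceks
</verbatim>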

---++ High Availability Properties

The following properties describe High Availability related configuration options:

<verbatim>
# Set the following property to true, to enable High Availability. Default = false.
atlas.server.ha.enabled=true

# Define a unique set of strings to identify each instance that should run an Atlas Web Service instance as a comma separated list.
atlas.server.ids=id1,id2
# For each string defined above, define the host and port on which the Atlas server binds.
atlas.server.address.id1=host1.company.com:21000
atlas.server.address.id2=host2.company.com:31000

# Specify Zookeeper properties needed for HA.
# Specify the list of services running Zookeeper servers as a comma separated list.
atlas.server.ha.zookeeper.connect=zk1.company.com:2181,zk2.company.com:2181,zk3.company.com:2181
# Specify how many times a connection to the Zookeeper cluster should be attempted, in case of any connection issues.
atlas.server.ha.zookeeper.num.retries=3
# Specify how much time the server should wait before attempting connections to Zookeeper, in case of any connection issues.
atlas.server.ha.zookeeper.retry.sleeptime.ms=1000
# Specify how long a session to Zookeeper should last without activity before being deemed unreachable.
atlas.server.ha.zookeeper.session.timeout.ms=20000

# Specify the scheme and the identity to be used for setting up ACLs on nodes created in Zookeeper for HA.
# The format of these options is <scheme>:<identity>. For more information refer to http://zookeeper.apache.org/doc/r3.2.2/zookeeperProgrammers.html#sc_ZooKeeperAccessControl.
# The 'acl' option allows specifying a scheme, identity pair to set up an ACL for.
atlas.server.ha.zookeeper.acl=sasl:client@company.com
# The 'auth' option specifies the authentication that should be used for connecting to Zookeeper.
atlas.server.ha.zookeeper.auth=sasl:client@company.com

# Since Zookeeper is a shared service that is typically used by many components,
# it is preferable for each component to set its znodes under a namespace.
# Specify the namespace under which the znodes should be written. Default = /apache_atlas
atlas.server.ha.zookeeper.zkroot=/apache_atlas

# Specify the number of times a client should retry with an instance before selecting another active instance, or failing an operation.
atlas.client.ha.retries=4
# Specify the interval between retries for a client.
atlas.client.ha.sleep.interval.ms=5000
</verbatim>

---++ Server Properties

<verbatim>
# Set the following property to true, to enable the setup steps to run on each server start. Default = false.
atlas.server.run.setup.on.start=false
</verbatim>

---++ Performance configuration items

The following properties can be used to tune the performance of Atlas under specific circumstances:

<verbatim>
# The number of times Atlas code tries to acquire a lock (to ensure consistency) while committing a transaction.
# This should be related to the amount of concurrency expected to be supported by the server. For example, with retries set to 10, up to 100 threads can concurrently create types in the Atlas system.
# If this is set to a low value (default is 3), concurrent operations might fail with a PermanentLockingException.
atlas.graph.storage.lock.retries=10

# Milliseconds to wait before evicting a cached entry. This should be > atlas.graph.storage.lock.wait-time x atlas.graph.storage.lock.retries
# If this is set to a low value (default is 10000), warnings on transactions taking too long will occur in the Atlas application log.
atlas.graph.storage.cache.db-cache-time=120000

# Minimum number of threads in the atlas web server
atlas.webserver.minthreads=10

# Maximum number of threads in the atlas web server
atlas.webserver.maxthreads=100

# Keepalive time in secs for the thread pool of the atlas web server
atlas.webserver.keepalivetimesecs=60

# Queue size for the requests (when max threads are busy) for the atlas web server
atlas.webserver.queuesize=100
</verbatim>

---+++ Recording performance metrics

The Atlas package should be built with '-P perf' to instrument the Atlas code to collect metrics. The metrics will be recorded in
<atlas.log.dir>/metric.log, with one log line per API call. The metrics contain the number of times the instrumented methods
are called and the total time spent in the instrumented method. Logging to metric.log is controlled through the log4j configuration
in atlas-log4j.xml. When the Atlas code is instrumented, to disable logging to metric.log at runtime, set the log level of the METRICS logger to info level:
<verbatim>
<logger name="METRICS" additivity="false">
    <level value="info"/>
    <appender-ref ref="METRICS"/>
</logger>
</verbatim>
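
A sketch of a corresponding build command (combining '-P perf' with the distribution profile here is an assumption about the packaging setup):
<verbatim>
mvn clean package -Pdist,perf
</verbatim>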