Commit 92cdc6a9 by Ashutosh Mestry

ATLAS-2229: Advanced Search: Documentation update.

parent 4e8e9ca8
...@@ -83,7 +83,7 @@ ...@@ -83,7 +83,7 @@
</dependency> </dependency>
</dependencies> </dependencies>
<configuration> <configuration>
<port>8080</port> <port>8888</port>
</configuration> </configuration>
<executions> <executions>
<execution> <execution>
......
---+ Advanced Search
---+++ Background
Advanced Search in Atlas is also referred to as DSL-based Search.
Domain Specific Language (DSL) is a language with simple constructs that help users navigate the Atlas data repository. The syntax loosely emulates the popular Structured Query Language (SQL) from the relational database world.
Benefits of DSL:
* Abstracts the implementation-level database constructs. This avoids the necessity of knowing about the underlying graph database constructs.
* Users are provided with an abstraction that helps them retrieve the data by just being aware of the types and their relationships within their dataset.
* Allows for a way to specify the desired output.
* Use of classifications is accounted for in the syntax.
* Provides a way to group and aggregate results.
We will be using the quick start dataset in the examples that follow. This dataset is comprehensive enough to demonstrate the various features of the language.
For details on the grammar, please refer to the Atlas DSL Grammar on [[https://github.com/apache/atlas/blob/master/repository/src/main/java/org/apache/atlas/query/antlr4/AtlasDSLParser.g4][Github]] (Antlr G4 format).
---++ Using Advanced Search
Within the Atlas UI, select Advanced in the Search pane on the left.
Notice the _Favorite Searches_ pane below the _Search By Query_ box. As with _Basic Search_, it is possible to save _Advanced Searches_ as well.
---++ Introduction to Domain Specific Language
DSL uses the familiar SQL-like syntax.
At a high level, a query has a _from-where-select_ format. Additional keywords like _groupby_, _orderby_, and _limit_ can be added to affect the output. We will see examples of these below.
---+++ From Clause
Specifying the _from_ clause is mandatory. Using the _from_ keyword itself is optional. The value specified in the _from_ clause acts as the source or starting point for the rest of the query to source its inputs.
Example: To retrieve all entities of type _DB_:
<verbatim>
DB
from DB</verbatim>
In the absence of a _where_ clause to filter the source, the dataset fetched by the _from_ clause is everything from the database. Depending on the size of the data present in the database, this can overwhelm the server. The query processor therefore adds a _limit_ clause with a default value. See the section on the _limit_ clause for details.
---+++ Where Clause
The _where_ clause allows for filtering over the dataset. This is achieved by using conditions within the _where_ clause.
A condition is an identifier followed by an operator followed by a literal. The literal must be enclosed in single or double quotes. For example, _name = "Sales"_. An identifier can be the name of a property of the type specified in the _from_ clause, or an alias.
Example: To retrieve the entity of type _Table_ with a specific name, say 'time_dim':
<verbatim>
from Table where name = 'time_dim'</verbatim>
It is possible to specify multiple conditions by combining them using _and_, _or_ operators.
Example: To retrieve entities of type _Table_ whose name is either 'time_dim' or 'customer_dim':
<verbatim>
from Table where name = 'time_dim' or name = 'customer_dim'</verbatim>
To filter on a list of values, specify the values as a value array: a list of values enclosed within square brackets. This is a simple way to specify an OR clause on an identifier.
Note that having several OR clauses on the same attribute may be inefficient. An alternative is to use a value array, as shown in the example below.
Example: The query in the example above can be written using a value array as shown below.
<verbatim>
from Table where name = ["customer_dim", "time_dim"]</verbatim>
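When queries are generated by a program, a value array can be assembled from a list of values. The following is a minimal sketch in Python. The helper name =value_array_condition= is hypothetical and not part of Atlas, and the sketch assumes the values contain no double quotes.

```python
def value_array_condition(attribute, values):
    """Build a DSL condition such as: name = ["a", "b"]."""
    if not values:
        raise ValueError("at least one value is required")
    # Quote each value with double quotes and join them inside square brackets.
    quoted = ", ".join('"{}"'.format(v) for v in values)
    return "{} = [{}]".format(attribute, quoted)


query = "from Table where " + value_array_condition("name", ["customer_dim", "time_dim"])
print(query)
```

The printed query is the same as the value array example above.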
A condition that uses the LIKE operator allows for filtering using wildcards like '*' or '?'.
Example: To retrieve entity of type _Table_ whose name ends with '_dim':
<verbatim>
from Table where name LIKE '*_dim'</verbatim>
More elaborate wildcard patterns can also be used.
Example: To retrieve _DB_ whose name starts with _R_, followed by any 3 characters, followed by _rt_, followed by at least 1 character and then any number of further characters (including none):
<verbatim>
DB where name like "R???rt?*"</verbatim>
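The wildcard behavior described above ('*' for any run of characters, '?' for exactly one character) resembles shell-style globbing. As a rough illustration only, Python's standard =fnmatch= module can show which names such a pattern would match. This is an analogy, not a description of how Atlas evaluates _like_ conditions; the sample names are taken from examples elsewhere in this document.

```python
from fnmatch import fnmatchcase

names = ["time_dim", "customer_dim", "sales_fact", "Reporting"]

# '*' matches any run of characters, '?' matches exactly one character.
dim_tables = [n for n in names if fnmatchcase(n, "*_dim")]
r_names = [n for n in names if fnmatchcase(n, "R???rt?*")]

print(dim_tables)
print(r_names)
```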
---++++ Using Date Literals
Dates used in literals need to be specified using the ISO 8601 format.
Dates in this format follow this notation:
* _yyyy-MM-ddTHH:mm:ss.SSSZ_: year-month-day followed by the time in hours-minutes-seconds-milliseconds. The date and time must be separated by 'T', and the literal must end with 'Z'.
* _yyyy-MM-dd_: year-month-day.
Example: A date literal representing December 11, 2017 at 2:35 AM:
<verbatim>2017-12-11T02:35:00.000Z</verbatim>
Example: To retrieve entities of type _Table_ created after January 1, 2017 and before January 1, 2018:
<verbatim>
from Table where createTime < '2018-01-01' and createTime > '2017-01-01'</verbatim>
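When the date range comes from a program rather than being typed by hand, the literal can be produced from a timestamp. The sketch below, in Python, formats a timezone-aware =datetime= in the _yyyy-MM-ddTHH:mm:ss.SSSZ_ notation described above. The helper name =to_dsl_date= is hypothetical and not part of Atlas.

```python
from datetime import datetime, timezone


def to_dsl_date(moment):
    """Format a timezone-aware datetime as yyyy-MM-ddTHH:mm:ss.SSSZ (UTC)."""
    utc = moment.astimezone(timezone.utc)
    # strftime has no millisecond directive, so derive it from microseconds.
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + ".{:03d}Z".format(millis)


start = to_dsl_date(datetime(2017, 1, 1, tzinfo=timezone.utc))
end = to_dsl_date(datetime(2018, 1, 1, tzinfo=timezone.utc))
query = "from Table where createTime < '{}' and createTime > '{}'".format(end, start)
print(query)
```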
---++++ Using Boolean Literals
Boolean properties of entities can be used within queries.
Example: To retrieve entities of type hdfs_path whose attribute _isFile_ is set to _true_ or whose name is _Invoice_:
<verbatim>
from hdfs_path where isFile = true or name = "Invoice"</verbatim>
Valid values for boolean literals are 'true' and 'false'.
---+++ Existence of a Property
The _has_ keyword can be used with or without the _where_ clause. It is used to check for the existence of a property in an entity.
Example: To retrieve entities of type Table that have the property locationUri:
<verbatim>
Table has locationUri
from Table where Table has locationUri
</verbatim>
---+++ Select Clause
The output displayed on the web page is tabular: each row corresponds to an entity and each column to a property of that entity. The _select_ clause allows for choosing the properties of the entity that are of interest.
Example: To retrieve entities of type _Table_ with a few properties:
<verbatim>
from Table select owner, name, qualifiedName
</verbatim>
Example: To retrieve entity of type Table for a specific table with some properties.
<verbatim>
from Table where name = 'customer_dim' select owner, name, qualifiedName</verbatim>
To display column headers that are more meaningful, aliases can be added using the 'as' clause.
Example: To display column headers as 'Owner', 'Name' and 'FullName'.
<verbatim>
from Table select owner as Owner, name as Name, qualifiedName as FullName</verbatim>
---++++ Note About Select Clauses
Select clauses can be complex to use. These are the rules to remember:
* Works with all immediate attributes.
* Works with immediate attributes and aggregation on immediate attributes.
* Referred attributes cannot be mixed with immediate attributes.
Example: To retrieve the entity of type Table with name 'abcd' and display the 'name' and 'owner' attributes of the referred entity DB:
<verbatim>
Table where name = 'abcd' select DB.name, DB.owner</verbatim>
The current implementation does not allow the following, because it mixes referred and immediate attributes:
<verbatim>
Table where name = 'abcd' select DB.name, Table.name</verbatim>
---+++ Classification-based Filtering
In order to retrieve entities based on classification, a query would use _is_ or _isa_ keywords.
Example: To retrieve all entities of type _Table_ that are tagged with _Dimension_ classification.
<verbatim>
from Table isa Dimension</verbatim>
Since _from_ is optional and _is_ and _isa_ are equivalent, the following query yields the same results:
<verbatim>
Table is Dimension</verbatim>
The _is_ and _isa_ clauses can also be used in a _where_ condition:
<verbatim>
from Table where Table isa Dimension</verbatim>
To search for all entities having a particular classification, simply use the name of the classification.
Example: To retrieve all entities that have _Dimension_ classification.
<verbatim>
Dimension</verbatim>
---+++ Limit & Offset Clauses
Often a query yields a large number of results. To limit the outcome of the query, the _limit_ and _offset_ clauses are used.
Example: To retrieve only 5 entities from a result set:
<verbatim>
Column limit 5</verbatim>
The _offset_ clause retrieves results after the offset value.
Example: To retrieve only 5 entities from the result set after skipping the first 10.
<verbatim>
Column limit 5 offset 10</verbatim>
The _limit_ and _offset_ clauses are usually specified in conjunction.
If no _limit_ clause is specified in the query, a _limit_ clause with a default limit (usually 100) is added to the query. This prevents the query from inadvertently fetching a large number of results.
The _offset_ clause is useful for displaying results in a user interface where only a few results from the result set are shown at a time and more are fetched as the user advances to the next page.
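The paging pattern can be sketched as follows. This is a minimal illustration in Python; the helper name =paged_queries= is hypothetical, and it only builds the query strings (one per page) rather than executing them.

```python
def paged_queries(base_query, page_size, pages):
    """Yield one DSL query per page using limit and offset."""
    for page in range(pages):
        # The offset is the number of results skipped before this page.
        yield "{} limit {} offset {}".format(base_query, page_size, page * page_size)


for q in paged_queries("Column", 5, 3):
    print(q)
```

The third query generated here is the same as the _offset_ example above, which skips the first 10 results.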
---+++ Ordering Results
The _orderby_ clause allows for sorting of results. Results are sorted in ascending order by default. Only immediate attributes can be used within this clause.
Ordering can be changed by using:
* ASC: Sort in ascending order. This is the default when no ordering is specified after the _orderby_ clause.
* DESC: Sort in descending order. This must be specified explicitly after the _orderby_ clause.
Example: To retrieve the entities of type _Column_ that are sorted in ascending order using the name property.
<verbatim>
from Column orderby name
from Column orderby name asc</verbatim>
Example: Same results as above except that they are sorted in descending order.
<verbatim>
from Column orderby name desc</verbatim>
---+++ Aggregate Functions
Let's look at aggregate functions:
* _sum_: Adds (sums up) the values of the property specified, within the result set.
* _min_: Finds the minimum value of the property specified, within a result set.
* _max_: Finds the maximum value of the property specified, within a result set.
* _count_: Finds the number of items specified by the group by clause.
These work only on immediate attributes.
More examples of these appear in the _Grouping Results_ section.
---++++ The count Keyword
The _count_ keyword shows the number of items in a result set.
Example: To find out how many entities of type Column exist:
<verbatim>
Column select count()</verbatim>
Example: Same as above with alias.
<verbatim>
Column select count() as Cols</verbatim>
Example: To find the number of tables in a database.
<verbatim>
Table where db.name = "Reporting" select count()</verbatim>
---++++ The max Keyword
Using this keyword it is possible to retrieve the maximum value of a property for an entity.
Example: Get the most recent value of the _createTime_ property of the Table entity.
<verbatim>
Table select max(createTime)</verbatim>
---++++ The min Keyword
Using this keyword it is possible to retrieve the minimum value of a property for an entity.
Example: Get the oldest value of the _createTime_ property of the Table entity.
<verbatim>
Table select min(createTime)</verbatim>
---+++ Grouping Results
The _groupby_ clause groups the results using the specified property.
Example: To retrieve entities of type Table such that tables belonging to an owner are together (grouped by owner):
<verbatim>
Table groupby(owner)</verbatim>
While _groupby_ can work without _select_, if aggregate functions are used within _select_ clause, using _groupby_ clause becomes mandatory as aggregate functions operate on a group.
Example: To retrieve entities of type Table such that we know the most recently created entity:
<verbatim>
Table groupby(createTime) select owner, name, max(createTime)</verbatim>
Example: To retrieve entities of type Table such that we know the oldest entity:
<verbatim>
Table groupby(createTime) select owner, name, min(createTime)</verbatim>
Example: To know the number of entities owned by each owner.
<verbatim>
Table groupby(owner) select owner, count()</verbatim>
---+++ Where Clause With Complex Types
In the discussion so far we looked at _where_ clauses with primitive types. This section looks at using properties that are non-primitive types.
In this model, the DB is modeled such that it is aware of all the _Table_ instances it contains. A _Table_, on the other hand, is aware of the existence of the DB but is not aware of the other _Table_ instances within the system. Each _Table_ maintains a reference to the _DB_ it belongs to.
A similar structure exists within the _hive_ data model.
Example: To retrieve all the instances of the _Table_ belonging to a database named 'Sales':
<verbatim>
Table where db.name = "Sales"</verbatim>
The entity Column is modeled in a similar way. Each Table entity has outward edges pointing to Column entity instances corresponding to each column within the table.
Example: To retrieve all the Column entities for a given Table.
<verbatim>
Table where name = "time_dim" select columns</verbatim>
The properties of each _Column_ entity are displayed.
---+++ Using System Attributes
Each type defined within Atlas gets a few attributes by default. These attributes help with internal bookkeeping of the entities. All the system attributes are prefixed with '__' (double underscore), which helps distinguish them from other attributes.
The following are the system attributes:
* __guid Each entity within Atlas is assigned a globally unique identifier (GUID for short).
* __modifiedBy Name of the user who last modified the entity.
* __createdBy Name of the user who created the entity.
* __state Current state of the entity. Please see below for details.
* __timestamp Timestamp (date represented as integer) of the entity at the time of creation.
* __modificationTimestamp Timestamp (date represented as integer) of the entity at the time of last modification.
---++++ State of an Entity
An entity within Atlas can be in one of the following states:
* ACTIVE This is the state of an entity that is available and in use within the system. Such entities are retrieved by searches by default.
* DELETED When an entity is deleted, its state is marked as DELETED. An entity in this state does not show up in search results; an explicit request must be made to retrieve it.
---++++ Using System Attributes in Queries
Example: To retrieve all entities that are deleted.
<verbatim>
Asset where __state = "DELETED"</verbatim>
Example: To retrieve entity GUIDs.
<verbatim>
Table select __guid</verbatim>
Example: To retrieve several system attributes.
<verbatim>
hive_db select __timestamp, __modificationTimestamp, __state, __createdBy</verbatim>
---++ Advanced Search REST API
Relevant models for these operations:
* _[[https://github.com/apache/atlas/blob/master/intg/src/main/java/org/apache/atlas/model/discovery/AtlasSearchResult.java][AtlasSearchResult]]_
* _[[https://github.com/apache/atlas/blob/master/intg/src/main/java/org/apache/atlas/exception/AtlasBaseException.java][AtlasBaseException]]_
---+++ The V2 API
|*Get Results using DSL Search*||
| _Example_ | See the Examples section below. |
|_URL_|_api/atlas/v2/search/dsl_|
|_Method_|_GET_|
|_URL Parameters_|_query_: Query conforming to DSL syntax.|
||_typeName_: Type name of the entity to be retrieved.|
||_classification_: Classification associated with the type or query.|
||_limit_: Maximum number of items in the result set.|
||_offset_: Starting index of the item in the result set.|
|_Data Parameters_|_None_|
|_Success Response_|The JSON will correspond to [[https://github.com/apache/atlas/blob/master/intg/src/main/java/org/apache/atlas/model/discovery/AtlasSearchResult.java][AtlasSearchResult]].|
|_Error Response_|Errors that are handled within the system will be returned as [[https://github.com/apache/atlas/blob/master/intg/src/main/java/org/apache/atlas/exception/AtlasBaseException.java][AtlasBaseException]].|
|_Method Signature_|@GET|
||@Path("/dsl")|
||@Consumes(Servlets.JSON_MEDIA_TYPE)|
||@Produces(Servlets.JSON_MEDIA_TYPE)|
*Examples*
<verbatim>
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?typeName=Table"
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?typeName=Column&classification=PII"
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?typeName=Table&classification=Dimension&limit=10&offset=2"
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?query=Table%20isa%20Dimension"
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?query=Table%20isa%20Dimension&limit=5&offset=2"</verbatim>
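The request URLs above contain URL-encoded DSL queries (for example, spaces become =%20=). The following sketch, in Python, shows one way to build such a URL with the standard library. The helper name =dsl_search_url= is hypothetical; the host, port, and path are copied from the curl examples and should be adjusted for your deployment. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the curl examples; adjust for your deployment.
BASE_URL = "http://localhost:21000/api/atlas/v2/search/dsl"


def dsl_search_url(query, limit=None, offset=None):
    """Return the GET URL for a DSL query with optional limit and offset."""
    params = {"query": query}
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    # quote_via=quote encodes spaces as %20, as in the curl examples.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(dsl_search_url("Table isa Dimension", limit=5, offset=2))
```

Building the URL this way avoids hand-encoding queries that contain quotes, brackets, or other characters that are not safe in a URL.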
---++ Implementation Approach
The general approach followed in the implementation of DSL within Atlas can be enumerated in the following steps:
* The parser parses the incoming query for syntax.
* An abstract syntax tree is generated for a query that is parsed successfully.
* The syntax tree is 'walked' using the visitor pattern.
* Each 'visit' within the tree adds a step to the Gremlin pipeline.
* When done, the generated script is executed using the Gremlin Script Engine.
* Results generated by the query, if any, are processed and packaged in the AtlasSearchResult structure.
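The walk-and-emit idea in the steps above can be illustrated with a toy sketch in Python. Everything in it is hypothetical: the node classes, the visitor, and the emitted step strings are simplified stand-ins chosen for illustration. They do not reproduce the Atlas parser, its syntax tree, or the Gremlin script that Atlas actually generates.

```python
from dataclasses import dataclass


@dataclass
class From:
    type_name: str


@dataclass
class Where:
    attribute: str
    value: str


@dataclass
class Limit:
    count: int


class PipelineBuilder:
    """Toy visitor: each visited node appends one step to the pipeline."""

    def __init__(self):
        self.steps = ["g.V()"]

    def visit(self, node):
        # Dispatch on the node's class name, e.g. From -> visit_From.
        getattr(self, "visit_" + type(node).__name__)(node)

    def visit_From(self, node):
        self.steps.append("has('typeName', '{}')".format(node.type_name))

    def visit_Where(self, node):
        self.steps.append("has('{}', '{}')".format(node.attribute, node.value))

    def visit_Limit(self, node):
        self.steps.append("limit({})".format(node.count))

    def script(self):
        return ".".join(self.steps)


# Hand-built, flattened stand-in for a parsed query:
#   from Table where name = 'time_dim' limit 5
tree = [From("Table"), Where("name", "time_dim"), Limit(5)]
builder = PipelineBuilder()
for node in tree:
    builder.visit(node)
print(builder.script())
```

Keeping one visit method per node type means a new clause can be supported by adding a node class and a matching method, without changing the walking code.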
---++ Differences Between Master and Earlier Versions
The following clauses are no longer supported:
* path
* loop
---++ Resources
* Antlr [[https://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference][Book]].
* Antlr [[https://github.com/antlr/antlr4/blob/master/doc/getting-started.md][Quick Start]].
* Atlas DSL Grammar on [[https://github.com/apache/atlas/blob/master/repository/src/main/java/org/apache/atlas/query/antlr4/AtlasDSLParser.g4][Github]] (Antlr G4 format).
---+ Search ---+ Basic Search
Atlas exposes search over the metadata in two ways:
* Basic Search
* Advanced Search (DSL or Full-Text)
---++ Basic search
The basic search allows you to query using typename of an entity, associated classification/tag The basic search allows you to query using typename of an entity, associated classification/tag
and has support for filtering on the entity attribute(s) as well as the classification/tag attributes. and has support for filtering on the entity attribute(s) as well as the classification/tag attributes.
...@@ -166,138 +161,3 @@ __CURL Samples__ ...@@ -166,138 +161,3 @@ __CURL Samples__
}' }'
<protocol>://<atlas_host>:<atlas_port>/api/atlas/v2/search/basic <protocol>://<atlas_host>:<atlas_port>/api/atlas/v2/search/basic
</verbatim> </verbatim>
---++ Advanced Search
---+++ Search DSL Grammar
The DSL exposes an SQL like query language for searching the metadata based on the type system.
The grammar for the DSL is below.
<verbatim>
queryWithPath: query ~ opt(WITHPATH)
query: querySrc ~ opt(loopExpression) ~ opt(groupByExpr) ~ opt(selectClause) ~ opt(orderby) ~ opt(limitOffset)
querySrc: rep1sep(singleQrySrc, opt(COMMA))
singleQrySrc = FROM ~ fromSrc ~ opt(WHERE) ~ opt(expr ^? notIdExpression) |
WHERE ~ (expr ^? notIdExpression) |
expr ^? notIdExpression |
fromSrc ~ opt(WHERE) ~ opt(expr ^? notIdExpression)
fromSrc: identifier ~ AS ~ alias | identifier
groupByExpr = GROUPBY ~ (LPAREN ~> rep1sep(selectExpression, COMMA) <~ RPAREN)
orderby: ORDERBY ~ expr ~ opt (sortOrder)
limitOffset: LIMIT ~ lmt ~ opt (offset)
offset: OFFSET ~ offsetValue
sortOrder = ASC | DESC
loopExpression: LOOP ~ (LPAREN ~> query <~ RPAREN) ~ opt(intConstant <~ TIMES) ~ opt(AS ~> alias)
selectClause: SELECT ~ rep1sep(selectExpression, COMMA)
countClause = COUNT ~ LPAREN ~ RPAREN
maxClause = MAX ~ (LPAREN ~> expr <~ RPAREN)
minClause = MIN ~ (LPAREN ~> expr <~ RPAREN)
sumClause = SUM ~ (LPAREN ~> expr <~ RPAREN)
selectExpression: expr ~ opt(AS ~> alias)
expr: compE ~ opt(rep(exprRight))
exprRight: (AND | OR) ~ compE
compE:
arithE ~ (LT | LTE | EQ | NEQ | GT | GTE) ~ arithE |
arithE ~ (ISA | IS) ~ ident |
arithE ~ HAS ~ ident |
arithE | countClause | maxClause | minClause | sumClause
arithE: multiE ~ opt(rep(arithERight))
arithERight: (PLUS | MINUS) ~ multiE
multiE: atomE ~ opt(rep(multiERight))
multiERight: (STAR | DIV) ~ atomE
atomE: literal | identifier | LPAREN ~> expr <~ RPAREN
identifier: rep1sep(ident, DOT)
alias: ident | stringLit
literal: booleanConstant |
intConstant |
longConstant |
floatConstant |
doubleConstant |
stringLit
</verbatim>
Grammar language:
{noformat}
opt(a) => a is optional
~ => a combinator. 'a ~ b' means a followed by b
rep => zero or more
rep1sep => one or more, separated by second arg.
{noformat}
Language Notes:
* A *!SingleQuery* expression can be used to search for entities of a _Trait_ or _Class_.
Entities can be filtered based on a 'Where Clause' and Entity Attributes can be retrieved based on a 'Select Clause'.
* An Entity Graph can be traversed/joined by combining one or more !SingleQueries.
* An attempt is made to make the expressions look SQL like by accepting keywords "SELECT",
"FROM", and "WHERE"; but these are optional and users can simply think in terms of Entity Graph Traversals.
* The transitive closure of an Entity relationship can be expressed via the _Loop_ expression. A
_Loop_ expression can be any traversal (recursively a query) that represents a _Path_ that ends in an Entity of the same _Type_ as the starting Entity.
* The _!WithPath_ clause can be used with transitive closure queries to retrieve the Path that
connects the two related Entities. (We also provide a higher level interface for Closure Queries
see scaladoc for 'org.apache.atlas.query.ClosureQuery')
* GROUPBY is optional. Group by can be specified with aggregate methods like max, min, sum and count. When group by is specified aggregated results are returned based on the method specified in select clause. Select expression is mandatory with group by expression.
* ORDERBY is optional. When order by clause is specified, case insensitive sorting is done based on the column specified.
For sorting in descending order specify 'DESC' after order by clause. If no order by is specified, then no default sorting is applied.
* LIMIT is optional. It limits the maximum number of objects to be fetched starting from specified optional offset. If no offset is specified count starts from beginning.
* There are couple of Predicate functions different from SQL:
* _is_ or _isa_can be used to filter Entities that have a particular Trait.
* _has_ can be used to filter Entities that have a value for a particular Attribute.
* Any identifiers or constants with special characters(space,$,",{,}) should be enclosed within backquote (`)
---++++ DSL Examples
For the model,
Asset - attributes name, owner, description
DB - supertype Asset - attributes clusterName, parameters, comment
Column - extends Asset - attributes type, comment
Table - supertype Asset - db, columns, parameters, comment
Traits - PII, Log Data
DSL queries:
* from DB
* DB where name="Reporting" select name, owner
* DB where name="Reporting" select name, owner orderby name
* DB where name="Reporting" select name limit 10
* DB where name="Reporting" select name, owner limit 10 offset 0
* DB where name="Reporting" select name, owner orderby name limit 10 offset 5
* DB where name="Reporting" select name, owner orderby name desc limit 10 offset 5
* DB has name
* DB is !JdbcAccess
* Column where Column isa PII
* Table where name="sales_fact", columns
* Table where name="sales_fact", columns as column select column.name, column.dataType, column.comment
* DB groupby(owner) select owner, count()
* DB groupby(owner) select owner, max(name)
* DB groupby(owner) select owner, min(name)
* from Person select count() as 'count', max(Person.age) as 'max', min(Person.age)
* `Log Data`
---+++ Full-text Search
Atlas also exposes a lucene style full-text search capability.
\ No newline at end of file
...@@ -50,7 +50,8 @@ capabilities around these data assets for data scientists, analysts and the data ...@@ -50,7 +50,8 @@ capabilities around these data assets for data scientists, analysts and the data
* [[Architecture][High Level Architecture]] * [[Architecture][High Level Architecture]]
* [[TypeSystem][Type System]] * [[TypeSystem][Type System]]
* [[Search][Search]] * [[Search - Basic][Basic Search]]
* [[Search - Advanced][Advanced Search]]
* [[security][Security]] * [[security][Security]]
* [[Authentication-Authorization][Authentication and Authorization]] * [[Authentication-Authorization][Authentication and Authorization]]
* [[Configuration][Configuration]] * [[Configuration][Configuration]]
......