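1. Introduction to Elasticsearch
Elasticsearch is a document-oriented NoSQL data store and search engine, mainly used for:
- Real-time full-text search
- Log management
- Data analytics
- Application monitoring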
2. Installation
2.1. Prepare the data directories
sudo mkdir -p ~/volumes/es/data0{1..3}/
sudo chmod 777 ~/volumes/es/ -R
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
2.2. Start with Docker
version: '3'
services:
  es01:
    image: elasticsearch:7.4.2
    container_name: es01
    environment:
      - node.name=es01
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"
      - ELASTIC_PASSWORD=sonic333
      - xpack.security.enabled=true
      - TZ=Asia/Shanghai
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ~/volumes/es/data01:/usr/share/elasticsearch/data
    ports:
      - 19200:9200
    networks:
      - esnet
  es02:
    image: elasticsearch:7.4.2
    container_name: es02
    environment:
      - node.name=es02
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"
      - ELASTIC_PASSWORD=sonic333
      - xpack.security.enabled=true
      - TZ=Asia/Shanghai
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ~/volumes/es/data02:/usr/share/elasticsearch/data
    networks:
      - esnet
  es03:
    image: elasticsearch:7.4.2
    container_name: es03
    environment:
      - node.name=es03
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"
      - ELASTIC_PASSWORD=sonic333
      - xpack.security.enabled=true
      - TZ=Asia/Shanghai
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ~/volumes/es/data03:/usr/share/elasticsearch/data
    networks:
      - esnet
  kibana:
    container_name: kibana
    image: kibana:7.4.2
    environment:
      - xpack.security.enabled=true
      - ELASTICSEARCH_HOSTS=http://es01:9200
      - ELASTICSEARCH_USERNAME=elastic
      - ELASTICSEARCH_PASSWORD=sonic333
      - XPACK_GRAPH_ENABLED=true
      - TIMELION_ENABLED=true
      - XPACK_MONITORING_COLLECTION_ENABLED=true
      - SERVER_HOST=0.0.0.0
      - LOGGING_TIMEZONE=Asia/Shanghai
      - TZ=Asia/Shanghai
    ports:
      - 5601:5601
    networks:
      - esnet
networks:
  esnet:
docker-compose up -d
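3. Core concepts
3.1. Index
An index is a collection of documents of the same kind, for example a user index or a product index.
An index is identified by an all-lowercase name, and every CRUD operation on a document must specify the index name.
Each index has its own mapping, which defines the field names and field types of its documents.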
RDBMS | Elasticsearch |
---|---|
Table | Index |
Row | Document |
Column | Field |
Schema | Mapping |
SQL | DSL |
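3.2. Document
A document is stored as JSON; one document represents one record.
Every JSON field has a field type (boolean/number/string/date/binary/range, etc.), which can be set explicitly or inferred automatically by Elasticsearch.
Every document has an ID, which can be supplied by the client or generated by Elasticsearch.
Every document also carries metadata fields:
- _index: name of the index the document belongs to
- _id: document ID
- _source: the original JSON of the document
- _version: document version number
- _score: relevance score of the document in a search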
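3.3. Term
Multiple words form an OR query; word order does not matter.
For example, "Mac OS" matches documents in which the field contains Mac or OS.
3.4. Phrase
Multiple words form a phrase that is searched as a whole.
For example, "Mac OS" matches documents in which the field contains Mac immediately followed by OS.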
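The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not how Elasticsearch evaluates queries internally; the function names `tokenize`, `match_terms` and `match_phrase` are hypothetical, and the tokenizer is a deliberately crude stand-in for a real analyzer.

```python
import re


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_terms(field_text, query):
    """Term (OR) semantics: true if any query word occurs in the field."""
    field_tokens = set(tokenize(field_text))
    return any(term in field_tokens for term in tokenize(query))


def match_phrase(field_text, query):
    """Phrase semantics: all query words occur adjacently and in order."""
    field_tokens = tokenize(field_text)
    query_tokens = tokenize(query)
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        field_tokens[i:i + n] == query_tokens
        for i in range(len(field_tokens) - n + 1)
    )


print(match_terms("Windows OS", "Mac OS"))      # OR query: "OS" is enough
print(match_phrase("Windows OS", "Mac OS"))     # phrase query: no "Mac OS"
print(match_phrase("Intel Mac OS X", "Mac OS"))
```

A field containing only "Windows OS" satisfies the term query because one of the words matches, but it does not satisfy the phrase query.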
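3.5. Inverted index
An inverted index consists of two parts:
- Term dictionary: records every term and maps each term to its posting list.
- Posting list: records information about the documents that contain the term:
  - Document ID
  - Term frequency (TF): the number of times the term occurs in the document, used for relevance scoring.
  - Position: the position of the term among the document's tokens, used for phrase search.
  - Offset: the start and end character offsets of the term, used for highlighting.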
3.6. Inverted index example
Document ID | Content |
---|---|
1 | Mastering Elasticsearch |
2 | Elasticsearch Server |
3 | Elasticsearch Essentials |

Posting list for the term Elasticsearch:

Document ID | TF | Position | Offset |
---|---|---|---|
1 | 1 | 1 | <10,23> |
2 | 1 | 0 | <0,13> |
3 | 1 | 0 | <0,13> |
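The posting list in this example can be reproduced with a small Python sketch. It assumes a naive tokenizer (split on word characters, then lowercase), and `build_inverted_index` is a hypothetical helper written for this illustration; Lucene's real data structures are far more compact.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to {doc_id: {"tf", "positions", "offsets"}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            posting = index[term].setdefault(
                doc_id, {"tf": 0, "positions": [], "offsets": []}
            )
            posting["tf"] += 1
            posting["positions"].append(position)
            # Character offsets: start inclusive, end exclusive.
            posting["offsets"].append((match.start(), match.end()))
    return index


docs = {
    1: "Mastering Elasticsearch",
    2: "Elasticsearch Server",
    3: "Elasticsearch Essentials",
}
index = build_inverted_index(docs)
for doc_id, posting in sorted(index["elasticsearch"].items()):
    print(doc_id, posting)
```

For document 1 the term sits at token position 1 and spans characters 10 to 23, which matches the table above.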
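3.7. Analysis
Text analysis is the process of converting full text into a sequence of terms, also called tokenization.
Elasticsearch performs analysis through an analyzer.
An analyzer consists of three parts:
- Character filter: preprocesses the raw text, for example stripping HTML tags or converting Unicode characters.
- Tokenizer: splits the text into tokens.
- Token filter: processes the resulting tokens, for example lowercasing them.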
POST /_analyze
{
"tokenizer": "standard",
"char_filter": [{
"type": "mapping",
"mappings":["- => _"]
}],
"text": ["123-345"]
}
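To make the three stages concrete, here is a rough Python imitation of such a pipeline. It is a sketch under simplifying assumptions: the regex `\w+` only approximates the standard tokenizer, the lowercase stage is added purely to show a token filter (the request above does not use one), and all function names are hypothetical.

```python
import re


def mapping_char_filter(text, mappings):
    """Character filter: replace each source string with its target."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def simple_tokenizer(text):
    """Tokenizer: keep runs of letters, digits and underscores together."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text, mappings=None):
    """Run char filter -> tokenizer -> token filter, in that order."""
    if mappings:
        text = mapping_char_filter(text, mappings)
    return lowercase_filter(simple_tokenizer(text))


print(analyze("123-345"))              # the hyphen splits the text
print(analyze("123-345", {"-": "_"}))  # mapped to "_", it stays one token
```

Because the character filter runs before the tokenizer, rewriting "-" to "_" changes where the text is split.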
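3.8. Cluster
3.8.1. Nodes
Each Elasticsearch instance is a node. Nodes fall into the following roles:
- Master-eligible node: can be elected master. The elected master maintains the cluster state and performs operations such as creating and deleting indices.
node.master: true
node.data: false
node.ingest: false
cluster.remote.connect: false
Setting node.voting_only: true makes the node take part in master elections only; it will never become the master itself.
- Data node: handles data operations such as CRUD, full-text search, and aggregations.
node.master: false
node.data: true
node.ingest: false
cluster.remote.connect: false
- Ingest node: runs ingest pipelines.
node.master: false
node.data: false
node.ingest: true
cluster.remote.connect: false
- Coordinating node: routes requests to other nodes and merges their results.
node.master: false
node.data: false
node.ingest: false
cluster.remote.connect: false
- Machine learning node: runs machine learning jobs and handles machine learning requests.
node.ml: true
xpack.ml.enabled: true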
4. Index API
4.1. Create an index
PUT /page_view_info
{
"settings": {
"number_of_shards": 2
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"companyId": {
"type": "integer"
},
"landingPageId": {
"type": "integer"
},
"formId": {
"type": "integer"
},
"ip": {
"type": "keyword"
},
"province": {
"type": "keyword"
},
"city": {
"type": "keyword"
},
"location": {
"type": "keyword"
},
"ua": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"device": {
"type": "keyword"
},
"os": {
"type": "keyword"
},
"browser": {
"type": "keyword"
},
"uid": {
"type": "keyword"
},
"sid": {
"type": "keyword"
},
"pid": {
"type": "keyword"
},
"referrer": {
"type": "keyword"
},
"origin": {
"type": "keyword"
},
"osType": {
"type": "keyword"
},
"browserType": {
"type": "keyword"
},
"networkType": {
"type": "keyword"
},
"medium": {
"type": "keyword"
},
"channel": {
"type": "keyword"
},
"priceRange": {
"type": "keyword"
},
"screenSize": {
"type": "keyword"
},
"screenDpi": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"longitude": {
"type": "keyword"
},
"latitude": {
"type": "keyword"
},
"url": {
"type": "keyword"
},
"lengthOfStay": {
"type": "double"
},
"accessDepth": {
"type": "double"
},
"fill": {
"type": "byte"
},
"submitDataType": {
"type": "byte"
},
"submitDataId": {
"type": "integer"
},
"content": {
"type": "keyword"
},
"fillContent": {
"type": "keyword"
},
"fillAt": {
"type": "date"
},
"year": {
"type": "short"
},
"month": {
"type": "byte"
},
"day": {
"type": "byte"
},
"hour": {
"type": "byte"
},
"minute": {
"type": "byte"
},
"createdAt": {
"type": "date"
},
"updatedAt": {
"type": "date"
},
"auditable": {
"type": "boolean"
},
"clickId": {
"type": "keyword"
},
"telgetIsAutoTel": {
"type": "boolean"
},
"telgetType": {
"type": "byte"
},
"submitDataExt": {
"type": "object"
}
}
}
}
4.2. Get index details
GET /page_view_info
4.3. List all indices
GET /_cat/indices
4.4. Delete an index
DELETE /page_view_info
4.5. Create an index template
PUT /_template/page_view_info_template
{
"index_patterns": ["page_view_info_*"],
"aliases": {
"page_view_info": {}
},
"settings": {
"number_of_shards": 2
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"companyId": {
"type": "integer"
},
"landingPageId": {
"type": "integer"
},
"formId": {
"type": "integer"
},
"ip": {
"type": "keyword"
},
"province": {
"type": "keyword"
},
"city": {
"type": "keyword"
},
"location": {
"type": "keyword"
},
"ua": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"device": {
"type": "keyword"
},
"os": {
"type": "keyword"
},
"browser": {
"type": "keyword"
},
"uid": {
"type": "keyword"
},
"sid": {
"type": "keyword"
},
"pid": {
"type": "keyword"
},
"referrer": {
"type": "keyword"
},
"origin": {
"type": "keyword"
},
"osType": {
"type": "keyword"
},
"browserType": {
"type": "keyword"
},
"networkType": {
"type": "keyword"
},
"medium": {
"type": "keyword"
},
"channel": {
"type": "keyword"
},
"priceRange": {
"type": "keyword"
},
"screenSize": {
"type": "keyword"
},
"screenDpi": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"longitude": {
"type": "keyword"
},
"latitude": {
"type": "keyword"
},
"url": {
"type": "keyword"
},
"lengthOfStay": {
"type": "double"
},
"accessDepth": {
"type": "double"
},
"fill": {
"type": "byte"
},
"submitDataType": {
"type": "byte"
},
"submitDataId": {
"type": "integer"
},
"content": {
"type": "keyword"
},
"fillContent": {
"type": "keyword"
},
"fillAt": {
"type": "date"
},
"year": {
"type": "short"
},
"month": {
"type": "byte"
},
"day": {
"type": "byte"
},
"hour": {
"type": "byte"
},
"minute": {
"type": "byte"
},
"createdAt": {
"type": "date"
},
"updatedAt": {
"type": "date"
},
"auditable": {
"type": "boolean"
},
"clickId": {
"type": "keyword"
},
"telgetIsAutoTel": {
"type": "boolean"
},
"telgetType": {
"type": "byte"
},
"submitDataExt": {
"type": "object"
}
}
}
}
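A template is applied to a newly created index when the index name matches one of the template's patterns. The following Python sketch approximates that matching with `fnmatchcase` from the standard library; this is an assumption made for illustration (Elasticsearch patterns only use `*`, while `fnmatch` also understands `?` and `[...]`), and `template_applies` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def template_applies(index_name, index_patterns):
    """Return True if the index name matches any of the wildcard patterns."""
    return any(fnmatchcase(index_name, pattern) for pattern in index_patterns)


patterns = ["page_view_info_*"]
for name in ["page_view_info_2020_01", "page_view_info", "users"]:
    print(name, template_applies(name, patterns))
```

Only names that continue after the `page_view_info_` prefix match, so an index named exactly `page_view_info` would not pick up the template.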
4.6. Get an index template
GET /_template/page_view_info_template
4.7. List all index templates
GET /_cat/templates
4.8. Delete an index template
DELETE /_template/page_view_info_template
5. Document API
5.1. Search documents
GET /users/_search
5.2. Create a document
- POST request: Elasticsearch generates the ID automatically
POST /users/_doc
{
"username": "Joan",
"age": 11
}
- PUT request: the client specifies the ID
PUT /users/_create/1
{
"age": 33
}
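Sending the same request a second time fails with a version conflict, because a document with that ID already exists.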
5.3. Index a document
PUT /users/_doc/1
{
"username": "Alex",
"age": 33
}
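Running the request a second time deletes the existing document and recreates it from the request body; the _version field is incremented by one.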
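The difference between creating and indexing a document can be mimicked with a toy in-memory store. This is only a model of the behaviour described above (create refuses to overwrite, index replaces and bumps the version); `ToyIndex` is a hypothetical class and says nothing about how Elasticsearch stores data.

```python
class ToyIndex:
    """In-memory stand-in for a single index, keyed by document ID."""

    def __init__(self):
        self.docs = {}  # _id -> {"_version": int, "_source": dict}

    def create(self, doc_id, source):
        """Like PUT /<index>/_create/<id>: fail if the ID already exists."""
        if doc_id in self.docs:
            raise KeyError(f"version conflict: document {doc_id} already exists")
        self.docs[doc_id] = {"_version": 1, "_source": dict(source)}
        return self.docs[doc_id]

    def index(self, doc_id, source):
        """Like PUT /<index>/_doc/<id>: replace the document, bump _version."""
        previous = self.docs.get(doc_id)
        version = previous["_version"] + 1 if previous else 1
        self.docs[doc_id] = {"_version": version, "_source": dict(source)}
        return self.docs[doc_id]


users = ToyIndex()
users.create("1", {"age": 33})
result = users.index("1", {"username": "Alex", "age": 33})
print(result)
```

Note that index replaces the whole source: fields absent from the new request body do not survive.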
5.4. Get a document
GET /users/_doc/1
5.5. Update a document
POST /users/_update/1
{
"doc": {
"age":111,
"username": "Bob"
}
}
5.6. Delete a document
DELETE /users/_doc/1
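5.7. Bulk create/update/delete/index
The bulk API performs multiple create, index, update, and delete operations in a single request.
A failure of one operation does not affect the others, and the response reports the result of every operation.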
POST /_bulk
{"create":{"_index":"users","_id":11}}
{"username":"Fatman"}
{"update":{"_index":"users","_id":11}}
{"doc":{"username":"Hollis"}}
{"index":{"_index":"users","_id":11}}
{"username":"Fatman"}
{"delete":{"_index":"users","_id":11}}
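The bulk body is newline-delimited JSON: one action line, followed by a payload line for every action except delete, and the body must end with a newline. A minimal Python sketch of assembling such a body is shown below; `build_bulk_body` and its tuple format are hypothetical conveniences, not part of any client library.

```python
import json


def build_bulk_body(operations):
    """Build an NDJSON bulk body.

    operations: list of (action, metadata, payload) tuples, where payload is
    None for "delete" actions.
    """
    lines = []
    for action, metadata, payload in operations:
        lines.append(json.dumps({action: metadata}))
        if action != "delete":
            lines.append(json.dumps(payload))
    # The bulk endpoint expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


target = {"_index": "users", "_id": 11}
body = build_bulk_body([
    ("create", target, {"username": "Fatman"}),
    ("update", target, {"doc": {"username": "Hollis"}}),
    ("index", target, {"username": "Fatman"}),
    ("delete", target, None),
])
print(body, end="")
```

When sending the body with an HTTP client, the Content-Type header should be `application/x-ndjson`.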
5.8. Multi-get
GET /_mget
{
"docs": [
{
"_index": "users",
"_id": 1
},
{
"_index": "users",
"_id": 2
}
]
}
6. URI search
GET /<index>/_search
6.1. Parameters
- q: the query string
- df: the default field to search; if omitted, all fields are searched
- sort: sort order, in the form <field>:[asc|desc]
# Documents in which any field contains "Mac"
GET /page_view_info/_search?q=Mac
# Documents whose ua field contains "Mac"
GET /page_view_info/_search?q=ua:Mac
GET /page_view_info/_search?df=ua&q=Mac
# Documents whose ua contains "Mac", or in which any other field contains "Firefox"
GET /page_view_info/_search?q=ua:Mac Firefox
# Documents whose ua contains "Mac" or "Firefox"
GET /page_view_info/_search?q=ua:(Mac Firefox)
# Documents whose ua contains the phrase "Mac Firefox"
GET /page_view_info/_search?q=ua:"Mac Firefox"
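6.2. Query operators
6.2.1. Boolean operators
- AND, OR, NOT (must be uppercase)
- + (the term must be present) and - (the term must not be present)
# Documents whose ua contains both "Mac" and "Firefox"
GET /page_view_info/_search?q=ua:(Mac AND Firefox)
GET /page_view_info/_search?q=ua:(+Mac AND +Firefox)
# Documents whose ua contains "Mac" but not "OS"
GET /page_view_info/_search?q=ua:(Mac NOT OS)
GET /page_view_info/_search?q=ua:(+Mac -OS)
6.2.2. Range and wildcard operators
# Documents with year greater than 2018
GET /page_view_info/_search?q=year:>2018
# Documents whose ua field has a term starting with "OP"
GET /page_view_info/_search?q=ua:OP*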
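The requests in this section are written the way they would be typed into a console such as Kibana Dev Tools. When the same searches are sent from a generic HTTP client, the q value usually has to be percent-encoded; in particular, a literal `+` in a query string is commonly decoded as a space. The sketch below uses only the Python standard library, and `search_path` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode


def search_path(index, q, **params):
    """Build a URI-search path with a safely encoded query string."""
    query = urlencode({"q": q, **params})
    return f"/{index}/_search?{query}"


path = search_path("page_view_info", "ua:(+Mac -OS)", sort="id:desc")
print(path)

# Decoding the query string recovers the original expression unchanged.
decoded = parse_qs(path.split("?", 1)[1])
print(decoded["q"][0])
```

`urlencode` turns the `+` into `%2B` and the space into `+`, so the must/must-not operators survive the round trip.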
7. Query DSL
GET /page_view_info/_search
{
"sort": {"id": "desc"},
"_source": ["id", "pid", "createdAt"],
"from": 0,
"size": 20,
"query": {"match_all": {}}
}
7.1. query_string
# Documents whose ua or os field contains "Mac"
GET /page_view_info/_search
{
"query": {
"query_string": {
"query": "Mac",
"fields": ["ua", "os"]
}
}
}
7.2. match
# Documents whose ua contains both "Mac" and "Firefox"
GET /page_view_info/_search
{
"query": {
"match": {
"ua": {
"query": "Mac Firefox",
"operator": "and"
}
}
}
}
7.3. match_phrase
# Documents whose ua contains the phrase "Mac Firefox"
GET /page_view_info/_search
{
"query": {
"match_phrase": {
"ua": {
"query": "Mac Firefox"
}
}
}
}
7.4. bool
# Query with several filter conditions
GET /page_view_info/_search
{
"size": 1,
"query": {
"bool": {
"filter": [
{
"term": {
"companyId": {
"value": 759
}
}
},
{
"term": {
"auditable": {
"value": true
}
}
},
{
"range": {
"createdAt": {
"gte": "2020-01-06T00:00:00.000Z",
"lte": "2020-01-06T23:59:59.000Z"
}
}
},
{
"script": {
"script": {
"source": "if (doc['city'].size() == 0) { return false; } String city = doc['city'].value; return city != '' && city != 'Unknown';",
"lang": "painless"
}
}
}
]
}
}
}
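Bodies like this one are often assembled in application code rather than written by hand. The following Python sketch builds the non-script part of the filter as plain dictionaries; the helper names `term`, `date_range` and `bool_filter` are hypothetical, and the range is expressed with `gte`/`lte`, which denote inclusive lower and upper bounds.

```python
import json


def term(field, value):
    """Exact-match clause on a single field."""
    return {"term": {field: {"value": value}}}


def date_range(field, gte, lte):
    """Range clause with inclusive lower and upper bounds."""
    return {"range": {field: {"gte": gte, "lte": lte}}}


def bool_filter(*clauses, size=10):
    """Search body whose clauses must all match, without scoring."""
    return {"size": size, "query": {"bool": {"filter": list(clauses)}}}


body = bool_filter(
    term("companyId", 759),
    term("auditable", True),
    date_range("createdAt", "2020-01-06T00:00:00.000Z", "2020-01-06T23:59:59.000Z"),
    size=1,
)
print(json.dumps(body, indent=2))
```

Clauses under `filter` must all match and do not contribute to `_score`.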
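8. Aggregations
Elasticsearch aggregations fall into four categories:
- Metric: computes statistics or other numeric values over document fields.
- Bucket: groups documents by some criterion.
- Pipeline: aggregates over the output of other aggregations.
- Matrix: operates on multiple fields and returns the result as a matrix.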
8.1. Metric
8.1.1. avg
Average
GET /page_view_info/_search?size=0
{
"aggs": {
"fillRate": {
"avg": {
"field": "fill"
}
}
}
}
8.1.2. weighted_avg
Weighted average: $\frac{\sum (weight \times value)}{\sum weight}$
GET /page_view_info/_search?size=0
{
"size": 0,
"aggs": {
"weightAvgST": {
"weighted_avg": {
"value": {
"field": "lengthOfStay"
},
"weight": {
"field": "fill"
}
}
}
}
}
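As a worked example of the formula, the Python sketch below computes the same quantity over a handful of made-up `(lengthOfStay, fill)` pairs. It assumes, for illustration only, that `fill` is a 0/1 flag, in which case the weighted average is simply the mean stay of the page views where the form was filled. `weighted_avg` is a hypothetical helper and the data is invented.

```python
def weighted_avg(pairs):
    """Return sum(weight * value) / sum(weight) for (value, weight) pairs."""
    pairs = list(pairs)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return None  # undefined when no document carries any weight
    return sum(value * weight for value, weight in pairs) / total_weight


# (lengthOfStay, fill) for three hypothetical page views
views = [(10.0, 1), (30.0, 0), (50.0, 1)]
print(weighted_avg(views))  # (10*1 + 30*0 + 50*1) / (1 + 0 + 1) = 30.0
```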
8.1.3. cardinality
Distinct count (approximate)
GET /page_view_info/_search?size=0
{
"aggs": {
"uv": {
"cardinality": {
"field": "uid"
}
}
}
}
8.1.4. stats
Summary statistics: average, maximum, minimum, sum, and count.
GET /page_view_info/_search?size=0
{
"aggs": {
"statsST": {
"stats": {
"field": "lengthOfStay"
}
}
}
}
8.1.5. extended_stats
Extended statistics: average, maximum, minimum, standard deviation, variance, sum of squares, and more.
GET /page_view_info/_search?size=0
{
"aggs": {
"extendedStatsST": {
"extended_stats": {
"field": "lengthOfStay"
}
}
}
}
8.1.6. max
Maximum
GET /page_view_info/_search?size=0
{
"aggs": {
"maxST": {
"max": {
"field": "lengthOfStay"
}
}
}
}
8.1.7. min
Minimum
GET /page_view_info/_search?size=0
{
"aggs": {
"minST": {
"min": {
"field": "lengthOfStay"
}
}
}
}
8.1.8. sum
Sum
GET /page_view_info/_search?size=0
{
"aggs": {
"fillAmount": {
"sum": {
"field": "fill"
}
}
}
}
8.1.9. value_count
Count of values
GET /page_view_info/_search?size=0
{
"aggs": {
"pv": {
"value_count": {
"field": "id"
}
}
}
}
8.1.10. percentiles
Percentiles
GET /page_view_info/_search?size=0
{
"aggs": {
"stPercentiles": {
"percentiles": {
"field": "lengthOfStay"
}
}
}
}
8.1.11. percentile_ranks
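The percentile rank of given values.
Example: the percentage of page views whose lengthOfStay is at most 2 seconds, and at most 120 seconds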
GET /page_view_info/_search?size=0
{
"aggs": {
"stPercentileRanks": {
"percentile_ranks": {
"field": "lengthOfStay",
"values": [2, 120]
}
}
}
}
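A percentile rank answers the question "what percentage of the observed values are at or below this threshold?". The Python sketch below computes it exactly on a small invented sample; Elasticsearch computes percentile aggregations approximately, so results on large data sets may differ slightly. `percentile_rank` is a hypothetical helper.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to the threshold."""
    if not values:
        return None
    at_or_below = sum(1 for value in values if value <= threshold)
    return 100.0 * at_or_below / len(values)


# Hypothetical lengthOfStay samples, in seconds
stays = [0.5, 1, 1.5, 3, 45, 90, 130, 200]
print(percentile_rank(stays, 2))    # 3 of 8 values -> 37.5
print(percentile_rank(stays, 120))  # 6 of 8 values -> 75.0
```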
8.1.12. top_hits
Sorts the documents within each bucket and returns the top few.
Example: the 10 most recent page views of each landing page
GET /page_view_info/_search?size=0
{
"size": 0,
"aggs": {
"company": {
"terms": {
"field": "landingPageId"
},
"aggs": {
"topPV": {
"top_hits": {
"_source": ["pid"],
"sort": [{
"createdAt": {
"order": "desc"
}
}],
"size": 10
}
}
}
}
}
}
8.2. Bucket
8.2.1. terms
Groups documents by the value of a field.
GET /page_view_info/_search?size=0
{
"size": 0,
"aggs": {
"company": {
"terms": {
"field": "landingPageId"
}
}
}
}
8.2.2. range
Groups documents into ranges.
# Group lengthOfStay into [<2, 2-60, 60-120, >=120]
GET /page_view_info/_search?size=0
{
"size":0,
"aggs": {
"st": {
"range": {
"field": "lengthOfStay",
"ranges": [
{
"to": 2
},
{
"from": 2,
"to": 60
},
{
"from": 60,
"to": 120
},
{
"from": 120
}
]
}
}
}
}
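In a range aggregation each bucket includes its `from` value and excludes its `to` value, so a document with lengthOfStay of exactly 60 lands in the 60-120 bucket. A minimal Python sketch of that bucketing follows; `range_buckets`, its key format, and the sample values are assumptions made for illustration.

```python
def range_buckets(values, ranges):
    """Count values per (from, to) range; from is inclusive, to is exclusive.

    None stands for an open end.
    """
    counts = {}
    for lower, upper in ranges:
        key = f"{'*' if lower is None else lower}-{'*' if upper is None else upper}"
        counts[key] = sum(
            1
            for value in values
            if (lower is None or value >= lower) and (upper is None or value < upper)
        )
    return counts


stays = [1, 2, 59.9, 60, 120, 500]
ranges = [(None, 2), (2, 60), (60, 120), (120, None)]
print(range_buckets(stays, ranges))
```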
8.2.3. histogram
Groups documents into buckets of a fixed interval.
# Histogram of accessDepth
GET /page_view_info/_search?size=0
{
"size":0,
"aggs": {
"st": {
"histogram": {
"field": "accessDepth",
"interval": 10
}
}
}
}
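Each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval: `key = floor((value - offset) / interval) * interval + offset`. The Python sketch below applies that rule to a few invented accessDepth values; `histogram` is a hypothetical helper, and it only lists non-empty buckets.

```python
import math
from collections import Counter


def histogram(values, interval, offset=0):
    """Count values per bucket key, rounding each value down to its bucket."""
    keys = (
        math.floor((value - offset) / interval) * interval + offset
        for value in values
    )
    return dict(sorted(Counter(keys).items()))


depths = [1, 9.5, 10, 25, 29]
print(histogram(depths, interval=10))  # {0: 2, 10: 1, 20: 2}
```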
8.2.4. filter
Applies a filter condition to a single aggregation.
GET /sales/_search?size=0
{
"aggs" : {
"t_shirts" : {
"filter" : { "term": { "type": "t-shirt" } },
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
}
8.2.5. filters
Groups documents by named filter conditions.
# Count the page views of companies 759 and 790 separately
GET /page_view_info/_search?size=0
{
"aggs": {
"landingPages": {
"filters": {
"filters": {
"c759": {
"bool": {
"filter": {
"term": {
"companyId": {
"value": 759
}
}
}
}
},
"c790": {
"bool": {
"filter": {
"term": {
"companyId": {
"value": 790
}
}
}
}
}
}
}
}
}
}
8.2.6. global
Makes the aggregation ignore the conditions of the search query.
# Count the page views of company 759 and of all companies
GET /page_view_info/_search?size=0
{
"query": {
"bool": {
"must": [
{
"term": {
"companyId": {
"value": 759
}
}
},
{
"range": {
"createdAt": {
"gte": "now/d-60d"
}
}
}
]
}
},
"aggs": {
"pv": {
"value_count": {
"field": "pid"
}
},
"all_pv": {
"global": {},
"aggs": {
"all_pv": {
"value_count": {
"field": "pid"
}
}
}
}
}
}
8.2.7. missing
Counts the documents that have no value for the given field.
# Count the documents without a clickId
GET /page_view_info/_search?size=0
{
"aggs": {
"missPV": {
"missing": {
"field": "clickId"
}
}
}
}
8.3. Pipeline
8.3.1. max_bucket
# Find the companyId with the most page views
GET /page_view_info/_search?size=0
{
"size": 0,
"aggs": {
"company": {
"terms": {
"field": "companyId"
},
"aggs": {
"pv": {
"cardinality": {
"field": "id"
}
}
}
},
"maxPvCompany": {
"max_bucket": {
"buckets_path": "company>pv"
}
}
}
}
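Conceptually, max_bucket runs after the sibling terms aggregation has produced its buckets and then picks the bucket or buckets with the largest metric value. The Python sketch below imitates that second step on invented data; `max_bucket` here is a hypothetical helper, and the bucket dictionaries are a simplified stand-in for the real response format.

```python
def max_bucket(buckets, metric):
    """Return the largest metric value and the keys of all buckets that reach it."""
    if not buckets:
        return {"value": None, "keys": []}
    best = max(bucket[metric] for bucket in buckets)
    keys = [bucket["key"] for bucket in buckets if bucket[metric] == best]
    return {"value": best, "keys": keys}


# Hypothetical output of the "company" terms aggregation with its "pv" metric
company_buckets = [
    {"key": 759, "pv": 120},
    {"key": 790, "pv": 300},
    {"key": 801, "pv": 300},
]
print(max_bucket(company_buckets, "pv"))
```

Returning a list of keys covers the case where several buckets tie for the maximum.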