Deploy Elasticsearch based on memory storage - 100 million+ pieces of data, full-text search 100ms response-AI-php.cn

1. Mount the memory storage directory on the host

Create a directory for mounting

mkdir /mnt/memory_storage

Copy after login

Mount the tmpfs file system

mount -t tmpfs -o size=800G tmpfs /mnt/memory_storage

Copy after login

The storage space will be used on demand, that is, 100G of memory will be occupied when 100G storage is used. There is 2T memory on the host node, and 800G memory is allocated here to store Elasticsearch data.

Create the directory in advance

mkdir /mnt/memory_storage/elasticsearch-data-es-jfs-prod-es-default-0mkdir /mnt/memory_storage/elasticsearch-data-es-jfs-prod-es-default-1mkdir /mnt/memory_storage/elasticsearch-data-es-jfs-prod-es-default-2

Copy after login

If the directory is not created in advance and given read and write permissions, the Elasticsearch component will not be able to start. Prompt that multiple nodes use the same data directory.

Configure directory permissions

chmod -R 777 /mnt/memory_storage

Copy after login

DD Test IO bandwidth

dd if=/dev/zero of=/mnt/memory_storage/dd.txt bs=4M count=25002500+0 records in2500+0 records out10485760000 bytes (10 GB, 9.8 GiB) copied, 3.53769 s, 3.0 GB/s

Copy after login

Clean files

rm -rf /mnt/memory_storage/dd.txt

Copy after login

FIO test IO bandwidth

fio --name=test --filename=/mnt/memory_storage/fio_test_file --size=10G --rw=write --bs=4M --numjobs=1 --runtime=60 --time_basedRun status group 0 (all jobs):WRITE: bw=2942MiB/s (3085MB/s), 2942MiB/s-2942MiB/s (3085MB/s-3085MB/s), io=172GiB (185GB), run=60001-60001msec

Copy after login

Clean files

rm -rf /mnt/memory_storage/fio_test_file

Copy after login

Test the memory IO bandwidth

mbw 10000Long uses 8 bytes. Allocating 2*1310720000 elements = 20971520000 bytes of memory.Using 262144 bytes as blocks for memcpy block copy test.Getting down to business... Doing 10 runs per test.0 Method: MEMCPY Elapsed: 1.62143 MiB: 10000.00000 Copy: 6167.380 MiB/s1 Method: MEMCPY Elapsed: 1.63542 MiB: 10000.00000 Copy: 6114.656 MiB/s2 Method: MEMCPY Elapsed: 1.63345 MiB: 10000.00000 Copy: 6121.997 MiB/s3 Method: MEMCPY Elapsed: 1.63715 MiB: 10000.00000 Copy: 6108.161 MiB/s4 Method: MEMCPY Elapsed: 1.64429 MiB: 10000.00000 Copy: 6081.667 MiB/s5 Method: MEMCPY Elapsed: 1.62772 MiB: 10000.00000 Copy: 6143.574 MiB/s6 Method: MEMCPY Elapsed: 1.60684 MiB: 10000.00000 Copy: 6223.379 MiB/s7 Method: MEMCPY Elapsed: 1.62499 MiB: 10000.00000 Copy: 6153.876 MiB/s8 Method: MEMCPY Elapsed: 1.63967 MiB: 10000.00000 Copy: 6098.770 MiB/s9 Method: MEMCPY Elapsed: 2.97213 MiB: 10000.00000 Copy: 3364.588 MiB/sAVG Method: MEMCPY Elapsed: 1.76431 MiB: 10000.00000 Copy: 5667.937 MiB/s0 Method: DUMB Elapsed: 1.01521 MiB: 10000.00000 Copy: 9850.140 MiB/s1 Method: DUMB Elapsed: 0.85378 MiB: 10000.00000 Copy: 11712.605 MiB/s2 Method: DUMB Elapsed: 0.82487 MiB: 10000.00000 Copy: 12123.167 MiB/s3 Method: DUMB Elapsed: 0.84520 MiB: 10000.00000 Copy: 11831.463 MiB/s4 Method: DUMB Elapsed: 0.83050 MiB: 10000.00000 Copy: 12040.968 MiB/s5 Method: DUMB Elapsed: 0.84932 MiB: 10000.00000 Copy: 11774.194 MiB/s6 Method: DUMB Elapsed: 0.82491 MiB: 10000.00000 Copy: 12122.505 MiB/s7 Method: DUMB Elapsed: 1.44235 MiB: 10000.00000 Copy: 6933.144 MiB/s8 Method: DUMB Elapsed: 2.68656 MiB: 10000.00000 Copy: 3722.225 MiB/s9 Method: DUMB Elapsed: 8.44667 MiB: 10000.00000 Copy: 1183.898 MiB/sAVG Method: DUMB Elapsed: 1.86194 MiB: 10000.00000 Copy: 5370.750 MiB/s0 Method: MCBLOCK Elapsed: 4.52486 MiB: 10000.00000 Copy: 2210.013 MiB/s1 Method: MCBLOCK Elapsed: 4.82467 MiB: 10000.00000 Copy: 2072.683 MiB/s2 Method: MCBLOCK Elapsed: 0.84797 MiB: 10000.00000 Copy: 11792.870 MiB/s3 Method: MCBLOCK Elapsed: 0.84980 MiB: 10000.00000 Copy: 11767.516 MiB/s4 Method: MCBLOCK Elapsed: 0.87665 MiB: 10000.00000 Copy: 11407.113 MiB/s5 Method: MCBLOCK Elapsed: 0.85952 MiB: 10000.00000 Copy: 11634.468 MiB/s6 Method: MCBLOCK Elapsed: 0.84132 MiB: 10000.00000 Copy: 11886.154 MiB/s7 Method: MCBLOCK Elapsed: 0.84970 MiB: 10000.00000 Copy: 11768.915 MiB/s8 Method: MCBLOCK Elapsed: 0.86918 MiB: 10000.00000 Copy: 11505.150 MiB/s9 Method: MCBLOCK Elapsed: 0.85996 MiB: 10000.00000 Copy: 11628.434 MiB/sAVG Method: MCBLOCK Elapsed: 1.62036 MiB: 10000.00000 Copy: 6171.467 MiB/s

Copy after login

It seems that the memory is mounted as the IO of the file system The bandwidth can only reach half of the IO bandwidth of the memory.

2. Create PVC on Kubernetes cluster

Configure environment variables

export NAMESPACE=data-centerexport PVC_NAME=elasticsearch-data-es-jfs-prod-es-default-0

Copy after login

Create PV and PVC

kubectl create -f - <<EOFapiVersion: v1kind: PersistentVolumemetadata:name: ${PVC_NAME}namespace: ${NAMESPACE}spec:accessModes:- ReadWriteManycapacity:storage: 800GihostPath:path: /mnt/memory_storage/${PVC_NAME}---apiVersion: v1kind: PersistentVolumeClaimmetadata:name: ${PVC_NAME}namespace: ${NAMESPACE}spec:accessModes:- ReadWriteManyresources:requests:storage: 800GiEOF

Copy after login

Create at least 3 PVC applications by modifying the PVC_NAME variable. Finally, I created 20 PVCs, providing a total of 15+ TB of storage.

3. Deploy Elasticsearch related components

Some content is omitted here. For details, please refer to Using JuiceFS to store Elasticsearch data[1].

Deploy Elasticsearch

cat <<EOF | kubectl apply -f -apiVersion: elasticsearch.k8s.elastic.co/v1kind: Elasticsearchmetadata:namespace: $NAMESPACEname: es-jfs-prodspec:version: 8.3.2image: hubimage/elasticsearch:8.3.2http:tls:selfSignedCertificate:disabled: truenodeSets:- name: defaultcount: 3config:node.store.allow_mmap: falseindex.store.type: niofspodTemplate:spec:nodeSelector:servertype: Ascend910B-24initContainers:- name: sysctlsecurityContext:privileged: truerunAsUser: 0command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']- name: install-pluginscommand:- sh- -c- |bin/elasticsearch-plugin install --batch https://get.infini.cloud/elasticsearch/analysis-ik/8.3.2securityContext:runAsUser: 0runAsGroup: 0containers:- name: elasticsearchreadinessProbe:exec:command:- bash- -c- /mnt/elastic-internal/scripts/readiness-probe-script.shfailureThreshold: 10initialDelaySeconds: 30periodSeconds: 30successThreshold: 1timeoutSeconds: 30env:- name: "ES_JAVA_OPTS"value: "-Xms31g -Xmx31g"- name: "NSS_SDB_USE_CACHE"value: "no"resources:requests:cpu: 8memory: 64GiEOF

Copy after login

View Elasticsearch password

kubectl -n $NAMESPACE get secret es-jfs-prod-es-elastic-user -o go-template='{{.data.elastic | base64decode}}'xxx

Copy after login

Default The username is elastic

Deploy Metricbeat

kubectl apply -f - <<EOFapiVersion: beat.k8s.elastic.co/v1beta1kind: Beatmetadata:name: es-jfs-prodnamespace: $NAMESPACEspec:type: metricbeatversion: 8.3.2elasticsearchRef:name: es-jfs-prodconfig:metricbeat:autodiscover:providers:- type: kubernetesscope: clusterhints.enabled: truetemplates:- config:- module: kubernetesmetricsets:- eventperiod: 10sprocessors:- add_cloud_metadata: {}logging.json: truedeployment:podTemplate:spec:serviceAccountName: metricbeatautomountServiceAccountToken: true# required to read /etc/beat.ymlsecurityContext:runAsUser: 0EOF

Copy after login

Deploy Kibana

cat <<EOF | kubectl apply -f -apiVersion: kibana.k8s.elastic.co/v1kind: Kibanametadata:namespace: $NAMESPACEname: es-jfs-prodspec:version: 8.3.2count: 1image: hubimage/kibana:8.3.2elasticsearchRef:name: es-jfs-prodhttp:tls:selfSignedCertificate:disabled: trueEOF

Copy after login

View Elasticsearch cluster information

部署基于内存存储的 Elasticsearch - 一亿+条数据，全文检索 100ms 响应 Picture

4. Import data

Create index

Execute in the Dev Tools page of Elasticsearch Management:

PUT /bayou_tt_articles{"settings": {"index": {"number_of_shards": 30,"number_of_replicas": 1,"refresh_interval": "120s","translog.durability": "async","translog.sync_interval": "120s","translog.flush_threshold_size": "2048M"}},"mappings": {"properties": {"text": {"type": "text","analyzer": "ik_smart"}}}}

Copy after login

There are two things to note:

Keep each point The slice size is between 10-50G, here number_of_shards is set to 30, because there are a few hundred GB of data that need to be imported.
The number of replicas is at least 1 to ensure that the Pod will not lose data during rolling updates. When the IP of the Pod changes, Elasticsearch will consider it a new node and cannot reuse the previous data. If there is no copy to rebuild the shards at this time, data loss will occur.

Install the import tool

You can also use the elasticdump container to import, and there are examples below. Installed here using npm.

apt-get install npm -y

Copy after login

npm install elasticdump -g

Copy after login

Import data

export DATAPATH=./bayou_tt_articles_0.jsonlnohup elasticdump --limit 20000 --input=${DATAPATH} --output=http://elastic:xxx@x.x.x.x:31391/ --output-index=bayou_tt_articles --type=data --transform="doc._source=Object.assign({},doc)" > elasticdump-${DATAPATH}.log 2>&1 &

Copy after login

limit 表示每次导入的数据条数，默认值是 100 太小，建议在保障导入成功的前提下尽可能大一点。

查看索引速率

部署基于内存存储的 Elasticsearch - 一亿+条数据，全文检索 100ms 响应图片

索引速率达到 1w+/s，但上限远不止于此。因为，根据社区文档的压力测试结果显示，单个节点至少能提供 2W/s 的索引速率。

5. 测试与验证

全文检索性能显著提升

部署基于内存存储的 Elasticsearch - 一亿+条数据，全文检索 100ms 响应图片

上图是使用 JuiceFS 存储的全文检索速度为 18s，使用 SSD 节点的 Elasticsearch 的全文检索速度为 5s。下图是使用内存存储的 Elasticsearch 的全文检索速度为 100ms 左右。

部署基于内存存储的 Elasticsearch - 一亿+条数据，全文检索 100ms 响应图片

更新 Elasticsearch 不会丢数据

之前给 Elasticsearch Pod 分配的 CPU 和 Memory 太多，调整为 CPU 32C，Memory 64 GB。在滚动更新过程中，Elasticsearch 始终可用，并且数据没有丢失。

但务必注意设置 replicas > 1，尽量不要自行重启 Pod，虽然 Pod 是原节点更新。

能平稳实现节点的扩容

部署基于内存存储的 Elasticsearch - 一亿+条数据，全文检索 100ms 响应图片

由于业务总的 Elasticsearch 存储需求是 10T 左右，我继续增加节点到 10 个，Elasticsearch 的索引分片会自动迁移，均匀分布在这些节点上。

导出索引速度达 1w 条每秒

docker run --rm -ti elasticdump/elasticsearch-dump --limit 10000 --input=http://elastic:xxx@x.x.x.x:31391/bayou_tt_articles --output=/data/es-bayou_tt_articles-output.json --type=data

Copy after login

Wed, 29 May 2024 01:41:23 GMT | got 10000 objects from source elasticsearch (offset: 0)Wed, 29 May 2024 01:41:23 GMT | sent 10000 objects to destination file, wrote 10000Wed, 29 May 2024 01:41:24 GMT | got 10000 objects from source elasticsearch (offset: 10000)Wed, 29 May 2024 01:41:24 GMT | sent 10000 objects to destination file, wrote 10000Wed, 29 May 2024 01:41:25 GMT | got 10000 objects from source elasticsearch (offset: 20000)Wed, 29 May 2024 01:41:25 GMT | sent 10000 objects to destination file, wrote 10000Wed, 29 May 2024 01:41:25 GMT | got 10000 objects from source elasticsearch (offset: 30000)

Copy after login

导出速度能达到 1w 条每秒，一亿条数据大约需要 3h，基本也能满足索引的备份、迁移需求。

Elasticsearch 节点 Pod 更新时，不会发生漂移

更新之前的 Pod 分布节点如下：

NAME READY STATUSRESTARTSAGE IP NODE NOMINATED NODE READINESS GATESes-jfs-prod-beat-metricbeat-7fbdd657c4-djgg6 1/1 Running 6 (32m ago) 18h 10.244.54.5ascend-01 <none> <none>es-jfs-prod-es-default-0 1/1 Running 0 28m 10.244.46.82 ascend-07 <none> <none>es-jfs-prod-es-default-1 1/1 Running 0 29m 10.244.23.77 ascend-53 <none> <none>es-jfs-prod-es-default-2 1/1 Running 0 31m 10.244.49.65 ascend-20 <none> <none>es-jfs-prod-es-default-3 1/1 Running 0 32m 10.244.54.14 ascend-01 <none> <none>es-jfs-prod-es-default-4 1/1 Running 0 34m 10.244.100.239 ascend-40 <none> <none>es-jfs-prod-es-default-5 1/1 Running 0 35m 10.244.97.201ascend-39 <none> <none>es-jfs-prod-es-default-6 1/1 Running 0 37m 10.244.101.156 ascend-38 <none> <none>es-jfs-prod-es-default-7 1/1 Running 0 39m 10.244.19.101ascend-49 <none> <none>es-jfs-prod-es-default-8 1/1 Running 0 40m 10.244.16.109ascend-46 <none> <none>es-jfs-prod-es-default-9 1/1 Running 0 41m 10.244.39.119ascend-15 <none> <none>es-jfs-prod-kb-75f7bbd96-6tcrn 1/1 Running 0 18h 10.244.1.164 ascend-22 <none> <none>

Copy after login

更新之后的 Pod 分布节点如下：

NAME READY STATUSRESTARTSAGE IP NODE NOMINATED NODE READINESS GATESes-jfs-prod-beat-metricbeat-7fbdd657c4-djgg6 1/1 Running 6 (50m ago) 18h 10.244.54.5ascend-01 <none> <none>es-jfs-prod-es-default-0 1/1 Running 0 72s 10.244.46.83 ascend-07 <none> <none>es-jfs-prod-es-default-1 1/1 Running 0 2m35s 10.244.23.78 ascend-53 <none> <none>es-jfs-prod-es-default-2 1/1 Running 0 3m59s 10.244.49.66 ascend-20 <none> <none>es-jfs-prod-es-default-3 1/1 Running 0 5m34s 10.244.54.15 ascend-01 <none> <none>es-jfs-prod-es-default-4 1/1 Running 0 7m21s 10.244.100.240 ascend-40 <none> <none>es-jfs-prod-es-default-5 1/1 Running 0 8m44s 10.244.97.202ascend-39 <none> <none>es-jfs-prod-es-default-6 1/1 Running 0 10m 10.244.101.157 ascend-38 <none> <none>es-jfs-prod-es-default-7 1/1 Running 0 11m 10.244.19.102ascend-49 <none> <none>es-jfs-prod-es-default-8 1/1 Running 0 13m 10.244.16.110ascend-46 <none> <none>es-jfs-prod-es-default-9 1/1 Running 0 14m 10.244.39.120ascend-15 <none> <none>es-jfs-prod-kb-75f7bbd96-6tcrn 1/1 Running 0 18h 10.244.1.164 ascend-22 <none> <none>

Copy after login

这点打消了我的一个顾虑， Elasticsearch 的 Pod 重启时，发生了漂移，那么节点上是否会残留分片的数据，导致内存使用不断膨胀？答案是，不会。ECK Operator 似乎能让 Pod 在原节点进行重启，挂载的 Hostpath 数据依然对新的 Pod 有效，仅当主机节点发生重启时，才会丢失数据。

6. 总结

AI 的算力节点有大量空闲的 CPU 和 Memory 资源，使用这些大内存的主机节点，部署一些短生命周期的基于内存存储的高性能应用，有利于提高资源的使用效率。

本篇主要介绍了借助于 Hostpath 的内存存储部署 Elasticsearch 提供高性能查询能力的方案，具体内容如下：

将内存 mount 目录到主机上
创建基于 Hostpath 的 PVC，将数据挂载到上述目录
使用 ECK Operator 部署 Elasticsearch
Elasticsearch 更新时，数据并不会丢失，但不能同时重启多个主机节点
300+GB、一亿+条数据，全文检索响应场景中，基于 JuiceFS 存储的速度为 18s， SSD 节点的速度为 5s，内存节点的速度为 100ms

参考资料

[1]使用 JuiceFS 存储 Elasticsearch 数据: https://www.chenshaowen.com/blog/store-elasticsearch-data-in-juicefs.html

The above is the detailed content of Deploy Elasticsearch based on memory storage - 100 million+ pieces of data, full-text search 100ms response. For more information, please follow other related articles on the PHP Chinese website!