CEPH Crushmap for 2 Datacenters
How to modify the CEPH crushmap for 2 datacenters
Objective: data resilience in case a datacenter is lost (guarantee of 2 replicas per datacenter). The replica count per pool was set to 4 so that, if a datacenter is lost, rebuilding the missing replicas does not rely on a single surviving replica (as it would with replica=3 or 2), with the risk of that last OSD failing on us during resynchronization.
Proof of Concept:
2 datacenters plus one room for the monitor quorum.
Type     | DCA      | DCB      | Quorum
monitor  | node1001 | node1002 | node1003
OSD      | node1011 | node1012 |
OSD      | node1013 | node1014 |
Rados GW | node1091 | node1092 |
Admin    | node1099 |          |
Extract the original (raw) crushmap
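This can be done with ceph osd getcrushmap, for example (the output file name crushmap matches the next step):
$ ceph osd getcrushmap -o crushmap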
Decompile the crushmap file into the text file crushmap.txt
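For example, with crushtool:
$ crushtool -d crushmap -o crushmap.txt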
Contents of the original crushmap: edit crushmap.txt
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host node1011 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.1 weight 0.010
item osd.4 weight 0.010
item osd.8 weight 0.010
item osd.17 weight 0.010
item osd.19 weight 0.010
}
host node1012 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.010
item osd.6 weight 0.010
item osd.10 weight 0.010
item osd.13 weight 0.010
item osd.16 weight 0.010
}
host node1013 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.010
item osd.7 weight 0.010
item osd.11 weight 0.010
item osd.14 weight 0.010
item osd.18 weight 0.010
}
host node1014 {
id -9 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.3 weight 0.010
item osd.5 weight 0.010
item osd.12 weight 0.010
item osd.9 weight 0.010
item osd.15 weight 0.010
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 0.196
alg straw2
hash 0 # rjenkins1
item node1011 weight 0.049
item node1012 weight 0.049
item node1013 weight 0.049
item node1014 weight 0.049
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
Edit crushmap.txt to distribute a 4-replica CEPH pool across the two datacenters.
See the rule:
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type rack
With this rule, CRUSH first selects 2 datacenters, then 2 racks in each (one OSD leaf per rack), which yields 2 replicas per datacenter and 4 in total. In this configuration we also simulate that each OSD node sits in its own dedicated rack.
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host node1011 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.1 weight 0.010
item osd.4 weight 0.010
item osd.8 weight 0.010
item osd.17 weight 0.010
item osd.19 weight 0.010
}
host node1012 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.010
item osd.6 weight 0.010
item osd.10 weight 0.010
item osd.13 weight 0.010
item osd.16 weight 0.010
}
host node1013 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.010
item osd.7 weight 0.010
item osd.11 weight 0.010
item osd.14 weight 0.010
item osd.18 weight 0.010
}
host node1014 {
id -9 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.3 weight 0.010
item osd.5 weight 0.010
item osd.12 weight 0.010
item osd.9 weight 0.010
item osd.15 weight 0.010
}
rack rackA001 {
id -11 # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
# weight 4.880
alg straw2
hash 0 # rjenkins1
item node1011 weight 0.050
}
rack rackA002 {
id -13 # do not change unnecessarily
id -14 class hdd # do not change unnecessarily
# weight 4.880
alg straw2
hash 0 # rjenkins1
item node1013 weight 0.050
}
rack rackB001 {
id -15 # do not change unnecessarily
id -16 class hdd # do not change unnecessarily
# weight 4.880
alg straw2
hash 0 # rjenkins1
item node1012 weight 0.050
}
rack rackB002 {
id -17 # do not change unnecessarily
id -18 class hdd # do not change unnecessarily
# weight 4.880
alg straw2
hash 0 # rjenkins1
item node1014 weight 0.050
}
datacenter DCA {
id -19 # do not change unnecessarily
id -20 class hdd # do not change unnecessarily
# weight 9.760
alg straw2
hash 0 # rjenkins1
item rackA001 weight 0.050
item rackA002 weight 0.050
}
datacenter DCB {
id -21 # do not change unnecessarily
id -22 class hdd # do not change unnecessarily
# weight 9.760
alg straw2
hash 0 # rjenkins1
item rackB001 weight 0.050
item rackB002 weight 0.050
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 19.520
alg straw2
hash 0 # rjenkins1
item DCA weight 0.100
item DCB weight 0.100
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 2
max_size 4
step take default
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type rack
step emit
}
# end crush map
Recompile the modified crushmap.txt into a new crushmap named crushmap2DC
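For example, with crushtool:
$ crushtool -c crushmap.txt -o crushmap2DC
The compiled map can also be dry-run tested before injecting it, e.g. to check that rule 0 maps 4 replicas:
$ crushtool -i crushmap2DC --test --rule 0 --num-rep 4 --show-mappings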
Load the new crushmap crushmap2DC into the cluster
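For example:
$ ceph osd setcrushmap -i crushmap2DC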
Check the resulting OSD layout
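The tree below looks like the output of ceph osd tree:
$ ceph osd tree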
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.19998 root default
-19 0.09999 datacenter DCA
-11 0.04999 rack rackA001
-3 0.04999 host node1011
1 hdd 0.00999 osd.1 up 1.00000 1.00000
4 hdd 0.00999 osd.4 up 1.00000 1.00000
8 hdd 0.00999 osd.8 up 1.00000 1.00000
17 hdd 0.00999 osd.17 up 1.00000 1.00000
19 hdd 0.00999 osd.19 up 1.00000 1.00000
-13 0.04999 rack rackA002
-7 0.04999 host node1013
2 hdd 0.00999 osd.2 up 1.00000 1.00000
7 hdd 0.00999 osd.7 up 1.00000 1.00000
11 hdd 0.00999 osd.11 up 1.00000 1.00000
14 hdd 0.00999 osd.14 up 1.00000 1.00000
18 hdd 0.00999 osd.18 up 1.00000 1.00000
-21 0.09999 datacenter DCB
-15 0.04999 rack rackB001
-5 0.04999 host node1012
0 hdd 0.00999 osd.0 up 1.00000 1.00000
6 hdd 0.00999 osd.6 up 1.00000 1.00000
10 hdd 0.00999 osd.10 up 1.00000 1.00000
13 hdd 0.00999 osd.13 up 1.00000 1.00000
16 hdd 0.00999 osd.16 up 1.00000 1.00000
-17 0.04999 rack rackB002
-9 0.04999 host node1014
3 hdd 0.00999 osd.3 up 1.00000 1.00000
5 hdd 0.00999 osd.5 up 1.00000 1.00000
9 hdd 0.00999 osd.9 up 1.00000 1.00000
12 hdd 0.00999 osd.12 up 1.00000 1.00000
15 hdd 0.00999 osd.15 up 1.00000 1.00000
Vérifier le "rebalancing" des OSD
$ ceph health detail
Check that a pool with 4 replicas is distributed correctly
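If pool001 does not exist yet, it can be created first (the PG count of 64 is an assumption for this small PoC):
$ ceph osd pool create pool001 64 64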
$ ceph osd pool set pool001 size 4
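Optionally (not strictly required for this test), min_size can be set to 2 to match the rule, so a PG only accepts I/O with at least 2 replicas available:
$ ceph osd pool set pool001 min_size 2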
$ rbd create rbd001 --size 1024 --pool pool001 --image-feature layering
$ ceph osd pool application enable pool001 rbd
Map the RBD on a CEPH client, mount it, and write some test data to it.
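The image presumably gets mapped with rbd map, which returns the device shown below:
$ sudo rbd map rbd001 --pool pool001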
/dev/rbd0
$ sudo rbd showmapped
id pool image snap device
0 pool001 rbd001 - /dev/rbd0
$ sudo mkfs.ext4 /dev/rbd0
$ sudo mount /dev/rbd0 /mnt
$ sudo dd if=/dev/zero of=/mnt/zerofile bs=1M count=3
Check the placement of the RBD image rbd001
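The line below looks like the output of ceph osd map:
$ ceph osd map pool001 rbd001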
osdmap e328 pool 'pool001' (14) object 'rbd001' -> pg 14.f9426629 (14.29) -> up ([17,12,18,0], p17) acting ([17,12,18,0], p17)
The placement group uses OSDs 17, 12, 18 and 0, which are indeed spread across different nodes and racks: OSDs 17 and 18 sit in DCA (rackA001 and rackA002), while OSDs 12 and 0 sit in DCB (rackB002 and rackB001), i.e. 2 replicas per datacenter as intended. This can be cross-checked against the ceph osd tree output above.