Ceph crushmap for 2 datacenters
How to modify the Ceph crushmap for a 2-datacenter layout.
Goal: data resilience in case a datacenter is lost (guaranteeing 2 replicas per datacenter). The replica count per pool was set to 4 so that, if a datacenter is lost, rebuilding the lost replicas does not depend on a single surviving replica (as it would with replica=3 or 2), where one last OSD dying on us during resynchronization would mean data loss.
Proof of concept:
2 datacenters plus one room for the monitor quorum.
Type | DCA | DCB | Quorum |
monitor | node1001 | node1002 | node1003 |
OSD | node1011 | node1012 | |
OSD | node1013 | node1014 | |
Rados GW | node1091 | node1092 | |
Admin | node1099 | | |
Extract the original (raw) crushmap.
Decompile the binary crushmap file crushmap into crushmap.txt.
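A sketch of the commands typically used for these two steps (standard ceph/crushtool syntax; the file names match those used below):

```shell
ceph osd getcrushmap -o crushmap       # dump the compiled (raw) crushmap
crushtool -d crushmap -o crushmap.txt  # decompile it into an editable text file
```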
Contents of the original crushmap (edit crushmap.txt):
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host node1011 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.1 weight 0.010
item osd.4 weight 0.010
item osd.8 weight 0.010
item osd.17 weight 0.010
item osd.19 weight 0.010
}
host node1012 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.010
item osd.6 weight 0.010
item osd.10 weight 0.010
item osd.13 weight 0.010
item osd.16 weight 0.010
}
host node1013 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.010
item osd.7 weight 0.010
item osd.11 weight 0.010
item osd.14 weight 0.010
item osd.18 weight 0.010
}
host node1014 {
id -9 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.3 weight 0.010
item osd.5 weight 0.010
item osd.12 weight 0.010
item osd.9 weight 0.010
item osd.15 weight 0.010
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 0.196
alg straw2
hash 0 # rjenkins1
item node1011 weight 0.049
item node1012 weight 0.049
item node1013 weight 0.049
item node1014 weight 0.049
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
Edit crushmap.txt to spread a 4-replica Ceph pool across the two datacenters.
See the rule:
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type rack
This configuration also simulates each OSD node sitting in a dedicated rack.
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host node1011 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.1 weight 0.010
item osd.4 weight 0.010
item osd.8 weight 0.010
item osd.17 weight 0.010
item osd.19 weight 0.010
}
host node1012 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.010
item osd.6 weight 0.010
item osd.10 weight 0.010
item osd.13 weight 0.010
item osd.16 weight 0.010
}
host node1013 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.010
item osd.7 weight 0.010
item osd.11 weight 0.010
item osd.14 weight 0.010
item osd.18 weight 0.010
}
host node1014 {
id -9 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 0.049
alg straw2
hash 0 # rjenkins1
item osd.3 weight 0.010
item osd.5 weight 0.010
item osd.12 weight 0.010
item osd.9 weight 0.010
item osd.15 weight 0.010
}
rack rackA001 {
id -11 # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
# weight 4.880
alg straw2
hash 0 # rjenkins1
item node1011 weight 0.050
}
rack rackA002 {
id -13 # do not change unnecessarily
id -14 class hdd # do not change unnecessarily
# weight 4.880
alg straw2
hash 0 # rjenkins1
item node1013 weight 0.050
}
rack rackB001 {
id -15 # do not change unnecessarily
id -16 class hdd # do not change unnecessarily
# weight 4.880
alg straw2
hash 0 # rjenkins1
item node1012 weight 0.050
}
rack rackB002 {
id -17 # do not change unnecessarily
id -18 class hdd # do not change unnecessarily
# weight 4.880
alg straw2
hash 0 # rjenkins1
item node1014 weight 0.050
}
datacenter DCA {
id -19 # do not change unnecessarily
id -20 class hdd # do not change unnecessarily
# weight 9.760
alg straw2
hash 0 # rjenkins1
item rackA001 weight 0.050
item rackA002 weight 0.050
}
datacenter DCB {
id -21 # do not change unnecessarily
id -22 class hdd # do not change unnecessarily
# weight 9.760
alg straw2
hash 0 # rjenkins1
item rackB001 weight 0.050
item rackB002 weight 0.050
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 19.520
alg straw2
hash 0 # rjenkins1
item DCA weight 0.100
item DCB weight 0.100
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 2
max_size 4
step take default
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type rack
step emit
}
# end crush map
Recompile a crushmap binary crushmap2DC from the modified crushmap.txt.
Load the new crushmap crushmap2DC.
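A sketch of the compile-and-load step (same file names as above; the crushtool --test line is an optional sanity check before injecting the map into the cluster):

```shell
crushtool -c crushmap.txt -o crushmap2DC                              # recompile the edited map
crushtool -i crushmap2DC --test --rule 0 --num-rep 4 --show-mappings  # simulate placements first
ceph osd setcrushmap -i crushmap2DC                                   # load it into the cluster
```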
Check the OSD layout (ceph osd tree):
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.19998 root default
-19 0.09999 datacenter DCA
-11 0.04999 rack rackA001
-3 0.04999 host node1011
1 hdd 0.00999 osd.1 up 1.00000 1.00000
4 hdd 0.00999 osd.4 up 1.00000 1.00000
8 hdd 0.00999 osd.8 up 1.00000 1.00000
17 hdd 0.00999 osd.17 up 1.00000 1.00000
19 hdd 0.00999 osd.19 up 1.00000 1.00000
-13 0.04999 rack rackA002
-7 0.04999 host node1013
2 hdd 0.00999 osd.2 up 1.00000 1.00000
7 hdd 0.00999 osd.7 up 1.00000 1.00000
11 hdd 0.00999 osd.11 up 1.00000 1.00000
14 hdd 0.00999 osd.14 up 1.00000 1.00000
18 hdd 0.00999 osd.18 up 1.00000 1.00000
-21 0.09999 datacenter DCB
-15 0.04999 rack rackB001
-5 0.04999 host node1012
0 hdd 0.00999 osd.0 up 1.00000 1.00000
6 hdd 0.00999 osd.6 up 1.00000 1.00000
10 hdd 0.00999 osd.10 up 1.00000 1.00000
13 hdd 0.00999 osd.13 up 1.00000 1.00000
16 hdd 0.00999 osd.16 up 1.00000 1.00000
-17 0.04999 rack rackB002
-9 0.04999 host node1014
3 hdd 0.00999 osd.3 up 1.00000 1.00000
5 hdd 0.00999 osd.5 up 1.00000 1.00000
9 hdd 0.00999 osd.9 up 1.00000 1.00000
12 hdd 0.00999 osd.12 up 1.00000 1.00000
15 hdd 0.00999 osd.15 up 1.00000 1.00000
Check the OSD "rebalancing":
$ ceph health detail
Check the distribution of a 4-replica pool:
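If pool001 does not exist yet, it must be created first; the PG count below is an arbitrary PoC value, not a sizing recommendation:

```shell
ceph osd pool create pool001 64 64   # hypothetical creation: 64 PGs / PGPs
```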
$ ceph osd pool set pool001 size 4
$ rbd create rbd001 --size 1024 --pool pool001 --image-feature layering
$ ceph osd pool application enable pool001 rbd
Map the RBD on a Ceph client and write some test data to it.
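The mapping step itself is not shown in the post; assuming the client already has the admin keyring, it would typically be:

```shell
sudo rbd map rbd001 --pool pool001   # prints the device name, e.g. /dev/rbd0
```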
/dev/rbd0
$ sudo rbd showmapped
id pool image snap device
0 pool001 rbd001 - /dev/rbd0
$ sudo mkfs.ext4 /dev/rbd0
$ sudo mount /dev/rbd0 /mnt
$ sudo dd if=/dev/zero of=/mnt/zerofile bs=1M count=3
Check the placement of the RBD rbd001:
osdmap e328 pool 'pool001' (14) object 'rbd001' -> pg 14.f9426629 (14.29) -> up ([17,12,18,0], p17) acting ([17,12,18,0], p17)
The placement group uses OSDs 17, 12, 18 and 0, which are indeed on different nodes and racks. Check with the following command:
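The command that produces this kind of mapping output is presumably ceph osd map:

```shell
ceph osd map pool001 rbd001   # shows the PG and the acting set of OSDs
```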
Install a Red Hat Ceph Storage 3.0 Luminous cluster
Proof of concept: Ceph Luminous
In this post I show a quick-and-dirty install of a Red Hat Ceph Storage 3.0 "Ceph Luminous" cluster on an infrastructure based on virtual machines (not supported).
For more information, RTFM.
The thesis of Sage Weil, founder and chief architect of Ceph, is also well worth reading: weil-thesis
Unlike SUSE Enterprise Storage, which made the right choice of integrating OpenAttic to manage the Ceph cluster through a web interface or REST API, Red Hat offers nothing for that in this version.
Configuration:
Stretch Ceph cluster
2 datacenters 40 km apart with 0.5 ms latency, plus one room for the monitor quorum.
Type | DCA | DCB | Quorum |
monitor | node1001 | node1002 | node1003 |
OSD | node1011 | node1012 | |
OSD | node1013 | node1014 | |
Rados GW | node1091 | node1092 | |
Admin | node1099 | | |
OSD disk layout:
/dev/sdb journal disk, 50 GB
/dev/sdc OSD disk, 10 GB
/dev/sdd OSD disk, 10 GB
/dev/sde OSD disk, 10 GB
/dev/sdf OSD disk, 10 GB
/dev/sdg OSD disk, 10 GB
NAME HCTL TYPE VENDOR MODEL REV TRAN NAME SIZE OWNER GROUP MODE
sda 0:0:0:0 disk VMware Virtual disk 1.0 spi sda 50G root disk brw-rw----
sdb 0:0:1:0 disk VMware Virtual disk 1.0 spi sdb 50G root disk brw-rw----
sdc 0:0:2:0 disk VMware Virtual disk 1.0 spi sdc 10G root disk brw-rw----
sdd 0:0:3:0 disk VMware Virtual disk 1.0 spi sdd 10G root disk brw-rw----
sde 0:0:4:0 disk VMware Virtual disk 1.0 spi sde 10G root disk brw-rw----
sdf 0:0:5:0 disk VMware Virtual disk 1.0 spi sdf 10G root disk brw-rw----
sdg 0:0:6:0 disk VMware Virtual disk 1.0 spi sdg 10G root disk brw-rw----
sr0 2:0:0:0 rom NECVMWar VMware IDE CDR10 1.00 ata sr0 1024M root cdrom brw-rw----
OS: Red Hat 7.4
Run the following actions from the Ceph cluster admin server, node1099:
Create a cephadm user.
Share the SSH keys of the root user, then of cephadm, with all Ceph cluster servers via ssh-keygen and ssh-copy-id.
Grant sudo rights to the cephadm user.
[root@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
ssh-copy-id node10$i
done
[root@node1099 ~]# base64 /dev/urandom | tr -d "/+" | dd bs="16" count=1 status=none | xargs echo;
cI23TmyTVPGWs97f
[root@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
ssh node10$i 'groupadd -g 2000 cephadm; useradd -u 2000 -g cephadm cephadm; echo "cI23TmyTVPGWs97f" | passwd --stdin cephadm'
done
[root@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
ssh node10$i 'echo "cephadm ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers'
done
[root@node1099 ~]# su - cephadm
[cephadm@node1099 ~]# cd .ssh; ssh-keygen
[cephadm@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
ssh-copy-id node10$i
done
We use NTP: uninstall chrony, deploy /etc/ntp.conf, restart ntpd and check the time.
Copy the reference ntp.conf file to the admin server node1099, then run the following:
[root@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
scp ntp.conf node10$i:/etc/ntp.conf
ssh node10$i rpm -e chrony
ssh node10$i systemctl restart ntpd
done
[root@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
echo node10$i $(ssh node10$i date)
done
node1001 Sun Oct 7 10:31:59 CET 2018
node1002 Sun Oct 7 10:31:59 CET 2018
node1003 Sun Nov 7 10:31:59 CET 2018
node1011 Sun Nov 7 10:32:00 CET 2018
node1012 Sun Oct 7 10:32:00 CET 2018
node1013 Sun Oct 7 10:32:00 CET 2018
node1014 Sun Oct 7 10:32:00 CET 2018
node1091 Sun Oct 7 10:32:00 CET 2018
node1092 Sun Oct 7 10:32:00 CET 2018
node1099 Sun Oct 7 10:32:00 CET 2018
SATELLITE 6 registration
An explanation of the Satellite server... to come.
Register the Ceph cluster systems against the activation key giving access to the content view that contains the Ceph packages recommended by Red Hat (RTFM): AK-RHEL7-CEPH
[root@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
ssh node10$i rpm -Uvh http://satellite6/pub/katello-ca-consumer-latest.noarch.rpm
ssh node10$i subscription-manager unregister
ssh node10$i subscription-manager register --org 'BusyBox' --name node10$i.domain.com --activationkey 'AK-RHEL7-CEPH' --serverurl=https://satellite6.domain.com:443/rhsm --baseurl=https://satellite6.domain.com/pulp/repos --force
ssh node10$i yum -y install katello-agent
done
[root@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
ssh node10$i yum repolist
done
Output:
repo id repo name status
!rhel-7-server-extras-rpms/x86_64 Red Hat Enterprise Linux 7 Server - Extras (RPMs) 778
!rhel-7-server-optional-rpms/x86_64 Red Hat Enterprise Linux 7 Server - Optional (RPMs) 13,444
!rhel-7-server-rhceph-1.3-calamari-rpms/x86_64 Red Hat Ceph Storage Calamari 1.3 for Red Hat Enterprise Linux 7 Server (RPMs) 20
!rhel-7-server-rhceph-1.3-installer-rpms/x86_64 Red Hat Ceph Storage Installer 1.3 for Red Hat Enterprise Linux 7 Server (RPMs) 91
!rhel-7-server-rhceph-3-mon-rpms/x86_64 Red Hat Ceph Storage MON 3 for Red Hat Enterprise Linux 7 Server (RPMs) 98
!rhel-7-server-rhceph-3-osd-rpms/x86_64 Red Hat Ceph Storage OSD 3 for Red Hat Enterprise Linux 7 Server (RPMs) 84
!rhel-7-server-rhceph-3-tools-rpms/x86_64 Red Hat Ceph Storage Tools 3 for Red Hat Enterprise Linux 7 Server (RPMs) 138
!rhel-7-server-rpms/x86_64 Red Hat Enterprise Linux 7 Server (RPMs) 18,257
!rhel-7-server-satellite-tools-6.2-rpms/x86_64 Red Hat Satellite Tools 6.2 (for RHEL 7 Server) (RPMs) 143
!rhel-server-rhscl-7-rpms/x86_64 Red Hat Software Collections RPMs for Red Hat Enterprise Linux 7 Server 9,336
repolist: 42,389
Uploading Enabled Repositories Report
Loaded plugins: product-id
Network configuration: enable IPv6
[root@node1099 ~]# for i in 01 02 03 11 12 13 14 91 92 99
do
ssh node10$i 'sysctl net.ipv6.conf.all.disable_ipv6=0; sysctl net.ipv6.conf.default.disable_ipv6=0'
done
Add these two lines to /etc/sysctl.conf so the change persists across reboots:
net.ipv6.conf.all.disable_ipv6=0
net.ipv6.conf.default.disable_ipv6=0
Ceph deployment via Ansible
Install the ceph-ansible package on the admin server node1099.
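Assuming the Red Hat Ceph Storage tools repository shown earlier provides the package, the install is simply:

```shell
sudo yum -y install ceph-ansible
```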
Configure the Ansible inventory file /etc/ansible/hosts:
[mons]
node1001
node1002
node1003
[osds]
node1011
node1012
node1013
node1014
[mgrs]
node1001
node1002
node1003
[clients]
node1099
Configure the Ansible group_vars variable files:
ln -s /usr/share/ceph-ansible/group_vars/ /etc/ansible/group_vars
cd /usr/share/ceph-ansible/
cp site.yml.sample site.yml
cd /usr/share/ceph-ansible/group_vars/
cp all.yml.sample all.yml
cp osds.yml.sample osds.yml
cp clients.yml.sample clients.yml
Edit the global variables file /usr/share/ceph-ansible/group_vars/all.yml and check its effective content.
Note: public network 10.1.1.0/24 (eth0), cluster (private) network 10.1.2.0/24 (eth1).
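To display only the active (uncommented) settings of such a file, a grep of this kind is commonly used (a sketch, not a command from the original post):

```shell
grep -Ev '^\s*(#|$)' /usr/share/ceph-ansible/group_vars/all.yml
```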
---
dummy:
fetch_directory: ~/ceph-ansible-keys
ceph_repository_type: cdn
ceph_origin: repository
ceph_repository: rhcs
ceph_rhcs_version: 3
monitor_interface: eth0
ip_version: ipv4
public_network: 10.1.1.0/24
cluster_network: 10.1.2.0/24
ceph_conf_overrides:
mon:
mon_allow_pool_delete: true
Edit the OSD variables file /usr/share/ceph-ansible/group_vars/osds.yml and check its content the same way.
Note: /dev/sd[c-g] are the OSD disks, /dev/sdb the journal disk.
---
dummy:
osd_auto_discovery: false
osd_scenario: non-collocated
devices:
- /dev/sdc
- /dev/sdd
- /dev/sde
- /dev/sdf
- /dev/sdg
dedicated_devices:
- /dev/sdb
- /dev/sdb
- /dev/sdb
- /dev/sdb
- /dev/sdb
Deploy the cluster as the cephadm user
Time: about 15 minutes
[cephadm@node1099 ~]$ mkdir ~/ceph-ansible-keys
[cephadm@node1099 ~]$ cd /usr/share/ceph-ansible
[cephadm@node1099 ceph-ansible]$ ansible-playbook site.yml
[cephadm@node1099 ceph-ansible]$ cd -
[cephadm@node1099 ~]$ sudo cp ceph-ansible-keys/260aec2e-df73-4490-94db-a25672048061/etc/ceph/ceph.client.admin.keyring /etc/ceph
Check the cluster health; it should report HEALTH_OK:
[cephadm@node1099 ~]$ sudo ceph -s
cluster:
id: 260aec2e-df73-4490-94db-a25672048061
health: HEALTH_OK
services:
mon: 3 daemons, quorum node1001,node1002,node1003
mgr: node1001(active), standbys: node1002, node1003
osd: 20 osds: 20 up, 20 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 bytes
usage: 2155 MB used, 197 GB / 199 GB avail
pgs:
To be continued: the 2-datacenter crushmap...
Shared Storage Pool node not a member of any cluster.
A customer needed to restart an old two-node VIOS Shared Storage Pool cluster, but the second node never joined the cluster and the storage pool was only active on the first node.
The vty console displayed the error: 0967-043 This node is currently not a member of any cluster.
I checked the console log, and one line oriented my diagnosis.
# alog -ot console
0 Fri Nov 25 15:59:08 CET 2016 0967-112 Subsystem not configured.
0 Fri Nov 25 16:01:07 CET 2016 0967-043 This node is currently not a member of any cluster.
0 Fri Nov 25 16:03:10 CET 2016 0967-043 This node is currently not a member of any cluster.
0 Fri Nov 25 16:03:17 CET 2016 JVMJ9VM090I Slow response to network query (150 secs), check your IP DNS configuration
I checked /etc/resolv.conf and found that the DNS entry was not reachable. I changed the DNS entry to match the first node's DNS configuration, restarted the second node, and it joined the cluster normally.
emgr failure with noclobber ksh option
A customer could not apply an emergency fix: it failed with many "file already exists" errors.
The cause was the noclobber ksh option being set in the root user's ~/.kshrc.
/usr/sbin/emgr[11177]: /tmp/emgrwork/10617068/.epkg.msg.buf.10617068: file already exists
/usr/ccs/lib/libc.a:
emgr: 0645-007 ATTENTION: /usr/sbin/fuser -xf returned an unexpected result.
/usr/sbin/emgr[11177]: /tmp/emgrwork/10617068/.global.warning.10617068: file already exists
emgr: 0645-007 ATTENTION: inc_global_warning() returned an unexpected result.
emgr: 0645-007 ATTENTION: isfopen() returned an unexpected result.
/usr/sbin/emgr[11177]: /tmp/emgrwork/10617068/.global.warning.10617068: file already exists
emgr: 0645-007 ATTENTION: inc_global_warning() returned an unexpected result.
/usr/sbin/emgr[11177]: /tmp/emgrwork/10617068/.epkg.msg.buf.10617068: file already exists
...
1 IV81303s1a INSTALL FAILURE
Workaround
Check whether the noclobber ksh option is active.
Remove the noclobber option from the root user's ~/.kshrc or ~/.profile and open a new terminal, or clear it in the current session:
set +o noclobber
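The check in the first step can be done like this (works in both ksh and bash):

```shell
# List shell options and show the current state of noclobber
set -o | grep noclobber
```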
HMC V7R7.9.0 SP3 MH01659 « ssl_error_no_cypher_overlap »
On HMC V7R7.9.0 SP3, don't apply e-fix MH01659: it contains a lot of bugs.
If you really need MH01659, apply e-fix MH01635 first (otherwise an ASM connection timeout and a blank page occur with IBM POWER5).
=> MH01659.readme.html
Note: This package includes fixes for HMC Version 7 Release 7.9.0 Service Pack 3. You can reference this package by APAR MB04044 and PTF MH01659. This image must be installed on top of HMC Version 7 Release 7.9.0 Service Pack 3 (MH01546) with MH01635 installed.
MH01659 Impact - Known Issues :
After installing PTF MH01659, when the Welcome page loads on the local console, clicking "Log on and Launch" results in the following error:
Problem loading page
An error occurred during a connection to 127.0.0.1.
Cannot communicate securely with peer: no common encryption algorithm(s).
(Error code: ssl_error_no_cypher_overlap)
This defect also impacts the Power Enterprise Pool GUI when launched remotely. Ensure remote access is enabled and the HMC is accessible remotely for management prior to installing this PTF. A fix is planned for a future PTF.
Circumvention: From the HMC home page, Log on by clicking on the "Serviceable Events" link rather than the "Log on and launch the Hardware Management Console web application" link. The "System Status" and "Attention LEDs" links can also be used. Note that the Power Enterprise Pools (PEP) task will not be available from the local console. CLI or remote GUI can be used to perform PEP tasks.
A vterm console window cannot be opened by the GUI on the local HMC console. You can use the mkvterm or vtmenu command on the local HMC console or use the GUI remotely to open a vterm. A fix is planned for a future PTF.
ASM for POWER5 servers will launch a blank white screen and eventually a "Connection timed out" error if PTF MH01635 is not installed prior to MH01644 or MH01659. The install order and supersedes lists have been updated to include PTF MH01635 prior to installing either MH01644 or MH01659.
emgr secret flag for EFIXLOCKED
A customer had a problem updating an AIX system with EFIXLOCKED on some filesets. Normally a fileset is locked by an emergency fix, and you must remove the e-fix before applying the update. For this customer there was no e-fix, nothing to remove, and /usr/emgrdata was empty.
The emgr command is a Korn shell script that contains a secret flag "-G" to unlock a fileset through its unlock_installp_pkg() function.
emt || exit 1
unlock_installp_pkg "$OPTARG" "0"; exit $?;; # secret flag
########################################################################
## Function: unlock_installp_pkg
## Parameters: <PKG> <EFIX NUM>
########################################################################
If no e-fix was installed, pass anything as the second argument $2, e.g. "unlock":
/etc/objrepos:bos.rte.libc:7.1.4.0::COMMITTED:F:libc Library:EFIXLOCKED
/usr/lib/objrepos:bos.rte.libc:7.1.4.0::COMMITTED:F:libc Library:EFIXLOCKED
# emgr -G bos.rte.libc unlock
Initializing emgr manual maintenance tools.
Explicit Lock: unlocking installp fileset bos.rte.libc.
The -D flag also enables debug mode (set -x).
AIX SCSI-2 Reservation on INFINIDAT
A customer encountered a problem with the disaster recovery plan for AIX rootvg boot on SAN (reserve_policy=single_path) on an Infinidat model F6130.
Problem: no boot disk in the SMS menu.
Workaround: from the Infinidat storage, unmap and re-map the rootvg LUN to the host.
Fix: Infinidat corrected the bug in firmware 2.2.10.12 and also added an internal script for SCSI-2 reservation.
Before the fix, the Infinidat storage handled only SCSI-3 reservations, while AIX uses SCSI-2 reservations.
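For reference, the reservation policy of the boot disk can be checked on AIX like this (hdisk0 is an example device name):

```shell
lsattr -El hdisk0 -a reserve_policy   # shows e.g. reserve_policy single_path
```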
Oracle RAC 11gR2 need multicast
After an AIX / Oracle RAC migration to a new datacenter, the DBAs had a problem starting Oracle RAC, with a network heartbeat error...
Root cause: the network team had dropped multicast between the datacenters.
2016-09-11 21:15:55.296: [ CSSD][382113536]clssnmPollingThread: node 2, orac002 (1) at 90% heartbeat fatal, removal in 2.950 seconds, seedhbimpd 1
2016-09-11 21:15:56.126: [ CSSD][388421376]clssnmvDHBValidateNCopy: node 2, orac002, has a disk HB, but no network HB, DHB has rcfg 269544449, wrtcnt, 547793, LATS 2298
Workaround: run tcpdump on the Oracle interconnect interface, grep for the multicast MACs, and send the multicast addresses to the network team. Everything worked much better afterwards.
21:43:24.678611 76:82:b2:99:ac:0b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 1202: 192.168.10.217.42424 > 224.0.0.251.42424: UDP, length 1160
21:43:24.678798 76:82:b2:99:ac:0b > 01:00:5e:00:01:00, ethertype IPv4 (0x0800), length 1202: 192.168.10.217.42424 > 230.0.1.0.42424: UDP, length 1160
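A sketch of the capture that produces the lines above (the interface name en0 is an assumption; 01:00:5e is the IPv4 multicast MAC prefix):

```shell
tcpdump -i en0 -e -n | grep '01:00:5e'   # keep only frames sent to IPv4 multicast MACs
```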
Oracle Doc ID 1212703.1
Bug 9974223 : GRID INFRASTRUCTURE NEEDS MULTICAST COMMUNICATION ON 230.0.1.0 ADDRESSES WORKING
How to check memory and cores activated via a CUoD activation code
Go to the IBM Capacity on Demand site, enter the machine type and serial number, and check the POD and MOD lines.
Ex 1: model 9117 type MMD
POD 53C1340827291F44AAF4000000040041E4 09/27/2015
AAF4 = CCIN = 4.228 GHz core
04 = 4 cores activated
MOD 2A2A7F64BEEEC606821200000032004187
8212 = Feature code = Activation of 1 GB
32 = 32 GB activated
Ex 2: model 8233
POD 80FF07034C0917FA771400000016004166 09/17/2010
7714 = Feature code = 3.0 GHz core
16 = 16 cores activated
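The field positions appear to be fixed; here is a small hypothetical helper, with offsets deduced from the two sample codes above (not from an official format specification):

```shell
# Decode a CUoD POD/MOD activation code (offsets inferred from the examples)
decode_cuod() {
  echo "resource: $(echo "$1" | cut -c17-20)"   # CCIN or feature code
  echo "quantity: $(echo "$1" | cut -c27-28)"   # cores / GB activated (decimal)
}

decode_cuod 53C1340827291F44AAF4000000040041E4
# resource: AAF4
# quantity: 04
```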
Source :
Thanks to Mr Delmas.
For CCIN references, check the IBM Knowledge Center.
For feature code references, check the IBM sales manual.
HMC Save Upgrade Data failed
If you want to upgrade the HMC and "Save Upgrade Data" fails with HSCSAVEUPGRDATA_ERROR, check whether the home directories of hscroot or other hmcsuperadmin users are filled with Virtual I/O Server ISO images. The filesystem (/mnt/upgrade) used to store the save-upgrade-data backup is too small to hold ISO images.
Fix: remove the VIOS ISO images from the HMC and relaunch the saveupgdata command.