vio_daemon consuming high memory on VIOS
This Saturday I updated two dual-VIOS pairs from 2.2.2.1 to the combined 2.2.3.1 + 2.2.3.50 + 2.2.3.52 fix pack.
One VIOS was using a lot of memory (8 GB of computational memory); svmon showed that vio_daemon was using 12 application stack segments (which is absurd). It turned out the customer had modified /etc/security/limits, setting stack, data and rss to unlimited for root. Solved by restoring the default values and rebooting the VIOS. See the IBM technote below.
Question
Why is vio_daemon consuming high memory on a PowerVM Virtual I/O Server (VIOS)?
Cause
There is a known vio_daemon memory leak in VIOS 2.2.3.0 through 2.2.3.3, fixed in 2.2.3.4 by APAR IV64508.
Answer
To check your VIOS level, as padmin, run:
$ ioslevel
If your VIOS level is 2.2.3.4 or higher, the problem may be due to values in /etc/security/limits being set to "unlimited" (-1). In particular, the "stack" setting exposes a condition where the system is allowed to pin as much stack as requested, causing vio_daemon to consume large amounts of memory.
# vi /etc/security/limits    (check the default stanza)
default:
fsize = -1
core = -1
cpu = -1
data = -1
rss = -1
stack = -1
nofiles = -1
In some cases, the issue with vio_daemon consuming high memory is noticed after a VIOS update to 2.2.3.X. However, a VIOS update will NOT change these settings. It is strongly recommended not to modify these default values as doing so is known to cause unpredictable results. Below is an example of the default values:
fsize = 2097151
core = 2097151
cpu = -1
data = 262144
rss = 65536
stack = 65536
nofiles = 2000
To correct the problem, change the settings back to the default values, then reboot the VIOS at your earliest convenience.
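As a quick check, the default stanza can be scanned for unlimited (-1) settings with awk. This is a minimal sketch, assuming the stanza layout shown above; the sample stanza is written to a temporary file so the commands can run anywhere, whereas on a real VIOS you would point awk at /etc/security/limits itself.

```shell
# Write a sample limits file (on a real system, inspect /etc/security/limits).
cat > /tmp/limits.sample <<'EOF'
default:
    fsize = -1
    core = -1
    data = -1
    rss = -1
    stack = -1
EOF
# Print every attribute in the default stanza that is set to unlimited (-1);
# stack = -1 is the setting known to drive vio_daemon memory use up.
awk '/^default:/{in_d=1; next} /^[^ \t]/{in_d=0} in_d && / -1/{print $1}' /tmp/limits.sample
```

Any attribute names printed here (fsize, core, data, rss, stack) are set to unlimited in the default stanza; note that cpu = -1 is the shipped default and is not a problem.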
Note 1
If the stack size was added to the root and/or padmin stanzas with an unlimited setting, it should be removed before rebooting the VIOS.
Note 2
If the clients are not redundant via a second VIOS, a maintenance window should be scheduled to bring the clients down before rebooting the VIOS.
SOURCE: IBM technote
Drive paths to library client taken offline when server option SANDISCOVERY set to ‘YES’
Technote (troubleshooting)
Problem(Abstract)
This message appears in the activity log of the library manager: ANR1772E The path from source
Symptom
On the library client, these messages are observed in the activity log when a library sharing session is opened to the library manager:
ANR3626W A check condition occurred during a small computer system interface (SCSI) inquiry at Fibre Channel port WWN=<wwn_number> , KEY=00, ASC=00, ASCQ=00.
ANR1786W HBAAPI not able to get adapter name.
ANR8963E Unable to find path to match the serial number defined for drive <DRIVE_NAME> in library <LIBRARY_NAME>.
ANR8873E The path from source <library_client> to destination <drive> (/dev/rmtXYZ) is taken offline.
On the library manager, you can see this corresponding message showing the path to the drive being taken offline:
Cause
The SAN discovery's query of the HBA has timed out, and the path is taken offline. This can occur in SAN environments with a large number of devices.
Diagnosing the problem
Verify that there is not an underlying hardware problem causing the drive paths to go offline.
Check the value of the SANDISCOVERYTIMEOUT option on the library clients and the library manager. The default value is 15 seconds:
Resolving the problem
If the value of the option is at or near the default of 15 seconds, increase it. For example:
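Both the check and the increase can be issued from the administrative command line of the library manager and each library client. This is a hedged sketch: the 30-second value is purely illustrative, and the exact option name should be confirmed against your server level.

```
/* Check the current value on the library manager and each library client */
query option sandiscoverytimeout
/* Raise it above the 15-second default; 30 is an illustrative value */
setopt sandiscoverytimeout 30
```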
Why Are Tapes with PRIVATE Status Not Found in QUERY VOLUME Output?
Technote (FAQ)
Question
QUERY LIBVOLUME shows tape volumes with status of PRIVATE, but the same volumes do not show up with the command: Q VOL
Why are these tapes PRIVATE?
Answer
QUERY VOLUME only returns information about volumes that belong to storage pools, but other types of volumes can hold valid data: DB backups, exports, backup sets, and remote volumes that belong to a Library Client server.
The volume history will keep a record of all volumes and you can display these other types of non-stgpool volumes with the following commands:
q volh type=dbs
q volh type=export
q volh type=backupset
q volh type=remote
If a PRIVATE volume is not part of a stgpool and does not appear in any of the above Q VOLH commands, then you can set it to scratch using the command:
In a library sharing environment, it is recommended to run AUDIT LIBRARY on the Library Client servers before changing the status of a volume to scratch on the Library Manager server.
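As an illustration, the status change on the Library Manager would look like the following, where 'mylib' and 'VOL001' are placeholder names:

```
/* On the Library Manager: return the volume to scratch */
update libvolume mylib VOL001 status=scratch
```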
Using ‘dd’ to verify Tivoli Storage Manager tape volume labels
Question
How can I use the Unix 'dd' command to verify a tape volume label?
Answer
The first step in verifying a tape volume label may be to find out the block size in use on that tape volume. This parameter is typically set from the physical tape library's console interface and varies between manufacturers, so the manufacturer's website is the best place to look.
There is a method to manually find out the block size as follows:
On most Unix systems, when the block size given is too small, the 'dd' command will output a message indicating that a read from the tape drive (and the loaded tape volume) failed, along with an insufficient-memory message. For example, on AIX:
dd: 0511-051 The read failed.
: There is not enough memory available now.
0+0 records in.
0+0 records out.
The 'if' parameter must reference a valid path to a drive that contains the volume you are seeking information about. This volume may be loaded using a utility such as tapeutil or directly from the physical library's console management. The drive should not be in use by Tivoli Storage Manager at the time the command is run and it is recommended to take the drive offline to Tivoli Storage Manager.
The 'ibs' parameter indicates the block size to use in bytes unless a 'k' is specified, in which case the parameter is read as kilobytes. A value of 32 bytes, as in the example above, is a good starting value. If this command returns a memory-related error message, the value can be doubled.
dd: 0511-051 The read failed.
: There is not enough memory available now.
0+0 records in.
0+0 records out.
The value specified for the block size is still smaller than what is actually on the volume so another error is generated. The value must be increased (each time doubling it) until no error message is reported:
0+1 records in.
0+1 records out.
Once the correct block size has been discovered, the 'dd' command should not generate a memory error when reading from the volume.
Now that the block size is known, the data on the first block of the tape can be dumped to a file:
dd bs=
For example:
0+1 records in.
0+1 records out.
Once the file '/tmp/block1.out' is written, it may be viewed in any text editor, or the cat command can be used:
VOL1200312
In this case the 'VOL1200312' is the label of the tape volume residing in the drive /dev/rmt1.
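To make the mechanics concrete, here is a self-contained walk-through that simulates the final step against a scratch file instead of a real tape device. The device path, block size and label value come from the article; everything else is an assumption so the commands can run anywhere a POSIX shell is available.

```shell
# Simulate a tape whose first block starts with the volume label.
printf 'VOL1200312' > /tmp/fake_tape
# Dump the first block to a file; on a real drive this would be
# something like: dd if=/dev/rmt1 of=/tmp/block1.out bs=<discovered_size> count=1
dd if=/tmp/fake_tape of=/tmp/block1.out bs=65536 count=1 2>/dev/null
# The label is at the very start of the block.
cat /tmp/block1.out
```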
Source: IBM Technote
NMON Visualizer
For those who use the nmon performance-collection tool developed by Nigel Griffiths, available for IBM AIX/VIOS and Linux (Power, x86, x86_64, mainframe and now ARM (Raspberry Pi)): as you know, the nmon analyser Excel file is normally used to visualize nmon collection files.
As a complement to that tool, I recommend trying NMONVisualizer, an IBM project started by Hunter Presnall, which is an excellent tool for comparing and analyzing nmon files collected from several AIX/Linux systems or VMs.
NMONVisualizer http://nmonvisualizer.github.io/nmonvisualizer/index.html
NMON
https://www.ibm.com/developerworks/aix/library/au-analyze_aix/
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power+Systems/page/nmon
Nmon analyser
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power%20Systems/page/nmon_analyser
My thanks to you all for a job very well done.
How to add a hardware error in AIX errlog
How to generate hardware errors in the AIX errlog. Useful for testing monitoring software or event handling under PowerHA.
See /usr/include/sys/errids.h for the errpt LABELs.
echo "SCSI_ERR1\nEMULATE\n1\ntexte1\ntexte2" | /usr/lib/ras/ras_logger
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
0502F666 1013162315 P H EMULATE ADAPTER ERROR
74533D1A 1013162115 U H EMULATE LOSS OF ELECTRICAL POWER
Virtual HMC for November 2015
The IBM Power Systems Hardware Management Console (HMC) virtual appliance can be used to manage any of the systems that are supported by the version 8 HMC, which includes Power Systems servers with IBM POWER6, POWER7, and POWER8 processors.
The Power Systems HMC virtual appliance offers these benefits:
Provides hardware, service, and basic virtualization management for your Power Systems servers
Offers the same functionality as the traditional HMC
Runs as a virtual machine on an x86 server virtualized either by VMware ESXi or Red Hat KVM
Source: IBM
AIX Crash always on d_map_list_tce()
Problem
AIX 6100-04-05-1015 always crashes in d_map_list_tce() with EMC Symmetrix FCP VRAID storage disks.
EMC.Symmetrix.fcp.rte:5.3.0.5
EMCpower.base:5.3.1.1
EMCpower.encryption:5.3.1.1
EMCpower.migration_enabler:5.3.1.1
EMCpower.mpx:5.3.1.1
(3)> stack
pvthread+001B00 STACK:
[041C85A0]d_map_list_tce+000A40 (0000000000000000, 0000000000000000,F1000A0380145AF0, 0000000000001020, 0000000000000001 [??])
[04255138]04255138 ()
[0425A668]efc_start+000508 (??)
[04226D58]efc_intr+0001B8 (??)
[0024280C]i_poll_soft+00012C (??)
[00242180]i_softmod+000480 ()
[0013FB44]flih_util+000250 ()
(3)> dr iar
iar : 00000000041C85A0
041C85A0 tweqi r3,0 r3=0
(3)> lke 041C85A0
ADDRESS FILE FILESIZE FLAGS MODULE NAME
1 F1000000A05F2500 041B0000 00030000 00080252 /usr/lib/drivers/pci/pci_busdd
(3)> symptom
Instruction:
PIDS/5765G6200 LVLS/610 PCSS/SPI1 MS/700 FLDS/d_map_lis VALU/e8610110 FLDS/04255138 VALU/342e6b64
Local Fix
Stop the replication between storage and DR site
IBM APAR
Not yet tested with replication
IBM Spectrum Protect formerly Tivoli Storage Manager
Beginning with Version 7.1.3, IBM Tivoli Storage Manager is now IBM Spectrum Protect™.
Beginning with Version 4.1.3, IBM Tivoli Storage FlashCopy® Manager is now IBM Spectrum Protect™ Snapshot.
Product name cross reference
Previous name | New name
IBM System Storage Archive Manager | IBM Spectrum Protect for Data Retention
Tivoli Storage FlashCopy Manager | IBM Spectrum Protect Snapshot
Tivoli Storage Manager | IBM Spectrum Protect
Tivoli Storage Manager Extended Edition | IBM Spectrum Protect Extended Edition
Tivoli Storage Manager FastBack for Workstations | IBM Spectrum Protect for Workstations
Tivoli Storage Manager FastBack for Workstations - Starter Edition | IBM Spectrum Protect for Workstations - Starter Edition
Tivoli Storage Manager for Databases | IBM Spectrum Protect for Databases
Tivoli Storage Manager for Mail | IBM Spectrum Protect for Mail
Tivoli Storage Manager for Enterprise Resource Planning | IBM Spectrum Protect for Enterprise Resource Planning
Tivoli Storage Manager for Space Management | IBM Spectrum Protect for Space Management
Tivoli Storage Manager for Storage Area Networks | IBM Spectrum Protect for SAN
Tivoli Storage Manager for Virtual Environments | IBM Spectrum Protect for Virtual Environments
Tivoli Storage Manager HSM for Windows | IBM Spectrum Protect HSM for Windows
Source: IBM Technote
Scheduled "Backup VM" processes more virtual machines than expected
Technote (troubleshooting)
Problem(Abstract)
Using an asterisk in the "objects" field of a schedule definition will override any domain.vmfull specifications in the dsm.opt file.
Symptom
Backup VM processes more virtual machines than expected.
Cause
Using an asterisk in the "Objects" line of a schedule definition tells the Tivoli Storage Manager client to backup all virtual machines thus overriding the domain.vmfull options.
Example:
Objects: *
Options: -mode=IFINCR
Environment
Windows or Linux proxy
Any 6.1.x or newer Tivoli Storage Manager server.
Resolving the problem
Remove the asterisk from the schedule definition on the Tivoli Storage Manager server. To do so, update the schedule. For Example:
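As an illustration, the update might look like the following. VMDOMAIN and VMSCHED are hypothetical domain and schedule names, and clearing OBJECTS with an empty string is an assumption to verify against your server level:

```
/* Replace the asterisk so domain.vmfull in dsm.opt takes effect again */
update schedule VMDOMAIN VMSCHED objects="" options="-mode=IFINCR"
```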
After this change is saved, restart the client scheduler service or CAD so the update is in place for the next scheduled backup.