

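I recently had two memorable debugging experiences. The first came after a node's OS was reinstalled: some system settings changed, and Terasort on Hadoop stopped running properly. The visible symptom was that, while Terasort ran, the disk mounted at /home on that node saw heavy I/O, rather than /data, the partition where Hadoop actually writes its data, and performance suffered badly. Under normal conditions the /home partition on that node sees no I/O at all. Tools such as iotop could only show that the Hadoop JVMs were generating the heavy I/O on /home. They could not explain why those JVMs were hitting /home instead of /data, or which misconfiguration was responsible. I don't know of a good tool for debugging that involves this kind of reasoning. In the end I found the bug by trial and error, repeatedly adjusting Terasort's parameters: in one test with JVM huge pages disabled, Terasort ran normally, which pointed to the HugePage settings. The root cause turned out to be that the user's group ID had changed after the reinstall, so the HugePages that had been allocated could not be used by that user's JVMs. That left the node short of memory and caused heavy swapping, which showed up as the heavy I/O on the disk behind /home.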
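In hindsight, a small check could have shortened the search. The following is only a minimal Python sketch, not the procedure I actually used: it assumes a Linux node where the JVM obtains huge pages through SysV shared memory, so that the `vm.hugetlb_shm_group` sysctl decides which group may use them. The function names `parse_meminfo` and `diagnose_hugepages` are hypothetical, and the checks are heuristics (for example, a process with the right capabilities can use huge pages without being in that group).

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {key: int}; the 'kB' suffix is dropped."""
    info = {}
    for line in text.splitlines():
        m = re.match(r"^([^:]+):\s+(\d+)", line)
        if m:
            info[m.group(1)] = int(m.group(2))
    return info


def diagnose_hugepages(meminfo_text, shm_group_gid, user_gids):
    """Return a list of human-readable warnings (heuristics, not proof).

    meminfo_text  -- contents of /proc/meminfo
    shm_group_gid -- integer read from /proc/sys/vm/hugetlb_shm_group
    user_gids     -- set of group IDs of the user that runs the JVMs
    """
    info = parse_meminfo(meminfo_text)
    findings = []
    total = info.get("HugePages_Total", 0)
    free = info.get("HugePages_Free", 0)

    # Huge pages are reserved, but the user is not in the permitted group.
    if total > 0 and shm_group_gid not in user_gids:
        findings.append(
            "user is not in vm.hugetlb_shm_group (gid %d)" % shm_group_gid
        )
    # Reserved huge pages that nobody touches are memory lost to everyone else.
    if total > 0 and free == total:
        findings.append("huge pages are reserved but none are in use")
    # Swap in use hints at memory pressure.
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    if swap_used > 0:
        findings.append("%d kB of swap in use" % swap_used)
    return findings
```

On a live node the inputs would come from reading `/proc/meminfo` and `/proc/sys/vm/hugetlb_shm_group`, and from `os.getgroups()` together with `os.getegid()` when run as the Hadoop user. All three warnings together would match the situation described above.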

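The second came while testing on a cluster: one node's CPU showed abnormally high I/O wait time. A sysbench file I/O test reproduced the problem. Since the CPU was waiting, a disk problem was the likely cause. Analyzing the node's LVM disk group with nmon showed that the /dev/sdb device was faulty: it alone was I/O-busy while the other disks in the LVM group sat idle. I then removed that disk from the LVM, rebuilt the RAID 0, and the problem went away.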
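The "one disk busy, the rest idle" pattern that nmon revealed can also be checked from two snapshots of `/proc/diskstats`. The sketch below is an illustration under assumptions, not what nmon does internally: it relies on the documented `/proc/diskstats` layout, in which the tenth statistics field is the time spent doing I/O in milliseconds, and the function names and the 80%/20% thresholds are arbitrary choices of mine.

```python
def parse_io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing I/O.

    Each /proc/diskstats line is: major, minor, name, then the statistics
    fields; the tenth statistics field (index 12 after splitting) is io_ticks.
    """
    ticks = {}
    for line in diskstats_text.splitlines():
        parts = line.split()
        if len(parts) >= 14:
            ticks[parts[2]] = int(parts[12])
    return ticks


def utilization(before_text, after_text, interval_ms):
    """Percent of the interval each device spent busy, from two snapshots."""
    before = parse_io_ticks(before_text)
    after = parse_io_ticks(after_text)
    return {
        dev: 100.0 * (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def find_outliers(util, devices, busy_pct=80.0, idle_pct=20.0):
    """Devices that are busy while every other listed device is near idle.

    The thresholds are assumed values; tune them for the workload.
    """
    outliers = []
    for dev in devices:
        others = [util[d] for d in devices if d != dev]
        if util[dev] >= busy_pct and others and max(others) <= idle_pct:
            outliers.append(dev)
    return outliers
```

To use it, read `/proc/diskstats` once, wait a known interval (for example `time.sleep(1)` while the sysbench test runs), read it again, and pass the names of the disks in the volume group as `devices`. Striped disks under the same load should show similar utilization, so a lone outlier is worth a closer look.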

How do you analyze the performance of a parallel program efficiently?

Be a scientist: Gather data. Analyze it. Especially when it comes to parallelism and scalability, there’s just no substitute for the advice to measure, measure, measure, and understand what the results mean. Putting together test harnesses and generating and analyzing numbers is work, but the work will reward you with a priceless understanding of how your code actually runs, especially on parallel hardware—an understanding you will never gain from just reading the code or in any other way. And then, at the end, you will ship high-quality parallel code not because you think it’s fast enough, but because you know under what circumstances it is and isn’t (there will always be an “isn’t”), and why.

Herb Sutter
