Understanding System and Architecture for Big Data

简介:IBM Research最近在Big Data领域有很多工作,例如我们组在4月份在10台采用POWER7处理器的P730服务器上成功地用14分钟跑完了1TB数据的排序(7月份又在10台Power7R2上用8分44秒跑完了1TB排序),这项工作已经发表为一篇IBM Research Report,欢迎大家围观,并提出宝贵意见,谢谢。

The use of Big Data underpins critical activities in all sectors of our society. Achieving the full transformative potential of Big Data in this increasingly digital world requires both new data analysis algorithms and a new class of systems to handle the dramatic data growth, the demand to integrate structured and unstructured data analytics, and the increasing computing needs of massive-scale analytics. In this paper, we discuss several Big Data research activities at IBM Research: (1) Big Data benchmarking and methodology; (2) workload optimized systems for Big Data; (3) case study of Big Data workloads on IBM Power systems. In (3), we show that preliminary infrastructure tuning results in sorting 1TB data in 14 minutes on 10 Power 730 machines running IBM InfoSphere BigInsights. Further improvement is expected, among other factors, on the new IBM PowerLinuxTM 7R2 systems.

By: Anne E. Gattiker, Fadi H. Gebara, Ahmed Gheith, H. Peter Hofstee, Damir A. Jamsek, Jian Li, Evan Speight, Ju Wei Shi, Guan Cheng Chen, Peter W. Wong

Published in: RC25281 in 2012

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

下载链接

Questions about this service can be mailed to reports@us.ibm.com .

简介:IBM Research最近在Big Data领域有很多工作,例如我们组在4月份在10台采用POWER7处理器的P730服务器上成功地用14分钟跑完了1TB数据的排序(7月份又在10台Power7R2上用8分44秒跑完了1TB排序),这项工作已经发表为一篇IBM Research Report,欢迎大家围观,并提出宝贵意见,谢谢。

The use of Big Data underpins critical activities in all sectors of our society. Achieving the full transformative potential of Big Data in this increasingly digital world requires both new data analysis algorithms and a new class of systems to handle the dramatic data growth, the demand to integrate structured and unstructured data analytics, and the increasing computing needs of massive-scale analytics. In this paper, we discuss several Big Data research activities at IBM Research: (1) Big Data benchmarking and methodology; (2) workload optimized systems for Big Data; (3) case study of Big Data workloads on IBM Power systems. In (3), we show that preliminary infrastructure tuning results in sorting 1TB data in 14 minutes on 10 Power 730 machines running IBM InfoSphere BigInsights. Further improvement is expected, among other factors, on the new IBM PowerLinuxTM 7R2 systems.

By: Anne E. Gattiker, Fadi H. Gebara, Ahmed Gheith, H. Peter Hofstee, Damir A. Jamsek, Jian Li, Evan Speight, Ju Wei Shi, Guan Cheng Chen, Peter W. Wong

Published in: RC25281 in 2012

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

Download link: http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/f085753cf57c8c35852579e90050598f!OpenDocument&Highlight=0,big,data

Questions about this service can be mailed to reports@us.ibm.com .

Facebook的Realtime Hadoop及其应用

在今年的SIGMOD‘11上,Facebook又发了一篇新paper(点此下载),讲述了它们在提高Hadoop实时性上的工作及其应用。简单来讲,他们的项目需求主要有:

1. Elasticity(伸缩性)
2. High write throughput(高写吞吐量)
3. Efficient and low-latency strong consistency semantics within a data center(单个data center内高性能、低延迟的强一致性)
4. Efficient random reads from disk(disk的高性能随机读)
5. High Availability and Disaster Recovery(高可靠性、灾后恢复能力)
6. Fault Isolation(错误隔离)
7. Atomic read-modify-write primitives(read-modify-write原子操作)
8. Range Scans(范围扫描)

阅读全文>>

在今年的SIGMOD‘11上,Facebook又发了一篇新paper(点此下载),讲述了它们在提高Hadoop实时性上的工作及其应用。简单来讲,他们的项目需求主要有:

1. Elasticity(伸缩性)
2. High write throughput(高写吞吐量)
3. Efficient and low-latency strong consistency semantics within a data center(单个data center内高性能、低延迟的强一致性)
4. Efficient random reads from disk(disk的高性能随机读)
5. High Availability and Disaster Recovery(高可靠性、灾后恢复能力)
6. Fault Isolation(错误隔离)
7. Atomic read-modify-write primitives(read-modify-write原子操作)
8. Range Scans(范围扫描)

最终他们选择了Hadoop和HBase作为解决方案的基石,因为HBase已经满足了上述需求中的大部分。与此同时,他们还做了如下三点改进以满足实时性需求:
1. File Appends
2. Name Node的高可靠性优化 (AvatarNode)
3. HBase的读性能的优化

文章还列举了三个基于此方案的应用:Facebook Message,Facebook Insight,Facebook Metric Systems,大家可以着重看看这三个应用的特点及需求是怎样被这个方案满足的。

在现在这个时代,只有大公司才有如此大的数据来做新东西,难怪Facebook,Google的paper被大量追捧了。

参考资料:
[1] Facebook’s New Realtime Analytics System: HBase To Process 20 Billion Events Per Day
[2] Real Time Analytics for Big Data: An Alternative Approach

下面是这篇文章的slides:

Jeff Dean关于Google系统架构的讲座

上个月Jeff Dean在Standford的Computer Systems Colloquium (EE380)这门讨论课上详细讲了讲Google的系统架构发展过程,因为这是份很新的资料,所以特意把它的Slide下下来与大家分享一下。这门课是Standford的讲座课程,每一节课都由不同的顶级工程师/科学家/投资人前来讲授IT行业的最新动向,非常非常有料,绝对值得深挖。这门课的每节课都是带视频的,Jeff Dean的这个讲座的录像在这里

阅读全文>>

上个月Jeff Dean在Standford的Computer Systems Colloquium (EE380)这门讨论课上详细讲了讲Google的系统架构发展过程,因为这是份很新的资料,所以特意把它的Slide下下来与大家分享一下。这门课是Standford的讲座课程,每一节课都由不同的顶级工程师/科学家/投资人前来讲授IT行业的最新动向,非常非常有料,绝对值得深挖。这门课的每节课都是带视频的,Jeff Dean的这个讲座的录像在这里。想要下载该视频的同学可以去这里(要会功夫,你懂的)。

这个讲座的主要内容包括:
• Evolution of various systems at Google
– computing hardware
– core search systems
– infrastructure software

• Techniques for building large-scale systems
– decomposition into services
– design patterns for performance & reliability

个人的一点小感想:Jeff Dean在Google的这几年能面临这么多有意思的挑战,编程模型,可靠性,伸缩性,运行时环境等等等等,真是羡煞旁人。随着Google业务的扩展,整个系统的设计也面临各种各样新的挑战。只有有了扎实的基本功,在面对没有现成解决方案的新问题时才能游刃有余,做工程是如此,做研究更是如此。

可能有些同学会因为这是个英语的讲座而头疼。我觉得大家可以坚持看,哪个单词看不懂的就查字典,刚开始可能痛苦点,但是只要坚持下去,积少成多,你就会发现自己的英语慢慢就上来了,至少看这些英文slides是没问题了。

Building Software Systems at Google and Lessons Learned

另外还有几个关于Jeff Dean的Google架构的博文:
Jeff Dean 在WSDM 2009上面的演讲 Keynote 和视频终于出来了
来自Jeff Dean的分布式系统设计模式(更新版)
Jeff Dean的Stanford演讲

我还发现了Jeff另外一个在09年做的类似主题的讲座,内容稍有重复,但是可以算是一个补充,例如这个里面包括了BigTable等内容。

Enjoy!