Shandong University (2003–2010), Bachelor's and Master's degrees

Since the original profile is in English, a brief summary note follows each entry:

Software Engineer, Backend Development

Company: Castbox

Dates Employed: Mar 2016 – Present · Duration: 4 yrs 3 mos

Location: Beijing City, China

- Crawler system. The software stack includes Python, Requests, BeautifulSoup, feedparser, Squid, Redis, MongoDB, etc. By collecting RSS feeds submitted by users and gathered from search queries, we grew the number of podcasts in the database from 200K to 600K and the number of episodes from 20M to 40M. Meanwhile, we optimize episode images by compressing and cropping them, reducing image sizes from several MB to under 300 KB with little loss of quality, which saves network traffic and shortens image loading time in the mobile app. By applying a heuristic algorithm, we reduced the average delay between an episode being released by the podcaster and it appearing on our platform from 3 hours to 30 minutes.
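
To make the ingestion and image-optimization steps concrete, here is a minimal Python sketch of the two pieces: parsing one RSS feed with feedparser and re-encoding a cover image with Pillow. The feed URL, size limit, and JPEG quality are placeholders, not the production values.

```python
# Hedged sketch (not the production crawler): fetch one RSS feed and
# shrink a cover image so it lands well under the ~300 KB target.
import io

import feedparser          # pip install feedparser
import requests            # pip install requests
from PIL import Image      # pip install Pillow

FEED_URL = "https://example.com/podcast/feed.xml"   # placeholder feed

feed = feedparser.parse(requests.get(FEED_URL, timeout=10).content)
print(f"{feed.feed.get('title')}: {len(feed.entries)} episodes")

def compress_cover(image_bytes: bytes, max_side: int = 800, quality: int = 80) -> bytes:
    """Downscale and re-encode a cover image; parameters are illustrative."""
    img = Image.open(io.BytesIO(image_bytes))
    img.thumbnail((max_side, max_side))              # keeps the aspect ratio
    out = io.BytesIO()
    img.convert("RGB").save(out, "JPEG", quality=quality, optimize=True)
    return out.getvalue()
```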

- Search system. It is built on Elasticsearch and supports up to 12 languages, including English, Portuguese, Spanish, German, Dutch, CJK, etc. We put a lot of effort into improving and optimizing the search system in the following areas. 1) Index freshness: the latency of the whole pipeline is under 10 seconds, and more than 20K episodes are indexed per day. 2) Search latency: by using caches effectively and fine-tuning Elasticsearch, we keep the latency of the search API under 200 ms and the latency of the suggestion API under 10 ms. 3) Search relevance: besides the document relevance score returned by Elasticsearch, we add signals such as play counts, subscription counts, and the recent click-through rate (CTR) of search results to compute a better relevance score.
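
As an illustration of folding behavioral signals into Elasticsearch's text relevance, here is a hedged sketch using a function_score query via the elasticsearch-py 8.x client. The index name and the play_count, sub_count, and ctr_7d fields (and their weights) are assumptions for the example, not the production mapping.

```python
# Sketch only: combine text relevance with popularity/CTR signals.
from elasticsearch import Elasticsearch   # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "function_score": {
        "query": {"multi_match": {
            "query": "true crime",
            "fields": ["title^3", "author^2", "description"],
        }},
        "functions": [
            # Field names and factors below are illustrative assumptions.
            {"field_value_factor": {"field": "play_count", "modifier": "log1p", "factor": 0.1}},
            {"field_value_factor": {"field": "sub_count", "modifier": "log1p", "factor": 0.2}},
            {"field_value_factor": {"field": "ctr_7d", "factor": 5.0, "missing": 0}},
        ],
        "score_mode": "sum",    # combine the signal scores
        "boost_mode": "sum",    # add them to the text-relevance score
    }
}

resp = es.search(index="episodes", query=query, size=20)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```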

- Recommender system. We use the LightFM Python library and apply the WARP algorithm to user subscription data from the last 3 months. With parameter tuning and A/B testing, we raised the CTR of recommended podcasts from 2.16% to 4.52% and the CTR of similar podcasts from 1.90% to 3.19%.

(Recommender system: uses the LightFM Python library, applying WARP to the last three months of user subscription data; with parameter tuning and A/B testing it improved the click-through rate.)
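
A minimal sketch of the LightFM/WARP setup on (user, podcast) subscription pairs follows; the toy data and hyperparameters are illustrative, not the tuned production configuration.

```python
# Hedged sketch: WARP-loss matrix factorization on subscription pairs.
import numpy as np
from lightfm import LightFM                     # pip install lightfm
from lightfm.data import Dataset

# (user_id, podcast_id) subscription events from the last 3 months (toy data).
subscriptions = [("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u3", "p3")]

dataset = Dataset()
dataset.fit(
    users=(u for u, _ in subscriptions),
    items=(p for _, p in subscriptions),
)
interactions, _ = dataset.build_interactions(subscriptions)

model = LightFM(loss="warp", no_components=64, learning_rate=0.05)
model.fit(interactions, epochs=30, num_threads=4)

# Score every podcast for user "u1"; the top scores become recommendations.
n_users, n_items = interactions.shape
user_index = dataset.mapping()[0]["u1"]
scores = model.predict(user_index, np.arange(n_items))
print(np.argsort(-scores)[:10])
```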


Senior Software Architect

Company: Umeng

Dates Employed: Jun 2012 – Mar 2016 · Duration: 3 yrs 10 mos

Location: China

- kvproxy, an asynchronous, high-performance HTTP server for easy access to various database systems such as HBase, MySQL, Riak, etc. It is written in Scala on Finagle, uses Google Protocol Buffers as the data exchange format, and uses Google Guava's LRU cache as the application-level cache. Because Finagle wraps asynchronous computation in a 'Future' and encourages developers to treat the server as a function ("Your Server as a Function", http://monkey.org/~marius/funsrv.pdf), kvproxy can be used not only as a server but also as a library that is easily embedded into other applications.
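
The real kvproxy is Scala on Finagle, but the "server as a function" idea it relies on can be sketched in a few lines of Python with asyncio: a service is just an async function from request to response, and filters compose services into new services, which is what lets the same code act as a server or as an embedded library. Finagle's Service and Filter types play the roles of `echo_service` and `logging_filter` below.

```python
# Conceptual sketch only; the original project uses Scala/Finagle.
import asyncio
from typing import Awaitable, Callable

# A "service" is just an async function: request -> response.
Service = Callable[[str], Awaitable[str]]

async def echo_service(request: str) -> str:
    return f"echo: {request}"

def logging_filter(service: Service) -> Service:
    """A filter wraps a service and returns another service (composition)."""
    async def wrapped(request: str) -> str:
        print(f"--> {request!r}")
        response = await service(request)
        print(f"<-- {response!r}")
        return response
    return wrapped

async def main():
    service = logging_filter(echo_service)
    # Used as a library: call the service directly, no network layer needed.
    print(await service("hello"))

asyncio.run(main())
```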

- Performance tuning of MapReduce jobs and Hadoop cluster usage, from the perspectives of:
1. Application: use HBase bulk loading instead of writing data to HBase directly, for better throughput and stability.
2. Algorithm: use the HyperLogLog algorithm instead of a set to compute cardinality, for better performance and arbitrary-time-range queries (see the sketch after the notes below).
3. System: turn off MapReduce speculative execution when reading data from HBase.
4. Language: use JNI instead of pure Java code to speed up CPU-bound computation.

Used HBase bulk loading instead of writing data directly, improving throughput and stability.

Used the HyperLogLog algorithm instead of sets to compute cardinality.

Turned off MapReduce speculative execution when reading data from HBase, and used JNI in place of pure Java code.
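
To show what the HyperLogLog switch buys (item 2 above), here is a toy, self-contained estimator in Python: a few kilobytes of fixed-size registers replace an exact set, at the cost of a small, bounded error. The production jobs used a proper library implementation; this sketch only illustrates the idea.

```python
# Toy HyperLogLog sketch, for illustration only.
import hashlib
import math

class ToyHLL:
    """Fixed-size registers instead of an exact set of elements."""

    def __init__(self, p: int = 14):
        self.p = p                          # 2**p registers (~16K here)
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction, m >= 128

    def add(self, item: str) -> None:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h & (self.m - 1)              # low p bits choose a register
        rest = h >> self.p                  # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of first 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:   # small-range (linear counting) correction
            est = self.m * math.log(self.m / zeros)
        return est

hll = ToyHLL()
for i in range(100_000):
    hll.add(f"user-{i}")
print(int(hll.count()))                     # close to 100000, within a few percent
```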

- FastHBaseRest, an asynchronous, high-performance HTTP server written in Netty for accessing HBase easily from multiple languages via Google Protocol Buffers. Compared to HBase's embedded REST server ('hbase rest'), access latency is 20% lower and transfer size is 40% smaller. It also has additional capabilities such as request rewriting.

- usched, an internal job scheduler that arranges interdependent jobs. It defines and implements a DSL called JDL (Job Description Language), which describes the dependencies between jobs and their properties. It runs as an HTTP server and provides a web console for managing jobs, including submission and a dashboard of running status. Thousands of MapReduce jobs are scheduled by usched each day, with scheduling latency below 5 seconds.

Internal job scheduling system; defined a job description language to express jobs, their dependencies, and their properties, and provides a web console for managing jobs, including submission and a running-status dashboard.
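
The core of such a scheduler is ordering jobs by their dependencies. Below is a hedged Python sketch of that idea using the standard-library graphlib module; the job names and dependency graph are made up, and the real usched adds JDL parsing, the HTTP API, and the web console on top.

```python
# Sketch of dependency-ordered job launching (not the real usched).
from graphlib import TopologicalSorter   # Python 3.9+

# job -> set of jobs it depends on (illustrative names)
jobs = {
    "aggregate_daily": {"import_logs"},
    "build_report":    {"aggregate_daily", "load_dimensions"},
    "import_logs":     set(),
    "load_dimensions": set(),
}

ts = TopologicalSorter(jobs)
ts.prepare()
while ts.is_active():
    for job in ts.get_ready():            # jobs whose dependencies are done
        print(f"launching {job}")         # submit the MapReduce job here
        ts.done(job)                      # mark finished (sequential in this sketch)
```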


Contract Software Engineer

Company: LogZilla

Dates Employed: Apr 2015 – Aug 2015 · Duration: 5 mos

Location: Tianjin City, China



A real-time event analytics platform.

- Performance tuning to support ~200K eps (events per second).
- Implemented a new event storage engine to support ~1M eps.


Contract Software Engineer

Company: Codership – Galera Cluster

Dates Employed: Apr 2014 – Nov 2014 · Duration: 8 mos

Location: Tianjin City, China

A drop-in multi-master replication plugin for MySQL.

Optimized the cluster recovery process for the data-center outage case, reducing recovery time from about 30 s to under 3 s.


Senior Development Engineer

Company: Baidu

Dates Employed: Jun 2008 – Jun 2012 · Duration: 4 yrs

Location: Beijing City, China


- dstream, an in-house distributed real-time stream processing system written in C++, similar to Twitter's Storm and Yahoo!'s S4. The alpha version, running on a 10-node cluster, processes 1 million tuples per second while keeping latency under 100 ms.

(An in-house distributed real-time stream processing system written in C++, similar to Twitter's Storm and Yahoo!'s S4; the alpha version runs on a 10-node cluster and processes one million tuples per second with latency under 100 ms.)
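
For readers unfamiliar with the Storm/S4 model, here is a toy single-process Python illustration of the spout-and-bolt pipeline that dstream implements in distributed C++; the word-count example and names are illustrative only.

```python
# Toy, single-process sketch of a spout -> bolt topology.
import queue
import threading

emitted = queue.Queue()

def spout(lines):
    """Source: emit raw tuples into the topology."""
    for line in lines:
        emitted.put(line)
    emitted.put(None)                     # end-of-stream marker

def count_bolt():
    """Bolt: consume tuples and maintain running word counts."""
    counts = {}
    while (line := emitted.get()) is not None:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    print(counts)

t = threading.Thread(target=count_bolt)
t.start()
spout(["to be or not to be", "be quick"])
t.join()
```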

- comake2, an in-house build system written in Python that takes advantage of ideas from open-source build systems such as SCons, CMake, Google's GYP, Boost's Jam, etc. It has been widely used at Baidu for continuous integration.

(An in-house build system written in Python, built on ideas from open-source tools and widely used for continuous integration at Baidu. Honestly, I think it is about the same as Jenkins; Baidu just loves reinventing the wheel, only to find in the end that open source is better.)

- infpack, an in-house data exchange format implemented in C++. Compared to Google's Protocol Buffers and Facebook's Thrift, serialization and deserialization are about 20–30% faster and the encoded size is 10–20% smaller. The generated code is carefully hand-tuned, so the implementation is very efficient.

(An in-house data exchange format, comparable to Google's Protocol Buffers and Facebook's Thrift; serialization is 20–30% faster, the output is 10–20% smaller, and the generated code is more efficient.)

- ddbs (distributed database system), an in-house distributed relational database. I mainly worked on the SQL parser, extending its syntax for more capability, and implemented SPASS (single point automatic switch system) for fault tolerance.

(An in-house distributed relational database; mainly extended the SQL syntax and implemented single-point automatic switching to improve fault tolerance.)

- Maintainer and developer of Baidu's common libraries, including BSL (Baidu standard library), ullib (wrappers for socket I/O, file I/O, and some Linux syscalls), comdb (an embedded high-performance key-value storage system), memory allocators, character encoding, regular expressions, signature and hash algorithms, URL handling, an HTTP client, lock-free data structures and algorithms, etc.

(Maintainer of Baidu's common libraries, e.g. the Baidu standard library, socket and file I/O wrappers, Linux syscall wrappers, an embedded high-performance key-value store, memory allocators, character encoding, regular expressions, signature and hash algorithms, URL handling, an HTTP client, and lock-free data structures.)

- vitamin, an in-house tool that detects potential bugs in C/C++ source code through static analysis. It reports thousands of valuable warnings when scanning Baidu's entire code repository, while keeping the false-positive rate relatively low.

(An in-house tool that finds potential bugs in source code through static analysis; it reported thousands of valuable warnings across Baidu's codebase.)
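
vitamin itself analyzes C/C++, but the core idea, walking a syntax tree and flagging bug-prone patterns, can be shown with a small Python sketch using the ast module; the mutable-default-argument check here is just a stand-in example, not one of vitamin's rules.

```python
# Toy static-analysis sketch: flag a bug-prone pattern in a syntax tree.
import ast

SOURCE = """
def append(item, bucket=[]):      # bug-prone: shared mutable default
    bucket.append(item)
    return bucket
"""

tree = ast.parse(SOURCE)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        for default in node.args.defaults:
            if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                print(f"warning: mutable default argument in '{node.name}' "
                      f"(line {default.lineno})")
```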

- IDL compiler, an in-house compiler built with Flex and Bison that translates a DSL (domain-specific language) into code supporting data exchange between C/C++ structs/classes and Mcpack (an in-house data pack format similar to Google's Protocol Buffers).

(An in-house compiler that generates data-exchange code for C/C++.)

You can tell Baidu really had too much time on its hands, building such a pile of components with duplicated functionality to burn shareholders' money.