<p>Rye: How to Make Your C++ Code Faster (2024-01-21, /c++/2024/01/21/faster-cpp-code)</p>
<h2 id="使用初始化列表initializer-lists">Use Initializer Lists</h2>
<pre><code class="language-C++">// Good Idea: build the vector in one step
std::vector<ModelObject> mos{mo1, mo2};
// or, equivalently:
auto mos = std::vector<ModelObject>{mo1, mo2};
// Don't do this
std::vector<ModelObject> mos;
mos.push_back(mo1);
mos.push_back(mo2);
</code></pre>
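<p>Constructing from an initializer list lets the vector allocate once for all of its elements. This avoids the repeated reallocations, and the element copies or moves they trigger, that a series of <code class="language-plaintext highlighter-rouge">push_back</code> calls can cause.</p>
<h2 id="减少临时对象的创建">Reduce Temporary Objects</h2>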
<pre><code class="language-C++">// Instead of
auto mo1 = getSomeModelObject();
auto mo2 = getAnotherModelObject();
doSomething(mo1, mo2);
// consider:
doSomething(getSomeModelObject(), getAnotherModelObject());
</code></pre>
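<p>Passing the temporaries directly lets the compiler <code class="language-plaintext highlighter-rouge">move</code> them into the call or elide the copy altogether. Named variables such as <code class="language-plaintext highlighter-rouge">mo1</code> are lvalues, so they are copied when <code class="language-plaintext highlighter-rouge">doSomething</code> takes its parameters by value, unless you wrap them in <code class="language-plaintext highlighter-rouge">std::move</code>.</p>
<h2 id="减少拷贝和赋值">Reduce Copies and Reassignments</h2>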
<pre><code class="language-C++">// Bad Idea
std::string somevalue;
if (caseA) {
somevalue = "Value A";
} else {
somevalue = "Value B";
}
// Better Idea
const std::string somevalue = caseA ? "Value A" : "Value B";
</code></pre>
<pre><code class="language-C++">// Bad Idea
std::string somevalue;
if (caseA) {
somevalue = "Value A";
} else if(caseB) {
somevalue = "Value B";
} else {
somevalue = "Value C";
}
// Better Idea
const std::string somevalue = [&](){
if (caseA) {
return "Value A";
} else if (caseB) {
return "Value B";
} else {
return "Value C";
}
}();
</code></pre>
<h2 id="避免使用new操作">Avoid <code class="language-plaintext highlighter-rouge">new</code></h2>
<pre><code class="language-C++">// requires two heap allocations
std::shared_ptr<ModelObject_Impl>(new ModelObject_Impl());
// should become
std::make_shared<ModelObject_Impl>(); // (it's also more readable and concise)
</code></pre>
<h2 id="优先使用unique_ptr">Prefer <code class="language-plaintext highlighter-rouge">unique_ptr</code></h2>
<pre><code class="language-C++">std::unique_ptr<ModelObject_Impl> factory();
auto shared = std::shared_ptr<ModelObject_Impl>(factory());
</code></pre>
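<p>Best practice is for factory functions to return a <code class="language-plaintext highlighter-rouge">unique_ptr</code> and to convert it to a <code class="language-plaintext highlighter-rouge">shared_ptr</code> only when shared ownership is actually needed.</p>
<h2 id="避免使用stdendl">Avoid std::endl</h2>
<p><code class="language-plaintext highlighter-rouge">std::endl</code> writes a newline and then performs a <code class="language-plaintext highlighter-rouge">flush</code>, so using it on every line forces a flush each time.</p>
<h2 id="限制局部变量的作用域">Limit Variable Scope</h2>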
<pre><code class="language-C++">// Good Idea
for (int i = 0; i < 15; ++i)
{
MyObject obj(i);
// do something with obj
}
// Bad Idea
MyObject obj; // meaningless object initialization
for (int i = 0; i < 15; ++i)
{
obj = MyObject(i); // unnecessary assignment operation
// do something with obj
}
// obj is still taking up memory for no reason
</code></pre>
<h2 id="使用init-statement">Use an <code class="language-plaintext highlighter-rouge">init-statement</code></h2>
<pre><code class="language-C++">if (MyObject obj(index); obj.good()) {
// do something if obj is good
} else {
// do something if obj is not good
}
</code></pre>
<h2 id="char和string">Char and String</h2>
<pre><code class="language-C++">// Bad Idea
std::cout << someThing() << "\n";
// Good Idea
std::cout << someThing() << '\n';
</code></pre>
<p><code class="language-plaintext highlighter-rouge">"\n"</code> is treated as a <code class="language-plaintext highlighter-rouge">const char *</code>, so the stream has to scan it for the terminating null character when writing it. The single character <code class="language-plaintext highlighter-rouge">'\n'</code> avoids those extra CPU instructions.</p>
<h2 id="避免使用stdbind">Avoid <code class="language-plaintext highlighter-rouge">std::bind</code></h2>
<pre><code class="language-C++">// Bad Idea
auto f = std::bind(&my_function, "hello", std::placeholders::_1);
f("world");
// Good Idea
auto f = [](const std::string &s) { return my_function("hello", s); };
f("world");
</code></pre>
<p>Prefer a <code class="language-plaintext highlighter-rouge">lambda</code> to <code class="language-plaintext highlighter-rouge">std::bind</code>.</p>
<h2 id="尽可能使用vector">Use <code class="language-plaintext highlighter-rouge">vector</code> Whenever Possible</h2>
<p>Before reaching for another container, ask whether a plain <code class="language-plaintext highlighter-rouge">vector</code> would be enough.</p>
<h2 id="定义虚拟析构函数">Define a Virtual Destructor</h2>
<p>Give every class that has virtual functions a virtual destructor, even if the destructor does nothing.</p>
<h2 id="不要在构造函数和析构函数中调用虚函数">Never Call Virtual Functions in Constructors or Destructors</h2>
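<pre><code class="language-C++">class Transaction {
public:
    Transaction();
    virtual void log_transaction() const = 0; // each derived class logs in its own way
};
Transaction::Transaction() {
    ...
    // log the transaction
    log_transaction(); // virtual call inside a constructor: this is the problem
}
class BuyTransaction: public Transaction {
public:
    virtual void log_transaction() const;
    ...
};
class SellTransaction: public Transaction {
public:
    virtual void log_transaction() const;
};
// The Transaction constructor runs before the BuyTransaction part of b exists,
// so the call cannot dispatch to BuyTransaction::log_transaction. It targets the
// pure virtual Transaction::log_transaction instead, which is undefined behavior.
BuyTransaction b;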
</code></pre>
<h2 id="引用">References</h2>
<ul>
<li><a href="https://www.geeksforgeeks.org/move-constructors-in-c-with-examples/">https://www.geeksforgeeks.org/move-constructors-in-c-with-examples/</a></li>
<li><a href="https://blog.quasardb.net/using-c-containers-efficiently">https://blog.quasardb.net/using-c-containers-efficiently</a></li>
<li><a href="https://www.geeksforgeeks.org/virtual-destructor/">https://www.geeksforgeeks.org/virtual-destructor/</a></li>
<li><a href="https://lefticus.gitbooks.io/cpp-best-practices/content/">https://lefticus.gitbooks.io/cpp-best-practices/content/</a></li>
<li><a href="https://www.educative.io/edpresso/what-is-a-move-constructor-in-cpp">https://www.educative.io/edpresso/what-is-a-move-constructor-in-cpp</a></li>
</ul>
<p>Doris (2024-01-16, /olap/2024/01/16/doris)</p>
<h1 id="history">History</h1>
<ul>
<li>Open-sourced by Baidu as Palo in 2017</li>
<li>Renamed to Apache Doris in 2018</li>
</ul>
<hr />
<h1 id="related-products">Related Products</h1>
<ul>
<li><a href="https://starrocks.io/">StarRocks</a></li>
<li><a href="https://selectdb.com/">SelectDB</a></li>
</ul>
<hr />
<h1 id="doris-architecture">Doris Architecture</h1>
<p><img src="/assets/images/doris/doris_architecture.png" alt="Inline" /></p>
<hr />
<h1 id="doris-role">Doris Role</h1>
<ul>
<li>Frontend(FE)
<ul>
<li>User request access, query parsing and planning, metadata and node management</li>
</ul>
</li>
<li>Backend(BE)
<ul>
<li>Storage and query plan execution</li>
</ul>
</li>
</ul>
<hr />
<h1 id="storage-engine">Storage Engine</h1>
<ul>
<li>Inverted index</li>
<li>Bloom filter</li>
<li>MIN/MAX index
<ul>
<li>Filters equality and range queries on numeric columns</li>
</ul>
</li>
<li>Z-order index
<ul>
<li>Range query on combined fields</li>
</ul>
</li>
<li>Sorted compound key index</li>
</ul>
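<p>To show why a Z-order index helps range queries on combined fields, here is a minimal sketch of Z-order (Morton) encoding. It is not Doris's implementation: the function name <code class="language-plaintext highlighter-rouge">interleave_bits</code>, the 16-bit width, and the sample points are all assumptions made for illustration. Interleaving the bits of two fields gives a single sortable key in which rows that are close in both fields tend to sit close together.</p>

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return z


# Hypothetical (x, y) pairs standing in for two indexed columns.
points = [(2, 2), (1, 1), (0, 1), (3, 1), (1, 0), (2, 0), (0, 0)]

# Sorting by the interleaved key lays the rows out along the Z-shaped curve.
ordered = sorted(points, key=lambda p: interleave_bits(*p))
print(ordered)
```

<p>After sorting, the four points of the 2x2 block from (0, 0) to (1, 1) are adjacent, so a query that constrains both fields can scan a short key range instead of the whole table.</p>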
<hr />
<h1 id="query-engine">Query Engine</h1>
<p><img src="/assets/images/doris/doris_query_engine.png" alt="Inline" /></p>
<hr />
<h1 id="storage-model">Storage Model</h1>
<ul>
<li>From Google Mesa</li>
</ul>
<hr />
<h1 id="google-mesa">Google Mesa</h1>
<ul>
<li>Built for Google's advertising measurement data</li>
<li>Pre-aggregation model</li>
</ul>
<hr />
<h1 id="google-mesa-1">Google Mesa</h1>
<p><img src="/assets/images/doris/mesa-table-examples.png" alt="Inline" /></p>
<hr />
<h1 id="mesa-updates">Mesa Updates</h1>
<p><img src="/assets/images/doris/mesa-updates-examples.png" alt="Inline" /></p>
<hr />
<h1 id="mesa-compaction">Mesa Compaction</h1>
<p><img src="/assets/images/doris/mesa-compaction-policy.png" alt="Inline" /></p>
<hr />
<h1 id="storage-model-1">Storage Model</h1>
<ul>
<li>Aggregate Key Model</li>
<li>Unique Key Model</li>
<li>Duplicate Key Model</li>
</ul>
<p><a href="https://doris.apache.org/zh-CN/docs/data-table/data-model/">https://doris.apache.org/zh-CN/docs/data-table/data-model/</a></p>
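<p>The difference between the three models can be sketched in a few lines of Python. This is only an illustration under assumed names (<code class="language-plaintext highlighter-rouge">ingest</code>, the sample rows, and the column names are hypothetical), not how Doris stores data: the Aggregate Key model combines value columns of rows that share a key, the Unique Key model keeps the newest row per key, and the Duplicate Key model keeps every row.</p>

```python
def ingest(rows, key_cols, aggregators):
    """Merge rows that share the same key, using one combine function per value column."""
    table = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in table:
            table[key] = {col: row[col] for col in aggregators}
        else:
            stored = table[key]
            for col, combine in aggregators.items():
                stored[col] = combine(stored[col], row[col])
    return table


# Hypothetical input rows; the first two share the same key.
rows = [
    {"date": "2024-01-01", "publisher_id": 1, "clicks": 10, "cost": 3},
    {"date": "2024-01-01", "publisher_id": 1, "clicks": 5, "cost": 2},
    {"date": "2024-01-01", "publisher_id": 2, "clicks": 7, "cost": 4},
]
keys = ("date", "publisher_id")


def add(old, new):
    return old + new


def replace(old, new):
    return new


# Aggregate Key model: value columns are combined, here with SUM.
aggregate_table = ingest(rows, keys, {"clicks": add, "cost": add})

# Unique Key model: the newest row for a key replaces the older one.
unique_table = ingest(rows, keys, {"clicks": replace, "cost": replace})

# Duplicate Key model: rows are kept as they arrive, with no merging.
duplicate_table = list(rows)
```

<p>With these rows, the aggregate table holds two keys with summed metrics, the unique table holds two keys with the latest values, and the duplicate table still holds all three rows.</p>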
<hr />
<h1 id="doris-rollup">Doris Rollup</h1>
<p>A rollup must be based on an Aggregate table.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ALTER TABLE ads ADD ROLLUP `PublisherRollup` (`Date`, PublisherId, Clicks, `Cost`)
</code></pre></div></div>
<p><img src="/assets/images/doris/doris-base-table-and-rollup-example.png" alt="Inline" /></p>
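<p>Conceptually, a rollup is a second, coarser copy of the base table that drops some dimensions and re-aggregates the metrics. The sketch below mimics that in Python. It assumes the base table has an extra <code class="language-plaintext highlighter-rouge">Country</code> dimension and that every metric uses SUM; both are assumptions for illustration, as is the helper name <code class="language-plaintext highlighter-rouge">build_rollup</code>.</p>

```python
from collections import defaultdict


def build_rollup(base_rows, dims, metrics):
    """Pre-aggregate base rows on a subset of dimensions, summing every metric."""
    rollup = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for row in base_rows:
        key = tuple(row[d] for d in dims)
        for metric in metrics:
            rollup[key][metric] += row[metric]
    return dict(rollup)


# Hypothetical base table keyed by (Date, PublisherId, Country).
base = [
    {"Date": "2024-01-01", "PublisherId": 1, "Country": "US", "Clicks": 10, "Cost": 3},
    {"Date": "2024-01-01", "PublisherId": 1, "Country": "UK", "Clicks": 5, "Cost": 2},
    {"Date": "2024-01-02", "PublisherId": 1, "Country": "US", "Clicks": 7, "Cost": 4},
]

# Same columns as the PublisherRollup statement: Country is dropped.
publisher_rollup = build_rollup(base, ("Date", "PublisherId"), ("Clicks", "Cost"))
print(publisher_rollup)
```

<p>A query that groups only by <code class="language-plaintext highlighter-rouge">Date</code> and <code class="language-plaintext highlighter-rouge">PublisherId</code> can then read the two rollup rows instead of scanning all three base rows.</p>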
<hr />
<h1 id="doris-materialized-view">Doris Materialized View</h1>
<p>A Doris materialized view can be created on a Duplicate table.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE MATERIALIZED VIEW `PublisherMView` AS
SELECT `Date`, PublisherId, SUM(Clicks), SUM(`Cost`) FROM ads GROUP BY `Date`, PublisherId
</code></pre></div></div>
<hr />
<h1 id="references">References</h1>
<ul>
<li><a href="https://research.google.com/pubs/archive/42851.pdf">https://research.google.com/pubs/archive/42851.pdf</a></li>
<li><a href="https://doris.apache.org/zh-CN/docs/summary/basic-summary/">https://doris.apache.org/zh-CN/docs/summary/basic-summary/</a></li>
<li><a href="https://tech.youzan.com/clickhouse-zai-you-zan-de-shi-jian-zhi-lu/">https://tech.youzan.com/clickhouse-zai-you-zan-de-shi-jian-zhi-lu/</a></li>
<li><a href="https://www.infoq.cn/article/vxup94ub59ya*k0tnefe">https://www.infoq.cn/article/vxup94ub59ya*k0tnefe</a></li>
<li><a href="https://tech.meituan.com/2020/04/09/doris-in-meituan-waimai.html">https://tech.meituan.com/2020/04/09/doris-in-meituan-waimai.html</a></li>
<li><a href="https://ericfu.me/from-mesa-to-doris/">https://ericfu.me/from-mesa-to-doris/</a></li>
<li><a href="https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/">https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/</a></li>
</ul>
<p>Gluten (2024-01-04, /olap/2024/01/04/gluten)</p>
<h2 id="what-is-gluten">What is Gluten?</h2>
<ul>
<li>Latin for <code class="language-plaintext highlighter-rouge">glue</code></li>
<li>Enables native vectorized execution for Spark</li>
<li>Contributed by Intel and Kyligence in 2022
<ul>
<li><a href="https://github.com/oap-project/gazelle_plugin">Gazelle</a></li>
</ul>
</li>
</ul>
<hr />
<h2 id="photon">Photon</h2>
<ul>
<li><code class="language-plaintext highlighter-rouge">SIGMOD 2022</code>: A Fast Query Engine for Lakehouse Systems[1]</li>
<li>Not open source</li>
</ul>
<hr />
<h2 id="why-we-need-it">Why do we need it?</h2>
<ul>
<li>Workloads have shifted from IO-bound to CPU-bound</li>
<li>JIT is not enough
<ul>
<li>Spark 1.4: Expression Compute</li>
<li>Spark 2.0: Whole-Stage Code Generation (Volcano Model)</li>
</ul>
</li>
<li>Performance has improved at the query-plan level, but not at the operator level</li>
<li>The JVM is poorly suited to CPU instruction-level optimizations such as SIMD</li>
<li>Native engines already exist, such as <code class="language-plaintext highlighter-rouge">velox</code>/<code class="language-plaintext highlighter-rouge">clickhouse</code>/<code class="language-plaintext highlighter-rouge">arrow</code></li>
</ul>
<p><img src="/assets/images/gluten/operator_perf.png" alt="inline" /></p>
<hr />
<h2 id="spark-plugin">Spark Plugin</h2>
<p><img src="/assets/images/gluten/spark_gluten.png" alt="inline" /></p>
<hr />
<h2 id="architecture">Architecture</h2>
<p><img src="/assets/images/gluten/architecture.png" alt="inline" /></p>
<hr />
<h2 id="design-goal">Design Goal</h2>
<ul>
<li>Transform Spark’s stage physical plan to Substrait plan</li>
<li>Offload performance-critical data processing to native library</li>
<li>Define clear JNI interfaces for native libraries</li>
<li>Switch available native backends easily</li>
<li>Reuse Spark’s distributed control flow</li>
<li>Manage data sharing between JVM and native</li>
<li>Extensible to support more native accelerators</li>
</ul>
<hr />
<h2 id="plan-converson--fallback">Plan Conversion & Fallback</h2>
<p><img src="/assets/images/gluten/plan_conversion.png" alt="inline" /></p>
<hr />
<h2 id="memory-management">Memory Management</h2>
<p><img src="/assets/images/gluten/memory_management.png" alt="inline" /></p>
<hr />
<h2 id="columnar-shuffle">Columnar Shuffle</h2>
<ul>
<li>Row to column</li>
<li>On Shuffle Read phase</li>
</ul>
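<p>The row-to-column step can be pictured with a small Python sketch. It only illustrates the pivot itself and says nothing about how Gluten implements its shuffle; the helper name <code class="language-plaintext highlighter-rouge">rows_to_columns</code> and the sample data are assumptions.</p>

```python
def rows_to_columns(rows, column_names):
    """Pivot a list of row tuples into one list of values per column."""
    if not rows:
        return {name: [] for name in column_names}
    # zip(*rows) transposes the rows: the i-th tuple holds the i-th field of every row.
    return {name: list(values) for name, values in zip(column_names, zip(*rows))}


# Hypothetical rows in row-oriented form.
rows = [(1, "a", 0.5), (2, "b", 1.5), (3, "c", 2.5)]
batch = rows_to_columns(rows, ["id", "tag", "score"])
print(batch)
```

<p>Once the data is in this shape, each column is a contiguous run of same-typed values, which is the layout a vectorized native engine operates on.</p>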
<hr />
<h2 id="compability">Compatibility</h2>
<ul>
<li>Clear JNI interface</li>
<li>Spark Side: shim layer</li>
</ul>
<hr />
<h2 id="performance">Performance</h2>
<p><img src="/assets/images/gluten/velox_perf.png" alt="inline" /></p>
<p><img src="/assets/images/gluten/clickhouse_perf.png" alt="inline" /></p>
<hr />
<h2 id="who-are-using-it">Who is using it?</h2>
<ul>
<li><a href="https://www.volcengine.com/product/las">ByteDance LAS</a>: <code class="language-plaintext highlighter-rouge">Bolt</code></li>
<li><a href="https://www.aliyun.com/product/ApsaraDB/ads">Aliyun AnalyticDB</a></li>
</ul>
<hr />
<h3 id="references">References</h3>
<ol>
<li><a href="https://cs.stanford.edu/people/matei/papers/2022/sigmod_photon.pdf">https://cs.stanford.edu/people/matei/papers/2022/sigmod_photon.pdf</a></li>
<li><a href="https://github.com/oap-project/gluten">https://github.com/oap-project/gluten</a></li>
<li><a href="https://cn.kyligence.io/blog/gluten-spark/">https://cn.kyligence.io/blog/gluten-spark/</a></li>
<li><a href="https://github.com/facebookincubator/velox/">https://github.com/facebookincubator/velox/</a></li>
<li><a href="https://medium.com/intel-analytics-software/accelerate-spark-sql-queries-with-gluten-9000b65d1b4e">https://medium.com/intel-analytics-software/accelerate-spark-sql-queries-with-gluten-9000b65d1b4e</a></li>
<li><a href="https://www.databricks.com/dataaisummit/session/best-exploration-columnar-shuffle-design/">https://www.databricks.com/dataaisummit/session/best-exploration-columnar-shuffle-design/</a></li>
</ol>
<p>2023 Year-End Summary (2023-12-29, /life/2023/12/29/summary)</p>
<p>After dinner this evening my wife and I went for a walk, and we each talked about the two or three events, scenes, or feelings from the past year that had left the deepest impression on us. It served as a small review of, and reflection on, the year.</p>
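<p>One thing we both mentioned was traveling as a family. The most memorable trip was the self-drive holiday in Hainan at the start of the year; that week felt relaxed and free.</p>
<p>For me, the most important part of the past year was looking inward more: examining my own thoughts from more angles and paying more attention to how I actually feel, including what matters most to me and what does not deserve much of my energy or attention.</p>
<p>Things at home have, I think, gone well on every front, which is all the more precious against the backdrop of 2023.</p>
<p>I have no concrete plan or checklist for <code class="language-plaintext highlighter-rouge">2024</code> either. I will mostly keep acting from the heart, and I hope to gain and grow a little along the way.</p>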
<p><img src="/assets/images/sanya.jpg" alt="Sanya" /></p>
<p>Applying JIT to Database Expression Evaluation (2023-12-27, /olap/2023/12/27/jit)</p>
<blockquote>
<p>This is a written version of the talk I gave at <a href="https://www.bagevent.com/event/8519252"><code class="language-plaintext highlighter-rouge">DataFunConf2023-深圳站</code></a> (the Shenzhen event).</p>
</blockquote>
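<p>The previous post covered <a href="https://wuli.us/olap/2023/12/04/code-generation.html">code generation</a>. This post covers the following topics that grow out of it:</p>
<ul>
<li>Just-in-time (JIT) compilation</li>
<li>JIT in database expression evaluation</li>
<li>Arrow Gandiva</li>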
</ul>
<h2 id="jit">JIT</h2>
<p>JIT (Just-in-time Compilation), also known as <code class="language-plaintext highlighter-rouge">runtime compilation</code>, is a way of executing a computer program in which the program is compiled while it runs rather than before it runs.</p>
<p>Taking <code class="language-plaintext highlighter-rouge">Java</code> as an example, the figure below shows how a <code class="language-plaintext highlighter-rouge">Java</code> program goes from source code to machine code, and where the <code class="language-plaintext highlighter-rouge">JIT</code> sits in that flow.</p>
<p><img src="/assets/images/jit/jit_compiler.png" alt="JIT" /></p>
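<h2 id="表达式求值">Expression Evaluation</h2>
<p>In databases, JIT is used for expression code generation, query code generation, and more. Continuing with the expression evaluation discussed in the previous post, let's look at the role JIT plays in it.</p>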
<p><img src="/assets/images/jit/expr_jit.png" alt="Expr" /></p>
<ul>
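<li>First, the SQL parser parses the SQL expression into an abstract syntax tree (AST)</li>
<li>Next, the expression compiler turns the AST into intermediate bytecode</li>
<li>Finally, the JIT compiler generates machine code at runtime</li>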
</ul>
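<p>The three steps above can be imitated with Python's own tooling. This is only an analogy under stated assumptions: the expression is written in Python syntax rather than SQL, the helper names <code class="language-plaintext highlighter-rouge">compile_filter</code> and <code class="language-plaintext highlighter-rouge">run_filter</code> are made up for this sketch, and CPython stops at bytecode that its virtual machine interprets, whereas a real JIT compiler would go on to emit machine code.</p>

```python
import ast


def compile_filter(expression: str):
    """Parse an expression into an AST, then compile the AST into a code object."""
    tree = ast.parse(expression, mode="eval")               # step 1: text -> AST
    return compile(tree, filename="<filter>", mode="eval")  # step 2: AST -> bytecode


def run_filter(code, record: dict) -> bool:
    """Evaluate the compiled expression; column names resolve against the record."""
    # Builtins are emptied because the expression only needs the record's fields.
    return bool(eval(code, {"__builtins__": {}}, record))


# Compile once, then reuse the same code object for every row.
filter_code = compile_filter("arrival_city == 'Shenzhen' and duration < 120")

# Hypothetical rows for illustration.
trips = [
    {"arrival_city": "Shenzhen", "duration": 90},
    {"arrival_city": "Shenzhen", "duration": 150},
    {"arrival_city": "Beijing", "duration": 60},
]
matches = [trip for trip in trips if run_filter(filter_code, trip)]
print(matches)
```

<p>The point of the design is that parsing and compilation happen once per query, while the compiled form is executed once per row.</p>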
<h2 id="gandiva">Gandiva</h2>
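<p>Gandiva, a component of Apache Arrow, is a runtime expression compiler that uses LLVM to generate efficient native code for computing on Arrow record batches. Gandiva only handles expressions in the projection and filtering stages.</p>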
<p><img src="/assets/images/jit/gandiva.png" alt="Gandiva" /></p>
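<p>Gandiva takes full advantage of the Arrow memory format and of modern hardware. In the Arrow memory model, an array keeps its values (data) and its validity bitmap in separate buffers, so values and their null status can usually be processed independently, which allows better instruction pipelining. On modern hardware, compiling expressions with LLVM lets execution be optimized for the local runtime environment and hardware, including whatever SIMD instructions are available. To reduce optimization overhead, many Gandiva functions are precompiled into LLVM IR (intermediate representation).</p>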
<p><img src="/assets/images/jit/gandiva_array.png" alt="Gandiva Array" /></p>
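<p>The benefit of keeping values and validity in separate buffers can be sketched in plain Python. This is not Gandiva or Arrow code: real Arrow validity buffers are bit-packed, while this sketch uses lists of booleans, and the function name <code class="language-plaintext highlighter-rouge">add_columns</code> and the sample columns are assumptions. Null slots carry an arbitrary placeholder value (0 here) in the data buffer.</p>

```python
def add_columns(left_values, left_valid, right_values, right_valid):
    """Add two nullable columns by handling data and validity in separate passes."""
    # Data pass: one tight loop over the values with no per-row null check.
    values = [a + b for a, b in zip(left_values, right_values)]
    # Validity pass: a result row is valid only when both inputs are valid.
    valid = [a and b for a, b in zip(left_valid, right_valid)]
    return values, valid


# Hypothetical columns; False marks a NULL row.
price = [10, 20, 0, 40]
price_valid = [True, True, False, True]
fee = [1, 0, 3, 4]
fee_valid = [True, False, True, True]

total, total_valid = add_columns(price, price_valid, fee, fee_valid)

# Combine the two buffers only when presenting the result.
result = [value if ok else None for value, ok in zip(total, total_valid)]
print(result)
```

<p>Because neither loop branches on nullness, each one is a uniform pass over contiguous data, which is the shape that pipelines well and that SIMD instructions can process several elements at a time.</p>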
<p>Because it builds on the Arrow memory format, Gandiva supports vectorization and SIMD naturally.</p>
<p><img src="/assets/images/jit/gandiva_simd.png" alt="Gandiva SIMD" /></p>
<p>Gandiva also does well on asynchronous thread control, multi-language support, and performance gains.</p>
<h2 id="references">References</h2>
<ol>
<li><a href="https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit">https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit</a></li>
<li><a href="https://www.pingcap.com/blog/10x-performance-improvement-for-expression-evaluation-made-possible-by-vectorized-execution/">https://www.pingcap.com/blog/10x-performance-improvement-for-expression-evaluation-made-possible-by-vectorized-execution/</a></li>
<li><a href="https://notes.eatonphil.com/2023-09-21-how-do-databases-execute-expressions.html">https://notes.eatonphil.com/2023-09-21-how-do-databases-execute-expressions.html</a></li>
<li><a href="https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/">https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/</a></li>
<li><a href="https://questdb.io/blog/2022/01/12/jit-sql-compiler/">https://questdb.io/blog/2022/01/12/jit-sql-compiler/</a></li>
<li><a href="https://www.vldb.org/pvldb/vol16/p829-boncz.pdf">https://www.vldb.org/pvldb/vol16/p829-boncz.pdf</a></li>
<li><a href="https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf">Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask</a></li>
<li><a href="https://15721.courses.cs.cmu.edu/spring2016/papers/p5-sompolski.pdf">Vectorization vs. Compilation in Query Execution</a></li>
<li><a href="https://t1mm3.github.io/assets/papers/adms17.pdf">Exploring Query Execution Strategies for JIT, Vectorization and SIMD</a></li>
<li><a href="https://www.youtube.com/watch?v=bIIqfxuT4K8">Using LLVM to accelerate processing of data in Apache Arrow</a></li>
<li><a href="https://www.youtube.com/watch?v=L5NhM7kw6Eg&list=PLSE8ODhjZXjbohkNBWQs_otTrBTrjyohi&index=13">Query Execution I (CMU Databases Systems / Fall 2019) Expression Evaluation (since 54:18)</a></li>
<li><a href="https://voltrondata.com/codex/a-new-frontier">A New Frontier</a></li>
<li><a href="https://www.dremio.com/blog/announcing-gandiva-initiative-for-apache-arrow/">https://www.dremio.com/blog/announcing-gandiva-initiative-for-apache-arrow/</a></li>
<li><a href="https://www.youtube.com/watch?v=ir6V7DkBe-w">Reinventing Amazon Redshift</a></li>
</ol>
<p>Code Generation (2023-12-17, /olap/2023/12/17/code-generation)</p>
<blockquote>
<p>This is a written version of the talk I gave at <a href="https://www.bagevent.com/event/8519252"><code class="language-plaintext highlighter-rouge">DataFunConf2023-深圳站</code></a> (the Shenzhen event).</p>
</blockquote>
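<h2 id="代码生成">Code Generation</h2>
<p>Code generation is widely used in databases. It compiles expressions, queries, and computations into machine code at runtime and then executes that code, which makes the program run more efficiently. As hardware has advanced, IO is no longer the bottleneck for queries, so for compute-intensive queries code generation can improve performance substantially.
<img src="/assets/images/jit/hardware.png" alt="Hardware" /></p>
<h2 id="为什么需要代码生成">Why Code Generation Is Needed</h2>
<p>Let's walk through how a database processes a simple query:</p>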
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
price * 0.8 + distance / 1000 AS credit
FROM
trips
WHERE
arrival_city = 'Shenzhen' AND duration < 120
</code></pre></div></div>
<p><img src="/assets/images/jit/query_plan.png" alt="Query Plan" /></p>
<p>Here we focus on the <code class="language-plaintext highlighter-rouge">Filter</code> stage, that is, how the expression after <code class="language-plaintext highlighter-rouge">WHERE</code> is processed. The expression is usually translated into an abstract syntax tree (<code class="language-plaintext highlighter-rouge">AST</code>).</p>
<h3 id="解释执行">Interpreted Execution</h3>
<p>Let's use a short piece of <code class="language-plaintext highlighter-rouge">Python</code> code to see how interpreted execution works.</p>
<p>First, define a simple class that represents a node of the <code class="language-plaintext highlighter-rouge">AST</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Node</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">left</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">right</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">left</span> <span class="o">=</span> <span class="n">left</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">right</span> <span class="o">=</span> <span class="n">right</span>
</code></pre></div></div>
<p>For the <code class="language-plaintext highlighter-rouge">AST</code> in the figure above, we get the following expression tree:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">expr_tree</span> <span class="o">=</span> <span class="n">Node</span><span class="p">(</span>
<span class="s">'AND'</span><span class="p">,</span>
<span class="n">Node</span><span class="p">(</span><span class="s">'='</span><span class="p">,</span> <span class="n">Node</span><span class="p">(</span><span class="s">'arrival_city'</span><span class="p">),</span> <span class="n">Node</span><span class="p">(</span><span class="s">'"Shenzhen"'</span><span class="p">)),</span>
<span class="n">Node</span><span class="p">(</span><span class="s">'<'</span><span class="p">,</span> <span class="n">Node</span><span class="p">(</span><span class="s">'duration'</span><span class="p">),</span> <span class="n">Node</span><span class="p">(</span><span class="s">'120'</span><span class="p">))</span>
<span class="p">)</span>
</code></pre></div></div>
<p>We traverse the expression tree depth-first (DFS) to evaluate the expression. Only the cases that appear in the expression above are handled.</p>
<ul>
<li>Call <code class="language-plaintext highlighter-rouge">visit</code> on the root node <code class="language-plaintext highlighter-rouge">AND</code>, which calls <code class="language-plaintext highlighter-rouge">visit</code> on the left and right children and returns the result of the <code class="language-plaintext highlighter-rouge">AND</code> operation</li>
<li>Call <code class="language-plaintext highlighter-rouge">visit</code> on the left node <code class="language-plaintext highlighter-rouge">=</code>, which compares the field <code class="language-plaintext highlighter-rouge">arrival_city</code> with the string constant <code class="language-plaintext highlighter-rouge">Shenzhen</code></li>
<li>Call <code class="language-plaintext highlighter-rouge">visit</code> on the right node <code class="language-plaintext highlighter-rouge"><</code>, which compares the field <code class="language-plaintext highlighter-rouge">duration</code> with the integer constant <code class="language-plaintext highlighter-rouge">120</code></li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">visit</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">record</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">node</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="k">if</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span> <span class="o">==</span> <span class="s">'AND'</span><span class="p">:</span>
        <span class="n">left_result</span> <span class="o">=</span> <span class="n">visit</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">left</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>
        <span class="n">right_result</span> <span class="o">=</span> <span class="n">visit</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">right</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">left_result</span> <span class="ow">and</span> <span class="n">right_result</span> <span class="k">if</span> <span class="n">left_result</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="n">right_result</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">else</span> <span class="bp">None</span>
    <span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span> <span class="o">==</span> <span class="s">'<'</span><span class="p">:</span>
        <span class="n">left_value</span> <span class="o">=</span> <span class="n">get_value</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">left</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>
        <span class="n">right_value</span> <span class="o">=</span> <span class="n">get_value</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">right</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">left_value</span> <span class="o"><</span> <span class="n">right_value</span> <span class="k">if</span> <span class="n">left_value</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="n">right_value</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">else</span> <span class="bp">None</span>
    <span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span> <span class="o">==</span> <span class="s">'='</span><span class="p">:</span>
        <span class="n">left_value</span> <span class="o">=</span> <span class="n">get_value</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">left</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>
        <span class="n">right_value</span> <span class="o">=</span> <span class="n">get_value</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">right</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">left_value</span> <span class="o">==</span> <span class="n">right_value</span>
    <span class="c1"># Handle other operators and operand types here
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>

<span class="k">def</span> <span class="nf">get_value</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">record</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="p">.</span><span class="n">isdigit</span><span class="p">():</span>
        <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
    <span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'"'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'"'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span> <span class="ow">in</span> <span class="n">record</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">record</span><span class="p">[</span><span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="p">]</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
</code></pre></div></div>
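<p>Interpreted execution is simple and easy to understand, but it has several problems:</p>
<ul>
<li>Many virtual function calls
<ul>
<li>These are jumps whose targets are not known in advance, so the CPU cannot predict the branch and its pipeline is interrupted</li>
</ul>
</li>
<li>Types are not known during computation, so operators contain many dynamic type checks</li>
<li>Recursive function calls interrupt the computation</li>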
</ul>
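<h2 id="什么是代码生成">What Is Code Generation</h2>
<p>Code generation means generating the code that needs to be executed. We again use <code class="language-plaintext highlighter-rouge">Python</code> to simulate handling the expression above with code generation:</p>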
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">generate_code</span><span class="p">(</span><span class="n">node</span><span class="p">):</span>
<span class="k">if</span> <span class="n">node</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="k">return</span> <span class="s">""</span>
<span class="k">if</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span> <span class="ow">in</span> <span class="p">(</span><span class="s">'AND'</span><span class="p">,</span> <span class="s">'OR'</span><span class="p">):</span>
<span class="n">left_code</span> <span class="o">=</span> <span class="n">generate_code</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">left</span><span class="p">)</span>
<span class="n">right_code</span> <span class="o">=</span> <span class="n">generate_code</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">right</span><span class="p">)</span>
<span class="k">return</span> <span class="sa">f</span><span class="s">"(</span><span class="si">{</span><span class="n">left_code</span><span class="si">}</span><span class="s">) </span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="n">right_code</span><span class="si">}</span><span class="s">)"</span>
<span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span> <span class="ow">in</span> <span class="p">(</span><span class="s">'>'</span><span class="p">,</span> <span class="s">'<'</span><span class="p">):</span>
<span class="k">return</span> <span class="sa">f</span><span class="s">"record['</span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">left</span><span class="p">.</span><span class="n">value</span><span class="si">}</span><span class="s">'] </span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="si">}</span><span class="s"> </span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">right</span><span class="p">.</span><span class="n">value</span><span class="si">}</span><span class="s">"</span>
<span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">value</span> <span class="o">==</span> <span class="s">'='</span><span class="p">:</span>
<span class="k">return</span> <span class="sa">f</span><span class="s">"record['</span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">left</span><span class="p">.</span><span class="n">value</span><span class="si">}</span><span class="s">'] == </span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">right</span><span class="p">.</span><span class="n">value</span><span class="si">}</span><span class="s">"</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Unsupported operation: </span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">value</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">jit_evaluator</span><span class="p">(</span><span class="n">expr_code</span><span class="p">,</span> <span class="n">record</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">local_vars</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">exec</span><span class="p">(</span><span class="sa">f</span><span class="s">"result = </span><span class="si">{</span><span class="n">expr_code</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="p">{</span><span class="s">'record'</span><span class="p">:</span> <span class="n">record</span><span class="p">},</span> <span class="n">local_vars</span><span class="p">)</span>
<span class="k">return</span> <span class="n">local_vars</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'result'</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Error evaluating expression: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">False</span>
<span class="c1"># Generate the code and run it
</span><span class="n">expr_code</span> <span class="o">=</span> <span class="n">generate_code</span><span class="p">(</span><span class="n">expr_tree</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Generated expression code: </span><span class="si">{</span><span class="n">expr_code</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">database</span><span class="p">:</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">jit_evaluator</span><span class="p">(</span><span class="n">expr_code</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>
<span class="k">if</span> <span class="n">result</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Record </span><span class="si">{</span><span class="n">record</span><span class="si">}</span><span class="s"> matches the filter."</span><span class="p">)</span>
</code></pre></div></div>
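<p>The loop above re-parses the generated string with <code>exec</code> for every record. The following is a minimal sketch of one possible variation, assuming the generated code is a single Python expression: compile it once with the built-in <code>compile</code> and reuse the code object for each record. The helper <code>compile_filter</code>, the sample <code>expr_code</code>, and the sample <code>database</code> are made up for this illustration and are not taken from the talk.</p>

```python
# Sketch: compile the generated expression once, then evaluate it per record.
# compile_filter, expr_code and database are illustrative stand-ins.


def compile_filter(expr_code):
    """Compile a generated expression string into a reusable predicate."""
    code_obj = compile(expr_code, "<generated-filter>", "eval")

    def predicate(record):
        # An empty __builtins__ keeps built-in functions out of the generated
        # code; this narrows the surface but is not a security sandbox.
        return bool(eval(code_obj, {"__builtins__": {}}, {"record": record}))

    return predicate


expr_code = "(record['age'] > 30) and (record['city'] == 'Shanghai')"
database = [
    {"name": "Alice", "age": 35, "city": "Shanghai"},
    {"name": "Bob", "age": 28, "city": "Shanghai"},
    {"name": "Carol", "age": 41, "city": "Beijing"},
]

matches_filter = compile_filter(expr_code)
matched = [record["name"] for record in database if matches_filter(record)]
print(matched)
```

<p>With this sample data only Alice satisfies both conditions, so the script prints <code>['Alice']</code>.</p>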
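<p>In practice, the generated code is usually LLVM intermediate representation (IR). Evaluating an expression then takes two steps:</p>
<ul>
<li>Generate the intermediate code (IR)</li>
<li>Have the compiler compile it to machine code</li>
</ul>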
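<p>As a rough illustration of the first step only, the sketch below walks a small filter tree and emits textual LLVM IR. It assumes every column is a signed 64-bit integer and every comparison has a column on the left and an integer literal on the right. The <code>Node</code> class, the <code>emit_ir</code> function, and the column names are hypothetical; compiling the text to machine code would need an LLVM binding and is not shown.</p>

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Node:
    """Hypothetical expression-tree node: an operator or a leaf value."""
    value: object
    left: Optional["Node"] = None
    right: Optional["Node"] = None


# Signed integer comparison predicates of the LLVM `icmp` instruction.
ICMP_PREDICATES = {">": "sgt", "<": "slt", "=": "eq"}


def emit_ir(tree, columns):
    """Return textual LLVM IR for a filter over i64 columns (assumed type)."""
    body = []
    temp_ids = count()

    def visit(node):
        if node.value in ("AND", "OR"):
            lhs = visit(node.left)
            rhs = visit(node.right)
            name = f"%t{next(temp_ids)}"
            # On i1 values, bitwise and/or behave as logical and/or.
            body.append(f"  {name} = {node.value.lower()} i1 {lhs}, {rhs}")
            return name
        if node.value in ICMP_PREDICATES:
            name = f"%t{next(temp_ids)}"
            predicate = ICMP_PREDICATES[node.value]
            body.append(
                f"  {name} = icmp {predicate} i64 %{node.left.value}, {node.right.value}"
            )
            return name
        raise ValueError(f"Unsupported operation: {node.value}")

    result = visit(tree)
    params = ", ".join(f"i64 %{column}" for column in columns)
    return "\n".join(
        [f"define i1 @filter({params}) {{", "entry:", *body, f"  ret i1 {result}", "}"]
    )


# age > 30 AND salary < 5000
tree = Node("AND", Node(">", Node("age"), Node(30)), Node("<", Node("salary"), Node(5000)))
ir_text = emit_ir(tree, ["age", "salary"])
print(ir_text)
```

<p>The printed function takes the two columns as parameters, compares each against its literal with <code>icmp</code>, combines the two <code>i1</code> results with <code>and</code>, and returns the combined value.</p>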
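<p>Generating the IR usually starts with <strong>type inference</strong>, followed by code generation. <strong>Type inference</strong> also serves as the check that the expression is valid.</p>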
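<p>To make the idea concrete, here is a small sketch of a type inference pass that doubles as a validity check. It is not the implementation from the talk: the <code>Node</code> class, the <code>SCHEMA</code> mapping, and the three type names are assumptions, and a leaf whose value matches a schema key is simply treated as a column.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical expression-tree node: an operator or a leaf value."""
    value: object
    left: Optional["Node"] = None
    right: Optional["Node"] = None


# Hypothetical table schema: column name -> type name.
SCHEMA = {"age": "int", "name": "str"}


def infer_type(node, schema):
    """Return the type of the expression, raising TypeError if it is invalid."""
    if node.left is None and node.right is None:
        value = node.value
        if isinstance(value, str) and value in schema:
            return schema[value]  # column reference
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "bool"
        if isinstance(value, int):
            return "int"
        if isinstance(value, str):
            return "str"
        raise TypeError(f"Unsupported literal: {value!r}")

    left_type = infer_type(node.left, schema)
    right_type = infer_type(node.right, schema)
    if node.value in (">", "<", "="):
        if left_type != right_type:
            raise TypeError(f"Cannot compare {left_type} with {right_type}")
        return "bool"
    if node.value in ("AND", "OR"):
        if left_type != "bool" or right_type != "bool":
            raise TypeError(f"{node.value} needs bool operands")
        return "bool"
    raise TypeError(f"Unsupported operation: {node.value}")


valid_tree = Node("AND", Node(">", Node("age"), Node(30)), Node("=", Node("name"), Node("Alice")))
invalid_tree = Node(">", Node("age"), Node("thirty"))

print(infer_type(valid_tree, SCHEMA))
try:
    infer_type(invalid_tree, SCHEMA)
except TypeError as error:
    print(f"Rejected: {error}")
```

<p>The valid tree infers to <code>bool</code>, while comparing the integer column <code>age</code> with a string literal is rejected before any code would be generated.</p>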
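<p>The compiler then compiles the generated code into machine code that the CPU can execute directly, and that machine code contains the classes and functions we need to run. At execution time, calling that compiled logic yields the value of the expression.</p>
<p>This post is a written version of the talk I gave at DataFunConf2023 in Shenzhen.</p>
<p><strong>Simulating a slow disk on macOS</strong> (2023-11-30T00:00:00+00:00, /mac/2023/11/30/mac-slow-disk)</p>
<p>During development you sometimes need to test against a slow disk. macOS ships with a command called dmc (<a href="https://manp.gs/mac/1/dmc">disk mount conditioner</a>) that throttles file access under a directory, which lets you simulate a slow device.</p>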
<p>It comes with a set of preset device profiles to choose from:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ dmc list
0: Faulty 5400 HDD
1: 5400 HDD
2: 7200 HDD
3: Slow SSD
4: SATA II SSD
5: SATA III SSD
6: PCIe 2 SSD
7: PCIe 3 SSD
</code></pre></div></div>
<p>To limit read and write speed under the folder <code class="language-plaintext highlighter-rouge">~/data</code>, run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo dmc start ~/data/ "Slow SSD"
</code></pre></div></div>
<p>You can inspect the exact limits that the <code class="language-plaintext highlighter-rouge">Slow SSD</code> profile applies:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ dmc show "Slow SSD"
Profile: Slow SSD
Type: SSD
Access time: 100 us
Read throughput: 250 MB/s
Write throughput: 125 MB/s
I/O Queue Depth: 32
Max Read Bytes: 33554432
Max Write Bytes: 33554432
Max Read Segments: 64
Max Write Segments: 64
</code></pre></div></div>
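<p>One rough way to check that a limit is in effect is to time a sequential write into the throttled folder and compare the result with the profile's write throughput. The sketch below is only a sanity check under simple assumptions: the function name <code>measure_write_throughput</code> and the sizes are made up for illustration, and caching can skew the number. It writes to the system temp directory by default, so swap in the folder you passed to <code>dmc start</code>.</p>

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_mb=64, chunk_mb=4):
    """Write total_mb of random data into directory and return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(total_mb // chunk_mb):
                handle.write(chunk)
            handle.flush()
            # Ask the OS to push the data out so the timing is not purely in-memory.
            os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    return total_mb / elapsed


# Replace with os.path.expanduser("~/data") to test the throttled folder.
target_dir = tempfile.gettempdir()
throughput = measure_write_throughput(target_dir, total_mb=16)
print(f"{throughput:.1f} MB/s")
```

<p>Running it once before and once after <code>dmc start</code> gives two numbers to compare; a single run on its own says little.</p>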
<p>When you are done testing, run <code class="language-plaintext highlighter-rouge">dmc stop</code> to clear the setting and restore normal read and write speed for that folder.</p>
<p><strong>2022 year-end review</strong> (2022-12-24T00:00:00+00:00, /life/2022/12/24/summary)</p>
<p>The annual review (really a running log) is here. There were some changes this year, but overall not much that was exciting; it was a fairly uneventful year.</p>
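<p>Right after the Lunar New Year, Taotao's maternal grandmother came to take over from her paternal grandmother, planning to stay a little over a month before heading back. Her maternal grandfather later went home early because of things at home, and then Shanghai went into lockdown until June. These past couple of days I have been getting up at 6 a.m. to order groceries on Dingdong Maicai and still could not get a delivery slot, which brought those miserable lockdown days right back. With all of that, half the year was gone.</p>
<p>After the lockdown lifted we sent Taotao back, then brought her to Shanghai again when school started. In between, we used the weekends to settle most of the renovation decisions. Because she was starting kindergarten, we had to move to Minhang at the end of August. Worried about formaldehyde and the like from the renovation, we rented a small room next to Taotao's new kindergarten as a stopgap for a little over a month. During that time we lived like migratory birds: we went there to sleep at night, and during the day we went to work and Taotao went to school. Around the National Day holiday we officially moved into the new home.</p>
<p>This semester we found an after-school care program for Taotao and tried looking after her on our own. The commute got a bit longer, there were still things at home to sort out gradually, and we had signed her up for several extracurricular classes. With all of that piled together, the two of us felt like we were racing against time every day, with little time left for ourselves. Now, near the end of the year and with COVID controls being lifted, we have asked Taotao's paternal grandmother to come and help after all; otherwise, with kindergarten out for the holidays and her at home, the two of us really could not cope.</p>
<p>At work I simply did what needed doing, without many highlights. Most of the small goals outside work that I set at the start of the year were not met either, so in that respect the year was something of a failure. I did do some thinking about my career this year, though not very deeply, and felt some anxiety. I hope to try more and reflect more on this next year.</p>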
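<p>Overall, the past year was still full and fulfilling. My teammate and I worked together to keep our little family moving steadily forward, and we managed to hang on in a turbulent environment. My teammate is always stronger than I am; her resilience always brings out greater strength. Middle-aged people do not have many unrealistic fantasies: stay healthy, take care of the family, have a wife and a warm bed, and get some solid work done. That is enough.</p>
<p><strong>Data Science Tool</strong> (2022-03-22T22:00:00+00:00, /python/2022/03/22/data-science)</p>
<ul>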
<li><a href="https://hex.tech/">https://hex.tech/</a>
<ul>
<li>a collaborative data platform that brings everyone together to explore, analyze, and share</li>
</ul>
</li>
<li><a href="https://www.getdbt.com/">https://www.getdbt.com/</a>
<ul>
<li>helps data teams work like software engineers—to ship trusted data, faster</li>
</ul>
</li>
<li><a href="https://streamlit.io/">https://streamlit.io/</a>
<ul>
<li>the fastest way to build and share data apps</li>
</ul>
</li>
</ul>
<p><strong>Thanks for the help back then</strong> (2022-03-21T22:00:00+00:00, /life/2022/03/21/move-out)</p>
<p>Yesterday, while scrolling through WeChat Moments, I noticed that a business owner I had met during graduate school was posting from the United States. I asked a few casual questions and found out he had indeed moved there to live.</p>
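<p>I met him at a tech sharing event. He organized and paid for the dinner afterwards, and we all exchanged contact details. At the time he wanted to build interoperability between online office software, for example letting Microsoft's Office trio and WPS documents work with each other, so that a user of one vendor could communicate and collaborate online seamlessly with a user of another. Leave aside whether the idea was sound. He had no IT background himself, and he had brought in someone who also lacked one to do the work, so you can imagine how people reacted when he talked with them.</p>
<p>Later he asked me to help out. He seemed like a genuinely decent person, so I helped him maintain some cloud servers and handled technical things he did not understand, roughly as a part-time job. He could afford to do this because he had built up some savings from selling medical equipment earlier, so he could still support a small team. I do not remember exactly how long this went on. It really produced nothing, though he attended plenty of seminars and tech talks. He even managed to get tickets to Google I/O, went there, and got to chat with some of the senior leaders. I genuinely admired that.</p>
<p>Later he realized this was not going to work and switched to solutions related to food safety, something like a review platform for monitoring the food safety of each restaurant. He made some simple attempts, including manual on-the-ground promotion. I helped with some simple server maintenance and website building, and did not really contribute much.</p>
<p>After that I graduated and stopped doing technical support for him. He also invited me to join him, saying he could match the best offer I had received, and I politely declined. Later I treated him to a meal and we chatted about various things. During that period my father was hospitalized after a car accident, and he gave me a very large red envelope, saying it was for my father. I declined, but he insisted. One thing he kept telling me was: do not feel any burden about this; passing this kindness on to help others later is how you repay me. That had a big influence on me, and I have tried to live by it since.</p>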
<p>Life becomes richer because of people and events like these. I am grateful for his help and kindness, and I wish him a happy life on the other side of the ocean.</p>