DataStore Profiling - ClickHouse Documentation

The DataStore Profiler helps you measure execution time and identify performance bottlenecks.

Quick start

from chdb import datastore as pd
from chdb.datastore.config import config, get_profiler

# Enable profiling
config.enable_profiling()

# Run your operations
ds = pd.read_csv("large_data.csv")
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('category')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# View report
profiler = get_profiler()
print(profiler.report())

Enabling profiling

from chdb.datastore.config import config

# Enable profiling
config.enable_profiling()

# Disable profiling
config.disable_profiling()

# Check if profiling is enabled
print(config.profiling_enabled)  # True or False

Profiler API

Getting the profiler

from chdb.datastore.config import get_profiler

profiler = get_profiler()

report()

Displays a performance report.

profiler.report(min_duration_ms=0.1)

Parameters:

Parameter	Type	Default	Description
`min_duration_ms`	float	`0.1`	Show only steps whose duration is >= this value (in milliseconds)

Example output:

======================================================================
EXECUTION PROFILE
======================================================================
   45.79ms (100.0%) Total Execution
     23.25ms ( 50.8%) Query Planning [ops_count=2]
     22.29ms ( 48.7%) SQL Segment 1 [ops=2]
       20.48ms ( 91.9%) SQL Execution
        1.74ms (  7.8%) Result to DataFrame
----------------------------------------------------------------------
      TOTAL:    45.79ms
======================================================================

The report shows:

Duration of each step in milliseconds
Percentage of the parent step's time (or of the total time for top-level steps)
Hierarchical nesting of operations
Metadata for each step (for example, ops_count and ops)

step()

Manually time a block of code.

with profiler.step("custom_operation"):
    # Your code here
    expensive_operation()
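
To make the idea of a timed, nestable step concrete, here is a minimal stand-in written with the standard library only. It is a sketch, not chDB's implementation: the `MiniProfiler` class and its internals are hypothetical, and the dotted naming of nested steps is an assumption modeled on the `summary()` output shown on this page.

```python
import time
from contextlib import contextmanager


class MiniProfiler:
    """Hypothetical stand-in for illustration; not chDB's Profiler."""

    def __init__(self):
        self._stack = []      # names of the steps that are currently open
        self._durations = {}  # dotted step path -> elapsed milliseconds

    @contextmanager
    def step(self, name):
        # Entering a step pushes its name, so nested steps get a dotted path.
        self._stack.append(name)
        path = ".".join(self._stack)
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the block raised.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self._durations[path] = self._durations.get(path, 0.0) + elapsed_ms
            self._stack.pop()

    def summary(self):
        return dict(self._durations)


mini = MiniProfiler()
with mini.step("load"):
    with mini.step("parse"):
        sum(range(100_000))  # placeholder workload

for name, duration in mini.summary().items():
    print(f"{name}: {duration:.2f}ms")
```

The outer step always reports at least as much time as the inner one, because its timer starts earlier and stops later.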

clear()

Clear all profiling data.

profiler.clear()

summary()

Returns a dictionary mapping step names to durations (in milliseconds).

summary = profiler.summary()
for name, duration in summary.items():
    print(f"{name}: {duration:.2f}ms")

Example output:

Total Execution: 45.79ms
Total Execution.Cache Check: 0.00ms
Total Execution.Query Planning: 23.25ms
Total Execution.SQL Segment 1: 22.29ms
Total Execution.SQL Segment 1.SQL Execution: 20.48ms
Total Execution.SQL Segment 1.Result to DataFrame: 1.74ms
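
The dotted names make it possible to recompute the parent-relative percentages of the text report from a summary dictionary. The sketch below assumes that a step's parent is everything before the last dot and that step names themselves contain no dots; `percent_of_parent` is a hypothetical helper, and the dictionary simply copies the example output above.

```python
def percent_of_parent(summary):
    """Map each dotted step name to its share (in %) of the parent step's time."""
    shares = {}
    for name, duration in summary.items():
        parent = name.rpartition(".")[0]  # "" for a top-level step
        # Top-level steps, or steps whose parent is missing, are compared to themselves.
        base = summary.get(parent, duration) if parent else duration
        shares[name] = 100.0 * duration / base if base else 0.0
    return shares


# Values copied from the example summary output.
example = {
    "Total Execution": 45.79,
    "Total Execution.Cache Check": 0.00,
    "Total Execution.Query Planning": 23.25,
    "Total Execution.SQL Segment 1": 22.29,
    "Total Execution.SQL Segment 1.SQL Execution": 20.48,
    "Total Execution.SQL Segment 1.Result to DataFrame": 1.74,
}

shares = percent_of_parent(example)
for name, share in shares.items():
    print(f"{share:5.1f}%  {name}")
```

With these numbers, SQL Execution comes out at 91.9% of SQL Segment 1 and Query Planning at 50.8% of Total Execution, matching the example report.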

Understanding the report

Step names

Step name	Description
`Total Execution`	Total execution time
`Query Planning`	Time spent planning the query
`SQL Segment N`	Execution of SQL segment N
`SQL Execution`	Actual execution of the SQL query
`Result to DataFrame`	Conversion of the results to pandas
`Cache Check`	Check of the query cache
`Cache Write`	Write of the results to the cache

Duration

Planning steps (Query Planning): usually fast
Execution steps (SQL Execution): where the actual work happens
Transfer steps (Result to DataFrame): conversion of the data to pandas

Identifying bottlenecks

======================================================================
EXECUTION PROFILE
======================================================================
  200.50ms (100.0%) Total Execution
    10.25ms (  5.1%) Query Planning [ops_count=4]
   190.00ms ( 94.8%) SQL Segment 1 [ops=4]
     185.00ms ( 97.4%) SQL Execution    <- Main bottleneck
       5.00ms (  2.6%) Result to DataFrame
----------------------------------------------------------------------
      TOTAL:   200.50ms
======================================================================
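
The same reading can be automated on a summary dictionary: ignore steps that merely contain other steps and pick the slowest remaining one. This is a sketch under assumptions. `slowest_leaf` is a hypothetical helper, and the dictionary below restates the report above using the dotted naming from the `summary()` example.

```python
def slowest_leaf(summary):
    """Return (name, duration_ms) for the slowest step that has no child steps."""
    leaves = {
        name: duration
        for name, duration in summary.items()
        # A step is a leaf if no other step name extends it with ".<child>".
        if not any(other.startswith(name + ".") for other in summary)
    }
    return max(leaves.items(), key=lambda item: item[1])


# Durations taken from the report above, keyed by assumed dotted names.
profile = {
    "Total Execution": 200.50,
    "Total Execution.Query Planning": 10.25,
    "Total Execution.SQL Segment 1": 190.00,
    "Total Execution.SQL Segment 1.SQL Execution": 185.00,
    "Total Execution.SQL Segment 1.Result to DataFrame": 5.00,
}

name, duration = slowest_leaf(profile)
print(f"Bottleneck: {name} ({duration:.2f}ms)")
# prints: Bottleneck: Total Execution.SQL Segment 1.SQL Execution (185.00ms)
```

Restricting the search to leaves avoids always reporting Total Execution, which is the largest entry by construction.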

Profiling patterns

Profile a single query

config.enable_profiling()
profiler = get_profiler()
profiler.clear()  # Clear previous data

# Run query
result = ds.filter(...).groupby(...).agg(...).to_df()

# View this query's profile
print(profiler.report())

Profile multiple queries

config.enable_profiling()
profiler = get_profiler()
profiler.clear()

# Query 1
with profiler.step("Query 1"):
    result1 = query1.to_df()

# Query 2
with profiler.step("Query 2"):
    result2 = query2.to_df()

print(profiler.report())

Compare approaches

profiler = get_profiler()

# Approach 1: Filter then groupby
profiler.clear()
with profiler.step("filter_then_groupby"):
    result1 = ds.filter(ds['x'] > 10).groupby('y').sum().to_df()
summary1 = profiler.summary()
time1 = summary1.get('filter_then_groupby', 0)

# Approach 2: Groupby then filter
profiler.clear()
with profiler.step("groupby_then_filter"):
    result2 = ds.groupby('y').sum().filter(ds['x'] > 10).to_df()
summary2 = profiler.summary()
time2 = summary2.get('groupby_then_filter', 0)

print(f"Approach 1: {time1:.2f}ms")
print(f"Approach 2: {time2:.2f}ms")
print(f"Winner: {'Approach 1' if time1 < time2 else 'Approach 2'}")
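
A single measurement per approach can be noisy, so it may help to repeat each one and compare medians. The sketch below uses only the standard library; `median_ms` and the two placeholder workloads are hypothetical and stand in for functions that would each run one query's `.to_df()` call. If query caching is active, repeated runs may measure cached executions rather than the original work.

```python
import statistics
import time


def median_ms(fn, repeats=5):
    """Run fn() `repeats` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Placeholder workloads; in practice each would execute one query variant.
def approach_1():
    return sum(x for x in range(50_000) if x > 10)


def approach_2():
    return sum(range(11, 50_000))


time1 = median_ms(approach_1)
time2 = median_ms(approach_2)
print(f"Approach 1: {time1:.2f}ms (median of 5)")
print(f"Approach 2: {time2:.2f}ms (median of 5)")
```

The median is used rather than the mean so that one unusually slow run, such as a cold start, does not dominate the comparison.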

Performance tips

1. Check SQL execution time

If SQL Execution is the bottleneck:

Add more filters to reduce the data volume
Use Parquet instead of CSV
Check for appropriate indexes (for database-backed data sources)

2. Check I/O time

If read_csv or read_parquet is the bottleneck:

Use Parquet (a compressed, columnar format)
Read only the columns you need
Apply filtering at the source where possible

3. Check data transfer

If to_df is slow:

The result set may be too large
Add more filters or a limit
Use head() to preview

4. Compare engines

from chdb.datastore.config import config, get_profiler

# Profiling must be enabled and the profiler retrieved before measuring
config.enable_profiling()
profiler = get_profiler()

# Profile with chdb
config.use_chdb()
profiler.clear()
result_chdb = query.to_df()
time_chdb = profiler.total_duration_ms

# Profile with pandas
config.use_pandas()
profiler.clear()
result_pandas = query.to_df()
time_pandas = profiler.total_duration_ms

print(f"chdb: {time_chdb:.2f}ms")
print(f"pandas: {time_pandas:.2f}ms")

Best practices

1. Profile before optimizing

# Don't guess - measure!
config.enable_profiling()
result = your_query.to_df()
print(get_profiler().report())

2. Clear between tests

profiler.clear()  # Clear previous data
# Run test
print(profiler.report())

3. Use `min_duration_ms` to focus

# Only show operations >= 100ms
profiler.report(min_duration_ms=100)
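
The effect of the threshold can be pictured as a simple filter over the recorded steps. The sketch below is an illustration only, not the profiler's code: `steps_at_least` is a hypothetical helper applied to a copy of the example summary from this page.

```python
def steps_at_least(summary, min_duration_ms):
    """Keep only the steps whose duration is >= min_duration_ms."""
    return {name: ms for name, ms in summary.items() if ms >= min_duration_ms}


# Values copied from the example summary output.
example = {
    "Total Execution": 45.79,
    "Total Execution.Cache Check": 0.00,
    "Total Execution.Query Planning": 23.25,
    "Total Execution.SQL Segment 1": 22.29,
    "Total Execution.SQL Segment 1.SQL Execution": 20.48,
    "Total Execution.SQL Segment 1.Result to DataFrame": 1.74,
}

visible = steps_at_least(example, 0.1)
for name, ms in visible.items():
    print(f"{name}: {ms:.2f}ms")
```

With the default threshold of 0.1, the 0.00ms Cache Check entry is dropped, which is consistent with it appearing in the example summary but not in the example report.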

4. Profile representative data

# Profile with real-world data sizes
# Small test data may not show real bottlenecks

5. Disable in production

# Development
config.enable_profiling()

# Production
config.set_profiling_enabled(False)  # Avoid overhead

Example: complete profiling session

from chdb import datastore as pd
from chdb.datastore.config import config, get_profiler

# Setup
config.enable_profiling()
config.enable_debug()  # Also see what's happening
profiler = get_profiler()

# Load data
profiler.clear()
print("=== Loading Data ===")
ds = pd.read_csv("sales_2024.csv")  # 10M rows
print(profiler.report())

# Query 1: Simple filter
profiler.clear()
print("\n=== Query 1: Simple Filter ===")
result1 = ds.filter(ds['amount'] > 1000).to_df()
print(profiler.report())

# Query 2: Complex aggregation
profiler.clear()
print("\n=== Query 2: Complex Aggregation ===")
result2 = (ds
    .filter(ds['amount'] > 100)
    .groupby('region', 'category')
    .agg({
        'amount': ['sum', 'mean', 'count'],
        'quantity': 'sum'
    })
    .sort('sum', ascending=False)
    .head(20)
    .to_df()
)
print(profiler.report())

# Summary
print("\n=== Summary ===")
print(f"Query 1: {len(result1)} rows")
print(f"Query 2: {len(result2)} rows")
