Write Section 2 of your IP: precise data description and the QD approval gate. Understand the three major data biases and how to avoid them.
Data quality determines backtest quality. This week teaches what belongs in Section 2a (detailed data sourcing and cleaning), the three major data biases (survivorship, look-ahead, and point-in-time), and why Section 2b (the QD approval gate) must be completed before you run any backtest.
Your data section should contain:
DATA SOURCING & CLEANING — Corn Futures Strategy
2a.1 Data series
Continuous corn futures (CBOT ZC), front-month contract rolled 5 trading days before first notice date. Daily close prices from 2010-01-01 to 2024-01-31. Units: cents per bushel. Source: Databento (live) / CME historical archive (backtest).
2a.2 Cleaning steps
2a.3 Features
2a.4 Descriptive statistics (2020–2024 sample)
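The roll rule stated in 2a.1 (roll 5 trading days before first notice date) is worth writing down as code so reviewers can reproduce it. A minimal sketch, assuming weekends are the only non-trading days; `roll_date` is a hypothetical helper, and the first notice date in the usage note is illustrative rather than taken from an exchange calendar.

```python
import numpy as np
import pandas as pd

def roll_date(first_notice_date, days_before=5):
    """Return the date `days_before` business days before first notice.

    Assumption: only weekends are skipped. A production version would pass
    the exchange holiday list through the `holidays` argument of
    np.busday_offset.
    """
    fnd = np.datetime64(pd.Timestamp(first_notice_date).date(), "D")
    # roll="backward" first moves a weekend date to the prior business day,
    # then steps back `days_before` business days.
    rolled = np.busday_offset(fnd, -days_before, roll="backward")
    return pd.Timestamp(rolled)
```

For example, `roll_date("2024-02-29")` counts back five weekdays from a Thursday and lands on the previous Thursday.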
What it is: Survivorship bias arises when your backtest universe only includes assets that survived to today. All the bankruptcies, delistings, and failures have been excluded.
Example: You backtest a long-only equity strategy on the current S&P 500 constituents going back to 2010. But the current 500 are not the same 500 as 2010. Companies that were in the index in 2010 and have since been delisted are missing from your analysis. Your backtest systematically excludes the losers.
Impact: Your backtest Sharpe overstates live Sharpe. You are measuring 14 years of today's winners instead of 14 years of the universe as it actually stood on each date, which included some failures. The survivor set looks better than anything you could have traded at the time.
How to fix: Use point-in-time constituent lists. If you want to backtest on S&P 500 from 2010–2024, get the actual S&P 500 constituents as they were on each date, not today's constituents.
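A point-in-time constituent list is usually stored as membership intervals rather than daily snapshots. The sketch below shows the lookup under that assumption; the `membership` table, its column names, the tickers, and the dates are all hypothetical placeholders, not real index history.

```python
import pandas as pd

# Hypothetical membership table: one row per stay in the index.
# end_date is NaT while the ticker is still a member.
membership = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "start_date": pd.to_datetime(["2005-01-03", "2008-06-02", "2015-03-23"]),
    "end_date": pd.to_datetime(["2016-09-01", None, None]),
})

def constituents_on(membership, date):
    """Return the tickers that were index members on `date`."""
    date = pd.Timestamp(date)
    joined = membership["start_date"] <= date
    not_yet_left = membership["end_date"].isna() | (membership["end_date"] > date)
    return sorted(membership.loc[joined & not_yet_left, "ticker"])
```

Calling `constituents_on(membership, "2010-01-04")` returns AAA and BBB, including AAA even though it later left the index, which is exactly the name a today's-constituents backtest would drop.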
What it is: Look-ahead bias means using data in your signal that wasn't available at the time the signal fired.
Example: You build a signal using dividend-adjusted prices. But dividend adjustments are applied backward — the adjusted price on day T reflects information (the dividend) that was announced on day T+500. Your signal is using future information.
More examples:
How to fix: Use as-reported data, not revised data. Use prices that were available at the time. Build features on raw (not backward-adjusted) prices, and apply adjustments only to returns.
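One of the most common look-ahead mistakes is multiplying today's signal by today's return. A minimal sketch of the safer alignment, assuming daily close-to-close returns and a signal computed from data through the close; `strategy_returns` is an illustrative helper, not part of any library.

```python
import pandas as pd

def strategy_returns(close, signal):
    """Daily strategy returns with the position lagged by one bar."""
    # Return earned between the close at T-1 and the close at T.
    asset_ret = close / close.shift(1) - 1
    # The position held over that interval can only use the signal known at T-1.
    position = signal.shift(1)
    return position * asset_ret

# Tiny illustrative series (made-up numbers).
close = pd.Series([100.0, 110.0, 99.0])
signal = pd.Series([1.0, -1.0, 1.0])
lagged = strategy_returns(close, signal)
```

Without the `shift(1)`, the short signal on the second day would be credited against the +10% move that produced it, which is information the strategy did not have when the position was opened.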
What it is: Point-in-time bias comes from fundamental data (earnings, balance sheets, analyst estimates), which is released with a lag and revised afterward. If you backtest using the final revised figure, you're using data the market didn't have yet.
Example: A company reports Q1 earnings of $1.50 EPS on April 15. The figure is later revised to $1.45 on July 1 due to a restatement. If you use $1.45 in your backtest for April 15, you're using information that wasn't available on April 15.
How to fix: Use as-reported data on the as-reported date. Avoid revised figures unless you apply them only from their revision date onward. Many data providers (Bloomberg, FactSet) provide point-in-time data specifically for this reason.
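The EPS example maps directly onto an as-of join: for each backtest date, take the latest value whose publication date is on or before it. A sketch using `pandas.merge_asof`; the table layout and column names are assumptions, and the year is added only so the dates parse.

```python
import pandas as pd

# Hypothetical point-in-time table: each value with the date it became known.
eps_history = pd.DataFrame({
    "known_date": pd.to_datetime(["2024-04-15", "2024-07-01"]),
    "eps": [1.50, 1.45],
})

backtest_dates = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-04-10", "2024-04-15", "2024-06-28", "2024-07-01"]
    ),
})

# direction="backward" picks the most recent row with known_date <= date.
# Both frames must be sorted on their key columns.
pit = pd.merge_asof(
    backtest_dates,
    eps_history,
    left_on="date",
    right_on="known_date",
    direction="backward",
)
```

Here `pit["eps"]` is missing before April 15, equals 1.50 from April 15 through June 28, and switches to 1.45 only on July 1, so the restated figure never leaks into earlier dates.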
The orange curve (survivors only) compounds faster than the blue curve (including delisted companies). The difference grows over time — by year 15, survivorship bias has inflated apparent returns by roughly 16%.
Section 2b is non-negotiable. You cannot submit an IP with an incomplete Section 2b. This is not bureaucracy; it is a forcing function to catch infeasible research early.
Section 2b must contain:
SECTION 2B: QD APPROVAL — Corn Futures Strategy
Data sourcing: Continuous corn futures (CBOT ZC) daily close from 2010-01-01 to 2024-01-31 is available via Databento at no additional cost (covered by the existing subscription). Historical archive verified via CME website.
Ingestibility: Daily OHLCV can be ingested into a TimescaleDB continuous aggregate within 1 hour of the daily close. The Data-ngin pipeline already handles commodity futures roll management.
Alternative data: None required.
Head of Data Engineering approval (Jane Smith, jane.smith@algogators.com):
"Confirmed: ZC daily data available, ingestible, covers requested date range. Rollover logic already in place. No additional pipeline work required. Approved for development. — JS"
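import numpy as np

# Assumes df is a pandas DataFrame with a 'returns' column.
# Cap extreme returns at ±3 standard deviations.
# Note: mean and std here are full-sample, so they include future data;
# use trailing statistics if returns_clean feeds a signal.
mean_ret = df['returns'].mean()
std_ret = df['returns'].std()
df['returns_clean'] = df['returns'].clip(
    lower=mean_ret - 3 * std_ret,
    upper=mean_ret + 3 * std_ret,
)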
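
# Forward-fill prices (last known price)
df['close'] = df['close'].ffill()
# Do NOT forward-fill volume; average over the observations that exist.
# By default rolling(20) returns NaN if any value in the window is NaN,
# so set min_periods to tolerate missing days (10 is an illustrative choice).
df['volume_avg'] = df['volume'].rolling(20, min_periods=10).mean()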

def rolling_zscore(series, window=60):
    """Z-score of each value against its trailing window (current bar included)."""
    rolling = series.rolling(window)
    return (series - rolling.mean()) / rolling.std()

df['signal'] = rolling_zscore(df['close'], window=60)
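
Capping returns with full-sample statistics quietly uses the whole history, which is the look-ahead problem described earlier. If the cleaned returns feed a signal, one alternative is to compute the bounds from a trailing window that ends the day before. This is a sketch under assumptions, not the course's prescribed method: `winsorize_trailing`, the 252-day window, and the `min_periods` value are all illustrative.

```python
import numpy as np
import pandas as pd

def winsorize_trailing(returns, window=252, n_std=3.0, min_periods=60):
    """Clip each return to trailing mean ± n_std * std, using past data only."""
    rolling = returns.rolling(window, min_periods=min_periods)
    # shift(1) makes the bounds at T depend on observations through T-1.
    mean = rolling.mean().shift(1)
    std = rolling.std().shift(1)
    # Where there is not enough history yet, leave the value unclipped.
    lower = (mean - n_std * std).fillna(-np.inf)
    upper = (mean + n_std * std).fillna(np.inf)
    return returns.clip(lower=lower, upper=upper)
```

Usage mirrors the original cleaning step: `df['returns_clean'] = winsorize_trailing(df['returns'])`. Because of the one-day shift, an extreme return cannot widen its own bounds.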