IP Anchor Section 2: Data sourcing & QD approval

Section 2a is your data description. Section 2b is the QD gate: without it, your IP is incomplete. You can't proceed without both.

What this week covers

Data quality determines backtest quality. This week teaches what belongs in Section 2a (detailed data sourcing and cleaning), the three major data biases (survivorship, look-ahead, and point-in-time), and why Section 2b (the QD approval gate) must be completed before you run any backtest.

Section 2a: What to include

Your data section should contain:

  • Data series: Asset name, exchange/source, frequency (daily/hourly/tick), date range, units
  • Cleaning steps: Outlier treatment, missing value handling, corporate action adjustments with specific methods
  • Feature construction: Derived variables, with exact formulas. Example: "5-day rolling z-score: (close_t - mean(close_t-5:t)) / std(close_t-5:t)"
  • Descriptive statistics: Mean, std, skewness, kurtosis, lag-1 autocorrelation of primary series

Worked example Section 2a: Corn futures

DATA SOURCING & CLEANING — Corn Futures Strategy

2a.1 Data series

Continuous corn futures (CBOT ZC), front-month contract rolled 5 trading days before first notice date. Daily close prices from 2010-01-01 to 2024-01-31. Units: cents per bushel. Source: Databento (live) / CME historical archive (backtest).

2a.2 Cleaning steps

  • Raw prices: unadjusted. Split/delivery adjustments N/A (futures).
  • Outlier detection: daily moves > 10% of 30-day rolling volatility flagged for inspection. None removed (verified legitimate).
  • Liquidity gap handling: contracts within 5 trading days of first notice are excluded (low liquidity, wide bid-ask spreads).
  • Missing days: none on exchange trading days (weekends and exchange holidays are not part of the series).
  • Roll handling: front month rolled 5 trading days before first notice date, matching 2a.1. New contract takes over seamlessly in TimescaleDB continuous aggregate.
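The roll rule can be made concrete with a small date calculation. This is a minimal sketch, not the pipeline's actual roll logic: the function name roll_date is hypothetical, the first notice date in the usage note is illustrative, and the holiday list is an assumption you would replace with the real exchange calendar.

```python
import numpy as np

def roll_date(first_notice, days_before=5, holidays=()):
    """Return the session `days_before` trading days before first notice.

    Weekends are skipped automatically; exchange holidays must be
    supplied by the caller (assumption: none are supplied by default).
    """
    fnd = np.datetime64(first_notice, "D")
    hol = np.array(list(holidays), dtype="datetime64[D]")
    # roll="backward" first snaps a non-trading first-notice date to the
    # previous trading day, then steps back `days_before` sessions.
    return np.busday_offset(fnd, -days_before, roll="backward", holidays=hol)
```

For an illustrative first notice date of 2023-11-30, roll_date("2023-11-30") counts back over weekdays only, while passing holidays=["2023-11-23"] moves the roll one session earlier. Writing the rule this way forces Section 2a to state which calendar it uses.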

2a.3 Features

  • Price momentum (5-day): (close_t / close_t-5) - 1
  • Volatility (30-day): std(log-returns_t-30:t) × sqrt(252) [annualized]
  • Seasonal factor: deviation from 10-year average price for calendar day-of-year

2a.4 Descriptive statistics (2010–2024 sample)

  • Mean daily return: 0.08%
  • Std dev: 1.42%
  • Skewness: -0.31 (left tail)
  • Kurtosis: 4.2 (fat tails, excess kurtosis = 1.2)
  • AC(1): 0.04 (minimal autocorrelation)
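These statistics can be produced in a few lines. The sketch below assumes returns is a pandas Series of daily returns; the function name describe_returns is illustrative. Note that pandas reports excess kurtosis, so 3 is added to recover the raw figure quoted alongside it.

```python
import pandas as pd

def describe_returns(returns: pd.Series) -> dict:
    """Summary statistics for a daily return series."""
    r = returns.dropna()
    excess_kurt = r.kurt()  # pandas returns excess (Fisher) kurtosis
    return {
        "mean": r.mean(),
        "std": r.std(),
        "skew": r.skew(),
        "excess_kurtosis": excess_kurt,
        "kurtosis": excess_kurt + 3.0,  # a normal distribution has 3.0
        "ac1": r.autocorr(lag=1),       # lag-1 autocorrelation
    }
```

Reporting both kurtosis figures, as the worked example does, avoids the common confusion about which convention a number follows.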

The three major data biases

1. Survivorship bias

What it is: Your backtest universe only includes assets that survived to today. You've excluded all the bankruptcies, delistings, and failures.

Example: You backtest a long-only equity strategy on the current S&P 500 constituents going back to 2010. But the current 500 are not the same 500 as 2010. Companies that were in the index in 2010 and have since been delisted are missing from your analysis. Your backtest systematically excludes the losers.

Impact: Your backtest Sharpe overstates live Sharpe. You are measuring 14 years of performance for today's winners rather than for the universe that actually existed in 2010, which included companies that later failed. The survivor set looks better.

How to fix: Use point-in-time constituent lists. If you want to backtest on S&P 500 from 2010–2024, get the actual S&P 500 constituents as they were on each date, not today's constituents.
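A point-in-time universe lookup might look like the sketch below. The membership table, its column names, and the tickers are all hypothetical; a real constituent history would come from your data provider.

```python
import pandas as pd

# Hypothetical membership history: one row per index-membership spell.
# removed = NaT means the ticker is still a member.
membership = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "added": pd.to_datetime(["2005-01-03", "2008-06-02", "2015-03-23"]),
    "removed": pd.to_datetime(["2012-09-28", None, None]),
})

def universe_as_of(membership: pd.DataFrame, date) -> list:
    """Return the tickers that were index members on `date`."""
    d = pd.Timestamp(date)
    still_in = membership["removed"].isna() | (membership["removed"] > d)
    active = (membership["added"] <= d) & still_in
    return sorted(membership.loc[active, "ticker"])
```

Calling universe_as_of at each rebalance date keeps "AAA" in the 2010 universe even though it later left the index. That is exactly the name a today's-constituents backtest would drop.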

2. Look-ahead bias

What it is: Using data in your signal that wasn't available at the time the signal fired.

Example: You build a signal using dividend-adjusted prices. But dividend adjustments are applied backward — the adjusted price on day T reflects information (the dividend) that was announced on day T+500. Your signal is using future information.

More examples:

  • Using the month's final close on day 1 of the month
  • Using year-end financial data to predict January returns
  • Using revised/final numbers instead of as-reported earnings
  • Using forward prices to trade spot on the same day (latency)

How to fix: Use as-reported data, not revised. Use prices available at the time. Build features on raw (not backward-adjusted) prices, and apply adjustments only to returns.
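One mechanical guard against look-ahead is to lag the signal before applying it to returns. The sketch below assumes the signal is computed from information available at the close of day t and that the position is then held over day t+1. The function name and that timing convention are illustrative assumptions, not a prescribed execution model.

```python
import pandas as pd

def lagged_strategy_returns(close: pd.Series, signal: pd.Series) -> pd.Series:
    """Apply yesterday's signal to today's close-to-close return."""
    ret = close / close.shift(1) - 1
    # The position earning today's return was decided at yesterday's close.
    position = signal.shift(1)
    return (position * ret).dropna()
```

Dropping the shift(1) pairs each return with a signal that already saw that day's close. This is one of the most common ways a backtest ends up with an implausibly high Sharpe.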

3. Point-in-time data

What it is: Fundamental data (earnings, balance sheets, analyst estimates) is released with a lag and revised afterward. If you backtest using the final revised figure, you're using data the market didn't have yet.

Example: A company reports Q1 earnings of $1.50 EPS on April 15. But the reported figure is later revised to $1.45 on July 1 due to restatement. If you use $1.45 in your backtest for April 15, you're using information that wasn't available on April 15.

How to fix: Use as-reported data on the as-reported date. Avoid revised figures. Many data providers (Bloomberg, FactSet) provide point-in-time data specifically for this reason.
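The EPS example can be expressed as an as-of lookup over data vintages. This sketch stores each figure with the date it became known and uses pandas merge_asof to return the latest value known on each query date. The year 2024, the column names, and the function name are assumptions for illustration.

```python
import pandas as pd

# Each row is one vintage of the same Q1 EPS figure,
# stamped with the date it became public (hypothetical year).
vintages = pd.DataFrame({
    "known_on": pd.to_datetime(["2024-04-15", "2024-07-01"]),
    "eps": [1.50, 1.45],
})

def eps_as_of(vintages: pd.DataFrame, dates) -> pd.Series:
    """Return the EPS value the market knew on each date."""
    query = pd.DataFrame({"date": pd.to_datetime(dates)}).sort_values("date")
    merged = pd.merge_asof(
        query,
        vintages.sort_values("known_on"),
        left_on="date",
        right_on="known_on",  # match the latest vintage at or before date
    )
    return merged.set_index("date")["eps"]
```

A backtest dated between April 15 and June 30 sees 1.50, the restated 1.45 appears only from July 1, and dates before the first release return NaN.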

Chart: Survivorship bias impact

The orange curve (survivors only) compounds faster than the blue curve (including delisted companies). The difference grows over time — by year 15, survivorship bias has inflated apparent returns by roughly 16%.

Section 2b: The QD approval gate

This is non-negotiable. You cannot submit an IP with incomplete Section 2b. This is not bureaucracy — it's a forcing function to catch infeasible research early.

Section 2b must contain:

  • Data availability: exact asset/frequency/date range, confirmed available from Databento or other source
  • Ingestibility: confirmation that data can be ingested into TimescaleDB at the required frequency
  • Alternative data licensing: if using satellite, social, proprietary data — confirmed licensed and available
  • Written sign-off from the Head of Data Engineering confirming all above

Example Section 2b

SECTION 2B: QD APPROVAL — Corn Futures Strategy

Data sourcing: Continuous corn futures (CBOT ZC) daily close from 2010-01-01 to 2024-01-31 is available via Databento at no additional cost (covered by the existing subscription). Historical archive verified via CME website.

Ingestibility: Daily OHLCV can be ingested into TimescaleDB continuous aggregate within 1 hour of daily close. Data-ngin pipeline has existing handling for commodity futures roll management.

Alternative data: None required.

Head of Data Engineering approval (Jane Smith, jane.smith@algogators.com):

"Confirmed: ZC daily data available, ingestible, covers requested date range. Rollover logic already in place. No additional pipeline work required. Approved for development. — JS"

Data cleaning Python patterns

Winsorizing outliers

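import pandas as pd  # df is assumed to be a DataFrame with a 'returns' column

# Cap extreme returns at ±3 standard deviations.
# Note: mean and std here use the full sample. If the clipped series feeds
# a backtest signal, estimate them on data available at each date instead.
mean_ret = df['returns'].mean()
std_ret = df['returns'].std()
df['returns_clean'] = df['returns'].clip(
    lower=mean_ret - 3*std_ret,
    upper=mean_ret + 3*std_ret
)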
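Clipping bounds estimated on the whole sample use future observations, which is itself a mild form of look-ahead bias. A sketch of one alternative follows: bounds from an expanding window of strictly past returns. The function name clip_expanding and the 20-observation warm-up are illustrative choices, not a recommended standard.

```python
import numpy as np
import pandas as pd

def clip_expanding(returns: pd.Series, n_std: float = 3.0,
                   min_periods: int = 20) -> pd.Series:
    """Clip each return using mean/std of prior returns only."""
    # shift(1) excludes today's return from its own bounds
    past = returns.shift(1).expanding(min_periods=min_periods)
    mean, std = past.mean(), past.std()
    # During warm-up the bounds are undefined, so leave values unclipped
    lower = (mean - n_std * std).fillna(-np.inf)
    upper = (mean + n_std * std).fillna(np.inf)
    return returns.clip(lower=lower, upper=upper)
```

The trade-off is that early bounds rest on little data. State the warm-up length in the cleaning steps so a reviewer can see how the first observations were treated.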

Missing value handling

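# Forward-fill prices (last known price)
df['close'] = df['close'].ffill()

# Do NOT forward-fill volume; average over the observations that exist.
# By default rolling(20) returns NaN if any value in the window is NaN,
# so set min_periods (15 here is an example threshold) to tolerate gaps.
df['volume_avg'] = df['volume'].rolling(20, min_periods=15).mean()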

Rolling z-score normalization

def rolling_zscore(series, window=60):
    return (series - series.rolling(window).mean()) / \
           series.rolling(window).std()

df['signal'] = rolling_zscore(df['close'], window=60)

Common mistakes

Five data sins

  • Using adjusted-close prices for signal construction. This is look-ahead bias. Use raw close and adjust only the final returns for splits and dividends.
  • Not documenting the exact date range. "2010–2024 S&P 500 data" is vague. "ZC CBOT front-contract, 2010-01-01 through 2024-01-31, rolled 5 days before first notice" is precise. Be precise.
  • Cleaning data by removing bad periods. "I removed 2008 because it was too volatile." Wrong. Document what happened (gap, halt, unusual volume) and handle it (forward-fill, interpolate). Don't just delete outliers.
  • Not checking historical depth of alternative data. If a satellite dataset goes back 10 years, that is about 2.5 market cycles, which is workable. If it only goes back 3 years, you have at most one bull/bear cycle, which is too short.
  • Incomplete Section 2b. If you don't have written QD approval, your IP is rejected. No exceptions.