Supercharging SQL Server Data Pipelines: Arrow Support in mssql-python

Faster, Leaner Data Fetching with Apache Arrow

Retrieving a million rows from SQL Server into a Polars DataFrame used to be a sluggish process: the driver created a Python object for every value in every row, the garbage collector had to track all of them, and the objects were discarded as soon as the DataFrame was built. That era is over. The mssql-python driver now supports fetching SQL Server data directly as Apache Arrow structures, offering a faster and more memory-efficient path for anyone working with Polars, Pandas, DuckDB, or other Arrow-native libraries. This feature was contributed by community developer Felix Graßl (@ffelixg), and we are excited to make it available.

Source: devblogs.microsoft.com

Understanding the Core Concepts

Before diving into the benefits, it helps to clarify the technology that underpins this new capability.

What Is Apache Arrow?

The key insight behind Apache Arrow is zero-copy language interoperability. Arrow defines a stable shared-memory layout — the Arrow C Data Interface, a cross-language ABI — that any language can produce or consume by exchanging a pointer. This eliminates serialization, copies, and re-parsing. A C++ database driver and a Python DataFrame library can operate on the exact same memory without either one knowing about the other.
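The idea of two components operating on the same memory can be illustrated with Python's own buffer protocol. This is only an analogy, not the Arrow C Data Interface itself: the `array` object plays the role of the producer and the `memoryview` plays the role of the consumer.

```python
# Analogy only: Python's buffer protocol, not the Arrow C Data Interface.
from array import array

# Producer: a typed, contiguous buffer of 64-bit integers.
column = array("q", [10, 20, 30, 40])

# Consumer: a memoryview shares the producer's memory instead of copying it.
view = memoryview(column)

# A write through the producer is visible to the consumer immediately,
# which shows that both names refer to the same underlying bytes.
column[0] = 99
shared_first_value = view[0]

# Slicing a memoryview is also copy-free: it is a window onto the same buffer.
tail = view[2:]
tail_values = tail.tolist()

print(shared_first_value)  # 99
print(tail_values)         # [30, 40]
```

Arrow applies the same principle across language boundaries, with a layout that C++, Python, Rust, and others all agree on.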

Built on this foundation, Arrow uses a columnar in-memory format: instead of representing a table as a list of rows (each row a collection of Python objects), Arrow stores all values for a column contiguously in a typed buffer. Nulls are tracked in a compact bitmap rather than per-cell None objects.
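To make the layout concrete, here is a minimal sketch in pure Python that packs a row-oriented result into typed column buffers with a validity bitmap. The data and the helper names (`to_column`, `is_valid`) are made up for illustration, and real Arrow buffers carry more metadata than this, but the bit ordering follows Arrow's convention (1 means valid, least-significant bit first).

```python
from array import array

# Row-oriented result, as a classic fetchall() might return it (made-up data).
rows = [(1, 9.5), (2, None), (3, 7.25), (4, None), (5, 1.0)]

def to_column(values, typecode):
    """Pack one column into a typed buffer plus a validity bitmap.

    Bit i of the bitmap is 1 when value i is present and 0 when it is null,
    least-significant bit first, which mirrors Arrow's convention.
    """
    data = array(typecode)
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is None:
            data.append(0)  # placeholder slot; the bitmap marks it as null
        else:
            data.append(value)
            bitmap[i // 8] |= 1 << (i % 8)
    return data, bitmap

def is_valid(bitmap, i):
    """Return True when bit i of the validity bitmap is set."""
    return bool(bitmap[i // 8] >> (i % 8) & 1)

ids, ids_valid = to_column([r[0] for r in rows], "q")
scores, scores_valid = to_column([r[1] for r in rows], "d")

null_count = sum(not is_valid(scores_valid, i) for i in range(len(rows)))
print(list(ids))             # [1, 2, 3, 4, 5]
print(bin(scores_valid[0]))  # 0b10101
print(null_count)            # 2
```

Five nullable values need only a single bitmap byte, where the row-oriented form holds a separate reference to `None` in every null cell.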

For a database driver, this means the entire fetch loop can run in C++ and write values directly into Arrow buffers, with no Python object creation per row and no garbage-collector pressure. The DataFrame library receives a pointer to that memory and can begin operating on it immediately. Crucially, subsequent operations such as filters, joins, and aggregations also read those columnar buffers directly, without converting them back into Python objects. A Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage, making Arrow the right foundation for high-throughput data processing pipelines.

Four Concrete Benefits for Users

1. Speed

The columnar fetch path avoids Python object creation per row, making fetching noticeably faster for many SQL Server types — especially temporal types like DATETIME and DATETIMEOFFSET, where Python-side per-value conversions are eliminated entirely.
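Temporal types benefit because Arrow timestamp columns store each value as a 64-bit integer count of time units since the Unix epoch, so no `datetime` object has to exist until someone asks for one. The sketch below imitates that representation with the standard library; it illustrates the idea and is not the driver's conversion code.

```python
from array import array
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_micros(ts):
    """Microseconds since the Unix epoch, using exact integer arithmetic."""
    return (ts - EPOCH) // timedelta(microseconds=1)

stamps = [
    datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 0, 0, 1, 500000, tzinfo=timezone.utc),
]

# Columnar form: one 64-bit integer per value in a contiguous buffer.
column = array("q", (to_micros(ts) for ts in stamps))

# Arithmetic works on the raw integers; no datetime objects are involved.
delta_us = column[1] - column[0]

# An object is only built when a caller asks for an individual value.
first = EPOCH + timedelta(microseconds=column[0])

print(delta_us)  # 1500000
```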

2. Lower Memory Usage

A column of one million integers is a single contiguous C array, not a million individual Python objects. This drastically reduces memory consumption and GC overhead.
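A rough way to see the difference is to compare the two representations with the standard library alone. The sketch below is an estimate rather than a benchmark: `sys.getsizeof` reports per-object sizes on CPython, and the exact figures vary by Python version and platform.

```python
import sys
from array import array

n = 1_000_000
values = list(range(n))

# Row-style storage: a list of pointers plus one Python int object per value.
object_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Column-style storage: one contiguous buffer of 64-bit integers.
column = array("q", values)
column_bytes = column.itemsize * len(column)

print(f"Python objects: {object_bytes / 1e6:.1f} MB")
print(f"Typed column:   {column_bytes / 1e6:.1f} MB")
```

On a typical 64-bit CPython build, the typed column takes 8 MB, while the object-based representation is several times larger.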

3. Seamless Interoperability

Arrow-native libraries like Polars, Pandas (via ArrowDtype), DuckDB, Hugging Face datasets, and others can consume the data directly without conversion. This streamlines multi-tool workflows.

4. Simplified Data Pipelines

By eliminating intermediate serialization steps, you can build cleaner, more efficient ETL processes. The same Arrow buffers can be passed between systems without re-formatting.

Getting Started with Arrow in mssql-python

To fetch results as Arrow data, execute the query as usual and then call the cursor's Arrow fetch method. For example:

import mssql_python

# Connection details are placeholders; adjust authentication for your environment.
conn = mssql_python.connect("SERVER=localhost;DATABASE=test;Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute("SELECT * FROM large_table")
arrow_table = cursor.fetch_arrow()  # Returns a PyArrow Table

You can then pass arrow_table directly to Polars, Pandas, or other Arrow-compatible tools.

Polars Integration Example

import polars as pl
df = pl.from_arrow(arrow_table)  # zero-copy for most column types

This avoids the overhead of creating a Python list of tuples and then constructing a DataFrame from it. The saving grows with the size of the result set, so large datasets benefit the most.

What’s Next?

The mssql-python team continues to enhance Arrow support, with plans to cover more SQL Server data types and optimize further. Community contributions are welcome — Felix Graßl’s work is a great example of how open-source collaboration accelerates innovation.

Explore the mssql-python repository for full documentation and examples.
