ENH: Functionality to aid with Database Imports #61247

mwiles217 · 2025-04-07T19:46:37Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

I would like the following features to aid database imports: 1) identification of whether each column contains Unicode (non-ASCII) characters; 2) the maximum length of each column, allowing for cases where the length measured in the DataFrame understates the length once a multi-value cell (such as a list of states) is written to a file; 3) creation of the CREATE TABLE statement and supporting statements; 4) creation of the BCP file (tab delimited, with some caveats), its supporting FMT file, and the command line to execute it; 5) replacement of characters that prevent import (namely \r, \n, \t) during the data load step, where it may be faster than running a regex later; 6) renaming of columns by stripping or replacing certain characters, similar to the way R has a rename-all-columns function.
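Item 5 above could be sketched roughly as follows. This is only an illustration, assuming every column was loaded as text; the helper names `escape_controls` and `restore_controls` and the `{r}` token are hypothetical, while `{n}` and `{t}` follow the placeholders suggested later in this description.

```python
import pandas as pd

# Control characters that break a tab-delimited import, mapped to placeholder tokens.
# The token spellings are illustrative assumptions, not an existing pandas convention.
TOKENS = {"\r": "{r}", "\n": "{n}", "\t": "{t}"}


def escape_controls(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with \\r, \\n and \\t replaced by tokens (assumes text columns)."""
    out = df.copy()
    for col in out.columns:
        for raw, token in TOKENS.items():
            out[col] = out[col].str.replace(raw, token, regex=False)
    return out


def restore_controls(df: pd.DataFrame) -> pd.DataFrame:
    """Reverse escape_controls by putting the original characters back."""
    out = df.copy()
    for col in out.columns:
        for raw, token in TOKENS.items():
            out[col] = out[col].str.replace(token, raw, regex=False)
    return out


notes = pd.DataFrame({"note": ["line1\nline2", "a\tb"]})
escaped = escape_controls(notes)
restored = restore_controls(escaped)
print(escaped["note"].tolist())
```

The round trip is only exact when the data does not already contain the literal tokens, so a real implementation would need tokens that cannot occur in the source.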

I have code written that does most of this. For context, I used a DataFrame as a key step for importing data into a SQL Server database at a rate of about 2.5 GB per hour. The philosophy was to load all columns as strings and convert them once they are in the database, because otherwise important leading zeroes could be dropped (as in routing numbers or other custom indicators). The import into SQL Server used bcp, which essentially consumes a tab-delimited file. The code provided is not the code I originally used, as I lost access to it; it is a recreation with major refactoring to make it simpler.
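The leading-zero concern can be shown with a small example. This is a minimal sketch using made-up sample data; `dtype=str` is the same `read_csv` option used in the code further down.

```python
import io

import pandas as pd

# Made-up sample data with leading zeroes in the first column.
csv_text = "routing_number,amount\n011000015,10.50\n000123456,7.00\n"

# Default type inference parses the routing number as an integer and drops the zeroes.
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every cell exactly as it appears in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["routing_number"].tolist())
print(as_text["routing_number"].tolist())
```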

Perhaps build into describe a flag for whether or not a column contains Unicode characters, so you know whether to make a field varchar or nvarchar.
Perhaps build into describe the max length of each column. NOTE: in edge conditions there is a discrepancy between the max length measured in a DataFrame and the length once the DataFrame is written back to a file and imported into a database. This happens with multi-value columns such as a list of US states. The workaround that worked 99% of the time was to multiply the length by 1.3 and then round up to the nearest 100 (divide by 100, apply math.ceil, then multiply by 100).
Items 1 and 2 should be available in the same place so they are easy to consume when scripting the rest of your own solution.
Perhaps when loading a DataFrame there could be an option to replace certain characters to get it ready for import. It could probably be optimized there and run faster than running a regex after the fact. I'm talking about replacing \r, \n and \t with a space, or perhaps with {n} and {t} respectively so they can easily be put back after import.
A built-in column rename function that strips or replaces bad characters in column headers and ensures uniqueness. For example, ( could be stripped, but you may want to replace @ with "at", and Unicode characters may need to be stripped. Essentially, the goal is that you don't need [] around field names in a SQL Server script for those columns. Perhaps this function could have options, as many people may want the renaming done differently. Expose this replace functionality so it can be used standalone as needed, for example to name a table after its filename.
I think the above could provide good building blocks for people to script the rest themselves, with the grunt work already done.
Perhaps automatically add the filename without extension, as well as the row number, when opening or saving, as sometimes you need them for debugging or you literally need to reference the previous or next row.
Have the ability to draft a create table statement from the information above and save it to a .sql file, optionally adding an ID column. Also add in the rename of the table if it already exists, by appending its timestamp including milliseconds, and then transfer it to a different schema. Extra credit for creating the schema if it doesn't already exist. That part is to assist with autocomplete tools and with tracking changes to the data, which was helpful for some disputes. Also make sure to strip naughty characters from the table name, but use the filename without extension as the default table name.
Have the ability to create a bcp fmt file which is a mapping between the table and the file
Save the appropriate commands to a bat file for executing a bcp file as well as running the create table statement.
Perhaps a to_bcp option that in addition to the above several points also creates the tab delimited no header file ensuring that things like tabs and newline and form feed are replaced.
Perhaps, for the last several points, the to_bcp function could automatically create the other needed files when called.
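The first two building blocks in this list (a Unicode flag and a max length per column, available in one place) could look something like the sketch below. The function name `profile_columns` and its output column names are hypothetical, not an existing pandas API, and "Unicode" is taken here to mean any character outside 7-bit ASCII.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its max character length and a non-ASCII flag."""
    rows = []
    for name in df.columns:
        # Missing values are skipped rather than counted as the text "nan".
        text = df[name].dropna().astype(str)
        max_len = int(text.str.len().max()) if len(text) else 0
        has_non_ascii = bool(text.str.contains(r"[^\x00-\x7F]", regex=True).any())
        rows.append({"column": name, "max_length": max_len, "has_non_ascii": has_non_ascii})
    return pd.DataFrame(rows)


sample = pd.DataFrame({"city": ["Zürich", "Boston"], "states": ["NY,NJ,CT", None]})
profile = profile_columns(sample)
print(profile)
```

Returning a DataFrame keeps both measurements together, so a script can iterate over the rows to pick varchar or nvarchar and a length for each column.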

Feature Description

The below is working code that does 95% of what I requested above. So this request ultimately isn't for me, but the community.

Things requested from above that are omitted from the code below are:

Folding into the generated .sql file the creation of a backup schema if it doesn't exist, then renaming the existing table, if found, by appending its creation date including milliseconds, and then transferring the table to that schema.
Folding an auto ID column into the create table statement, with the corresponding changes to the FMT file.
Adding two helpful columns: a) the filename without extension, and b) the row number within the file.
Expansion of the column rename to do smarter replacements vs just stripping the characters (like replacing @ with at), and then checking that all column names are still unique.
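The last omitted item (smarter replacements plus a uniqueness check) might be sketched like this. The replacement table and the function names `clean_identifier` and `clean_unique` are hypothetical choices for illustration; uniqueness is checked case-insensitively on the assumption that the target database compares identifiers that way.

```python
import re

# Illustrative replacements; anyone adopting this would choose their own.
REPLACEMENTS = {"@": "_at_", "&": "_and_", "%": "_pct_", "#": "_num_"}


def clean_identifier(name: str) -> str:
    """Turn an arbitrary header into a name that needs no [] quoting."""
    for char, word in REPLACEMENTS.items():
        name = name.replace(char, word)
    # Anything that is not an ASCII letter, digit, or underscore becomes an underscore.
    name = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip())
    name = re.sub(r"_{2,}", "_", name).strip("_")
    if not name or name[0].isdigit():
        name = "c_" + name
    return name


def clean_unique(names):
    """Clean every name and append _2, _3, ... until each result is unique."""
    used = set()
    result = []
    for original in names:
        base = clean_identifier(original)
        candidate, n = base, 1
        while candidate.lower() in used:
            n += 1
            candidate = f"{base}_{n}"
        used.add(candidate.lower())
        result.append(candidate)
    return result


cleaned = clean_unique(["E-mail @ work", "Email (work)", "email work", "2024 total"])
print(cleaned)
```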

Also, this code generates what appears to be acceptable output, but I haven't tested it with an actual import into SQL Server.

import math
import os
import re
import uuid
from typing import List

import pandas as pd

# Any character outside the 7-bit ASCII range.
rx_unicode_str:str="[^\x00-\x7F]"
rx_unicode=re.compile(rx_unicode_str)

rx_space_str:str=r"\s"
rx_space=re.compile(rx_space_str)

rx_underscore_str:str="_{2,}"
rx_underscore=re.compile(rx_underscore_str)

# Punctuation stripped from table and column names; a character class avoids escaping each alternative.
rx_strip_chars_str:str=r"[!@#$%^&*(){}\[\].|;:'\",<>?=+]"
rx_strip_chars=re.compile(rx_strip_chars_str)
class column_info:
    """Metadata for one column: names, lengths, and the SQL Server type decision."""
    def __init__(self,arg_column_name):
        self.guid:str=str(uuid.uuid4())
        self.column_name:str=arg_column_name
        self.column_name_orig:str=arg_column_name
        self.max_length:int=0
        self.max_length_fixed:int=0
        self.has_unicode:bool=False
        self.sql_max:bool=False
        self.ColumnIndex_1Based:int=0
        self.last_column:bool=False

    def as_create_table(self):
        # One column definition line; every column after the first gets a leading comma.
        datatype:str="NVARCHAR" if self.has_unicode else "VARCHAR"
        comma:str="," if self.ColumnIndex_1Based>1 else ""
        data_length:str="MAX" if self.sql_max else str(self.max_length_fixed)
        return "{c}[{name}] {t}({l})".format(c=comma,name=self.column_name,t=datatype,l=data_length)

    def as_fmt_file(self):
        datatype:str="SQLNCHAR" if self.has_unicode else "SQLCHAR"
        data_length:str=str(self.max_length_fixed)
        if self.sql_max:
            data_length="4000" if self.has_unicode else "8000"
        # Non-XML format file row, for example: 1 SQLINT 0 4 "\t" 1 "ID" ""
        idx:str=str(self.ColumnIndex_1Based).ljust(6)
        host_type:str=datatype.ljust(20)
        datalen:str=data_length.ljust(10)
        name:str=('"' + self.column_name + '"').ljust(75)
        # The terminator is written as an escape sequence, not as a real tab or newline.
        sep:str="\\t" if not self.last_column else "\\r\\n"
        sep=('"' + sep + '"').ljust(8)
        # The server column order (same index here) precedes the column name, as in the example row.
        return "{idx}{type} 0 {datalen} {sep} {idx}{name} \"\"".format(idx=idx,type=host_type,datalen=datalen,sep=sep,name=name)

    def as_dict(self):
        dict_ret={k: v for k, v in self.__dict__.items() if k not in ["exclude_me"]}
        return dict_ret

class csv_info:
    """Reads a CSV as text and writes the .sql, .fmt, .bcp, and .bat files for a bcp import."""
    def __init__(self,arg_filename:str):
        # Placeholder connection settings; replace them before running the generated .bat file.
        self.Database:str="myDB"
        self.Server:str="myServer"
        self.UserName:str="myUser"
        self.Password:str="myPass"
        self.filename:str=arg_filename
        self.output_directory:str=""
        self.bcp_filename:str=""
        self.fmt_filename:str=""
        self.parent_directory,self.filename_with_extension=os.path.split(arg_filename)
        self.filename_wo_extension,self.file_extension=os.path.splitext(self.filename_with_extension)
        self.table_name:str=self.fix_name(self.filename_wo_extension)
        self.change_output_directory(self.parent_directory)

        # Read every column as text so values such as leading zeroes survive.
        self.df:pd.DataFrame=pd.read_csv(arg_filename,dtype=str)
        self.Columns:List[column_info]=[]
        max_lengths = self.df.apply(lambda x: x.astype(str).str.len().max())
        for column_index,col in enumerate(self.df.columns):
            new_col=column_info(col)
            new_col.max_length=int(max_lengths.iloc[column_index])
            # Pad the observed length by 30%, then round up to the next 100 (minimum 100).
            new_col.max_length_fixed=int(new_col.max_length * 1.3)
            new_col.max_length_fixed=100 if new_col.max_length_fixed<=100 else int(math.ceil(new_col.max_length_fixed/100)*100)
            # na=False treats missing cells as containing no non-ASCII characters.
            new_col.has_unicode=bool(self.df[col].str.contains(rx_unicode, regex=True, na=False).any())
            new_col.sql_max=new_col.max_length_fixed>=8000 or (new_col.max_length_fixed>=4000 and new_col.has_unicode)
            new_col.ColumnIndex_1Based=column_index+1
            new_col.last_column=len(self.df.columns)==(column_index + 1)
            self.Columns.append(new_col)
        self.fix_column_names()
        self.to_bcp()

    def change_output_directory(self,arg_output_directory:str):
        self.output_directory=arg_output_directory
        self.bcp_filename=os.path.join(self.output_directory,self.filename_wo_extension + ".bcp")
        self.fmt_filename=os.path.join(self.output_directory,self.filename_wo_extension + ".fmt")

    def as_create_table(self):
        create_table:str="CREATE TABLE [{t}](\n".format(t=self.table_name)
        for col in self.Columns:
            create_table += col.as_create_table() + "\n"
        of=os.path.join(self.output_directory,self.filename_wo_extension + ".sql")
        with open(of,"w",encoding="utf-8") as f:
            f.write(create_table + ")")

    def as_fmt_file(self):
        # Header: bcp format version, then the number of columns.
        fmt_file:str="14.0\n{l}\n".format(l=str(len(self.df.columns)))
        for col in self.Columns:
            fmt_file += col.as_fmt_file() + "\n"
        with open(self.fmt_filename,"w") as f:
            f.write(fmt_file)

    def to_bcp(self):
        # Write the .sql file as well, since the .bat file below runs it.
        self.as_create_table()
        self.as_fmt_file()
        of:str=os.path.join(self.output_directory,self.table_name + ".bat")
        with open(of,"w",encoding="utf-8") as f:
            f.write(self.sql_cmd() + "\n")
            f.write(self.bcp_import_cmd() + "\n")
        # Tabs and line breaks inside cells would be read as field or row terminators.
        self.df.replace(to_replace=r"\t|\n|\r",value=" ",regex=True,inplace=True)
        self.df.to_csv(path_or_buf=self.bcp_filename, sep="\t",index=False,header=False)

    def fix_name(self,val:str):
        ret:str=val.strip().replace("-","_")
        ret=rx_unicode.sub("",ret)
        ret=rx_space.sub("_",ret)
        ret=rx_underscore.sub("_",ret)
        ret=rx_strip_chars.sub("",ret)
        return ret

    def fix_column_names(self):
        for col in self.Columns:
            col.column_name=self.fix_name(col.column_name)

    def bcp_import_cmd(self):
        # NOTE: -T (trusted connection) and -U/-P are alternative ways to authenticate; keep only one.
        f_error=os.path.join(self.output_directory,self.filename_wo_extension + "_import_errors.txt")
        bcp_str:str= "bcp {db}.dbo.{t} in \"{f_bcp}\" -f \"{f_fmt}\" -T -C 65001 -S {server_name} -U {username} -P {password} -e\"{f_error}\"".format(
            db=self.Database,t=self.table_name,f_bcp=self.bcp_filename,f_fmt=self.fmt_filename,f_error=f_error
            ,server_name=self.Server,username=self.UserName,password=self.Password)
        return bcp_str

    def sql_cmd(self):
        # Must point at the same .sql file that as_create_table writes.
        f=os.path.join(self.output_directory,self.filename_wo_extension + ".sql")
        ret:str="sqlcmd -S {s} -U {u} -P {p} -i \"{f}\"".format(s=self.Server,u=self.UserName,p=self.Password,f=f)
        return ret

input_file:str=r"C:\data\LargeCSVFile\customers-2000000.csv"
cv=csv_info(input_file)

Alternative Solutions

See the previous section for the alternative solution of custom-written code.

Additional Context

No response


Liam3851 · 2025-04-08T20:22:49Z

Pandas core typically stays database-agnostic, calling through SQLAlchemy, and the logic above is very tightly coupled to Microsoft SQL Server. However, the bcpandas project exists to speed up SQL Server-specific (particularly BCP-based) functionality in pandas and is listed on the ecosystem page. If you have additional SQL Server-specific functionality that is not currently covered by that project, they may be interested in contributions.

mwiles217 · 2025-04-09T00:42:53Z

Understood. Thank You.

What about the 2 higher level requests that may help many with additional processing and could be classified as database agnostic?

Easily be able to tell which columns have Unicode characters in them.
Getting the true max length of a column when the column can contain a list of items, like a list of states. When a DataFrame is saved to a file with all the special characters needed, the length in the file can be greater than the length reported through DataFrame operations. That length discrepancy caused an import error, which required the workaround I came up with: multiply by 1.3 and then round up to the nearest 100, with a minimum of 100. An import into any system would potentially have a similar issue.
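The arithmetic of that workaround can be written out as a small helper. This is only a sketch of the rule as described in this comment; the name `padded_length` and its parameters are hypothetical, and results can differ by one step from the posted code, which truncates to an integer before rounding.

```python
import math


def padded_length(observed: int, factor: float = 1.3, step: int = 100) -> int:
    """Multiply by the safety factor, round up to the next step, and enforce a minimum of one step."""
    return max(step, math.ceil(observed * factor / step) * step)


# 250 characters -> 325 after padding -> rounded up to 400.
print(padded_length(250))
# Short columns still get the minimum width of 100.
print(padded_length(10))
```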

Liam3851 · 2025-04-09T19:33:09Z

Re 1, pd.Series.str.is_ascii was added in #60532, this feature would presumably be the negation.

Re 2, if bcpandas does not already have functionality to set the column sizes of a new table correctly for non-ASCII characters, I imagine that may be a welcome improvement. That said, I believe Microsoft SQL Server now supports UTF-8, and the column type for UTF-8 data in SQL Server is varchar, not nvarchar (though it does seem like you need the true byte length for the field).
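The difference between character length and byte length mentioned here can be measured directly. A minimal sketch, assuming UTF-8 as the target encoding and using made-up sample values:

```python
import pandas as pd

# Made-up sample values; missing cells are dropped before measuring.
text = pd.Series(["Zürich", "東京", "Boston", None]).dropna()

# Number of characters, which is what .str.len() reports.
char_len = text.str.len()

# Number of bytes once each value is encoded as UTF-8.
byte_len = text.map(lambda value: len(value.encode("utf-8")))

summary = pd.DataFrame({"value": text, "chars": char_len, "utf8_bytes": byte_len})
print(summary)
```

For pure ASCII values the two counts agree, so the gap only shows up in columns that contain non-ASCII characters.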

mwiles217 added the Enhancement and Needs Triage labels Apr 7, 2025

mwiles217 changed the title ~~ENH:~~ ENH: Functionality to aid with Database Imports Apr 7, 2025

simonjayhawkins added the IO SQL to_sql, read_sql, read_sql_query label Apr 16, 2025
