ENH: Functionality to aid with Database Imports #61247
Labels
Enhancement
IO SQL
to_sql, read_sql, read_sql_query
Needs Triage
Issue that has not been reviewed by a pandas team member
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
I would like the ability to have the following features to aid in Database import. 1) Unicode/Non Unicode identification for columns, 2) Max Length of each column including for inaccurate length when a multi value cell is converted is then saved by the dataframe[like list of states]. 3) Creation of the create table statements and supporting statements. 4) creation of the BCP file(tab delimited with some caveats), and its supporting FMT file and command line execution. 5) Replacement of certain chars in dataframe that prevent import (namely \r, \n, \t) in the data load step where it may be faster vs RegEx later. 6) Renaming of columns by stripping out certain characters or replacing them similar to the way that R has a rename all column function
I have code written that does most of this. For context I used dataframe as a key step for importing data into a sql server database at a rate of about 2.5 GB per hour with the philosophy of all columns as strings and provide conversion when in database as otherwise important leading zeroes could be dropped[like routing numbers or other custom indicators]. Also note that the methodology was to import into sql server using bcp which essentially is a tab delimited file. The code provided is not directly the code I used as I lost access to it. But it is a recreation and major refactoring making it simpler.
Feature Description
The below is working code that does 95% of what I requested above. So this request ultimately isn't for me, but the community.
Things requested from above that are omitted from the code below are:
Also, this code generates what appears to be acceptable output. But I haven't tested within an actual database import into SQL Server.
import pandas as pd
import numpy as np
import os,re,uuid
import math
from typing import List,Dict
rx_unicode_str:str="[^\x00-\x7F]"
rx_unicode=re.compile(rx_unicode_str,re.IGNORECASE)
rx_space_str:str=r"\s"
rx_space=re.compile(rx_space_str,re.IGNORECASE)
rx_underscore_str:str="_{2,}"
rx_underscore=re.compile(rx_underscore_str,re.IGNORECASE)
rx_strip_chars_str="!|@|#|$|%|^|&|*|(|)|{|}|[|]|.|||;|:|'|"|,|<|>|?|=+"
rx_strip_chars=re.compile(rx_strip_chars_str,re.IGNORECASE)
class column_info:
def init(self,arg_column_name):
self.guid:str=str(uuid.uuid4())
self.column_name:str=arg_column_name
self.column_name_orig:str=arg_column_name
self.max_length:int=0
self.max_length_fixed:int=0
self.has_unicode:bool=False
self.sql_max:bool=False
self.ColumnIndex_1Based:int=0
self.last_column:bool=False
def as_create_table(self):
datatype:str="NVARCHAR" if self.has_unicode==True else "VARCHAR"
comma:str="," if self.ColumnIndex_1Based>1 else ""
data_length:str="MAX" if self.sql_max==True else str(self.max_length_fixed)
return "{c}[{name}] {t}({l})".format(c=comma,name=self.column_name,t=datatype,l=data_length)
def as_fmt_file(self):
datatype:str="SQLNCHAR" if self.has_unicode==True else "SQLCHAR"
comma:str="," if self.ColumnIndex_1Based>1 else ""
data_length:str=str(self.max_length_fixed)
data_length:str="4000" if self.sql_max==True and self.has_unicode==True else data_length
data_length:str="8000" if self.sql_max==True and self.has_unicode==False else data_length
# 1 SQLINT 0 4 "\t" 1 "ID" ""
idx:str=str(self.ColumnIndex_1Based).ljust(6)
type:str=datatype.ljust(20)
datalen:str=data_length.ljust(10)
name=str(""" + self.column_name + """).ljust(75)
sep:str="\t" if self.last_column==False else "\r\n"
sep=str(""" + sep + """).ljust(8)
return "{idx}{type} 0 {datalen} {sep} {name} """.format(idx=idx,type=type,datalen=datalen,sep=sep,name=name)
# return "{c}[{name}] {t}({l})".format(c=comma,name=self.column_name,t=datatype,l=data_length)
class csv_info:
def init(self,arg_filename:str):
self.Database:str="myDB"
self.Server:str="myServer"
self.UserName:str="myUser"
self.Password:str="myPass"
self.filename:str=arg_filename
self.output_directory:str=""
self.bcp_filename:str=""
self.fmt_filename:str=""
self.table_name:str=""
self.parent_directory:str=""
self.filename_with_extension:str=""
self.filename_wo_extension:str=""
self.file_extension:str=""
self.parent_directory,self.filename_with_extension=os.path.split(arg_filename)
self.filename_wo_extension,self.file_extension=os.path.splitext(self.filename_with_extension)
self.table_name=self.fix_name(self.filename_wo_extension)
self.change_output_directory(self.parent_directory)
input_file:str=r"C:\data\LargeCSVFile\customers-2000000.csv"
cv=csv_info(input_file)
Alternative Solutions
see previous section for the alternative solution of custom written code.
Additional Context
No response
The text was updated successfully, but these errors were encountered: