python-pandas and databases like mysql

Question 1

The documentation for Pandas has numerous examples of best practices for working with data stored in various formats.

However, I am unable to find any good examples for working with databases like MySQL for example.

Can anyone point me to links or give some code snippets of how to convert query results using mysql-python to data frames in Pandas efficiently ?

Question 2

Documentation: pandas.pydata.org/pandas-docs/stable/io.html#sql-queries

Question 3

See also: stackoverflow.com/questions/15231646/…

Question 4

Also take a look at Blaze.

Question 5

If you're willing to spend money, I believe that Wes McKinney's book ("Python for Data Analysis") has some useful examples.

Question 6

As Wes says, io/sql's read_sql will do it, once you've gotten a database connection using a DBI compatible library. We can look at two short examples using the MySQLdb and cx_Oracle libraries to connect to Oracle and MySQL and query their data dictionaries. Here is the example for cx_Oracle:

import pandas as pd
import cx_Oracle
ora_conn = cx_Oracle.connect('your_connection_string')
df_ora = pd.read_sql('select * from user_objects', con=ora_conn) 
print 'loaded dataframe from Oracle. # Records: ', len(df_ora)
ora_conn.close()

And here is the equivalent example for MySQLdb:

import MySQLdb
mysql_cn= MySQLdb.connect(host='myhost', 
 port=3306,user='myusername', passwd='mypassword', 
 db='information_schema')
df_mysql = pd.read_sql('select * from VIEWS;', con=mysql_cn) 
print 'loaded dataframe from MySQL. records:', len(df_mysql)
mysql_cn.close()

Question 7

For recent readers of this question: pandas have the following warning in their docs for version 14.0:

Warning: Some of the existing functions or function aliases have been deprecated and will be removed in future versions. This includes: tquery, uquery, read_frame, frame_query, write_frame.

And:

Warning: The support for the ‘mysql’ flavor when using DBAPI connection objects has been deprecated. MySQL will be further supported with SQLAlchemy engines (GH6900).

This makes many of the answers here outdated. You should use sqlalchemy:

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('dialect://user:pass@host:port/schema', echo=False)
f = pd.read_sql_query('SELECT * FROM mytable', engine, index_col = 'ID')

Question 8

loading a table with 133 rows and 7 columns takes around 30 secs.. can you give some insights regarding why is that?

Question 9

@idoda [in general this is not the question's topic and it's better to ask a new question so you'd get more opinions]. Are you sure this is not a matter of request delay? Is simply sending the query and retrieving the results significantly faster?

Question 10

@Korem I did thought about opening a new one, but I wanted to make sure it is not a trivial one first. When I use an mySql client (Sequel pro) and query the database, reuslts come up much faster. When you say "simply sending and then retrieving", is that what you mean? (using a client)

Question 11

@idoda I mean comparing the time it takes to execute engine.execute("select * FROM mytable") with the time it takes to execute pd.read_sql_query('SELECT * FROM mytable', engine)

Question 12

Can one pass a sqlalchemy query (session.query as in my answer below) directly to a pandas method? That would be a ripper!

Question 13

For the record, here is an example using a sqlite database:

import pandas as pd
import sqlite3
with sqlite3.connect("whatever.sqlite") as con:
 sql = "SELECT * FROM table_name"
 df = pd.read_sql_query(sql, con)
 print df.shape

Question 14

You can specify the column to use as an index by specifying index_col='timestamp' in frame_query.

Question 15

I prefer to create queries with SQLAlchemy, and then make a DataFrame from it. SQLAlchemy makes it easier to combine SQL conditions Pythonically if you intend to mix and match things over and over.

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Table
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from pandas import DataFrame
import datetime
# We are connecting to an existing service
engine = create_engine('dialect://user:pwd@host:port/db', echo=False)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()
# And we want to query an existing table
tablename = Table('tablename', 
 Base.metadata, 
 autoload=True, 
 autoload_with=engine, 
 schema='ownername')
# These are the "Where" parameters, but I could as easily 
# create joins and limit results
us = tablename.c.country_code.in_(['US','MX'])
dc = tablename.c.locn_name.like('%DC%')
dt = tablename.c.arr_date >= datetime.date.today() # Give me convenience or...
q = session.query(tablename).\
 filter(us & dc & dt) # That's where the magic happens!!!
def querydb(query):
 """
 Function to execute query and return DataFrame.
 """
 df = DataFrame(query.all());
 df.columns = [x['name'] for x in query.column_descriptions]
 return df
querydb(q)

Question 16

Also you must specify the driver if it's not the same as the SQLAlchemy's default one: dialect+driver://user:pwd@host:port/db

Question 17

MySQL example:

import MySQLdb as db
from pandas import DataFrame
from pandas.io.sql import frame_query
database = db.connect('localhost','username','password','database')
data = frame_query("SELECT * FROM data", database)

Question 18

frame_query is now deprecated. Now use pd.read_sql(query, db) instead.

Question 19

The same syntax works for Ms SQL server using podbc also.

import pyodbc
import pandas.io.sql as psql
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=servername;DATABASE=mydb;UID=username;PWD=password') 
cursor = cnxn.cursor()
sql = ("""select * from mytable""")
df = psql.frame_query(sql, cnxn)
cnxn.close()

Question 20

And this is how you connect to PostgreSQL using psycopg2 driver (install with "apt-get install python-psycopg2" if you're on Debian Linux derivative OS).

import pandas.io.sql as psql
import psycopg2
conn = psycopg2.connect("dbname='datawarehouse' user='user1' host='localhost' password='uberdba'")
q = """select month_idx, sum(payment) from bi_some_table"""
df3 = psql.frame_query(q, conn)

Question 21

For Sybase the following works (with http://python-sybase.sourceforge.net)

import pandas.io.sql as psql
import Sybase
df = psql.frame_query("<Query>", con=Sybase.connect("<dsn>", "<user>", "<pwd>"))

Question 22

pandas.io.sql.frame_query is deprecated. Use pandas.read_sql instead.

Question 23

import the module

import pandas as pd
import oursql

connect

conn=oursql.connect(host="localhost",user="me",passwd="mypassword",db="classicmodels")
sql="Select customerName, city,country from customers order by customerName,country,city"
df_mysql = pd.read_sql(sql,conn)
print df_mysql

That works just fine and using pandas.io.sql frame_works (with the deprecation warning). Database used is the sample database from mysql tutorial.

Question 24

This should work just fine.

import MySQLdb as mdb
import pandas as pd
con = mdb.connect(‘127.0.0.1’, ‘root’, ‘password’, ‘database_name’);
with con:
 cur = con.cursor()
 cur.execute("select random_number_one, random_number_two, random_number_three from randomness.a_random_table")
 rows = cur.fetchall()
 df = pd.DataFrame( [[ij for ij in i] for i in rows] )
 df.rename(columns={0: ‘Random Number One’, 1: ‘Random Number Two’, 2: ‘Random Number Three’}, inplace=True);
 print(df.head(20))

Question 25

This helped for me for connecting to AWS MYSQL(RDS) from python 3.x based lambda function and loading into a pandas DataFrame

import json
import boto3
import pymysql
import pandas as pd
user = 'username'
password = 'XXXXXXX'
client = boto3.client('rds')
def lambda_handler(event, context):
 conn = pymysql.connect(host='xxx.xxxxus-west-2.rds.amazonaws.com', port=3306, user=user, passwd=password, db='database name', connect_timeout=5)
 df= pd.read_sql('select * from TableName limit 10',con=conn)
 print(df)
 # TODO implement
 #return {
 # 'statusCode': 200,
 # 'df': df
 #}

Question 26

For Postgres users

import psycopg2
import pandas as pd
conn = psycopg2.connect("database='datawarehouse' user='user1' host='localhost' password='uberdba'")
customers = 'select * from customers'
customers_df = pd.read_sql(customers,conn)
customers_df

Question 27

Could you point out the difference to the answer of @Will and why your solution should be chosen?

Question 28

Using mysql.connector, you could write something like this:

import mysql.connector
import pandas as pd
# Database credentials
DB_HOST = 'host_ip'
DB_NAME = 'db_name or schema'
DB_USER = 'user_name'
DB_PASS = 'password'
try:
 # Connect to the database
 conn = mysql.connector.connect(
 host=DB_HOST,
 database=DB_NAME,
 user=DB_USER,
 password=DB_PASS
 )
 # Create a cursor object to execute SQL queries
 cursor = conn.cursor()
 # Example query
 query = "SELECT * FROM your_table"
 # Execute the query
 cursor.execute(query)
 # Fetch all the rows
 rows = cursor.fetchall()
 # Get column names
 column_names = [desc[0] for desc in cursor.description]
 # Create a DataFrame from the fetched rows and column names
 df = pd.DataFrame(rows, columns=column_names)
 # Process or analyze the DataFrame as needed
 print(df)
 # Close the cursor and connection
 cursor.close()
 conn.close()
except mysql.connector.Error as error:
 print(f"Failed to connect to MySQL: {error}")

Question 29

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.

Keith C Campbell 1,0461 gold badge7 silver badges3 bronze badges · Accepted Answer · 2012-06-21 12:34:29Z

As Wes says, io/sql's read_sql will do it, once you've gotten a database connection using a DBI compatible library. We can look at two short examples using the MySQLdb and cx_Oracle libraries to connect to Oracle and MySQL and query their data dictionaries. Here is the example for cx_Oracle:

import pandas as pd
import cx_Oracle
ora_conn = cx_Oracle.connect('your_connection_string')
df_ora = pd.read_sql('select * from user_objects', con=ora_conn) 
print 'loaded dataframe from Oracle. # Records: ', len(df_ora)
ora_conn.close()

And here is the equivalent example for MySQLdb:

import MySQLdb
mysql_cn= MySQLdb.connect(host='myhost', 
 port=3306,user='myusername', passwd='mypassword', 
 db='information_schema')
df_mysql = pd.read_sql('select * from VIEWS;', con=mysql_cn) 
print 'loaded dataframe from MySQL. records:', len(df_mysql)
mysql_cn.close()

CollectivesTM on Stack Overflow

python-pandas and databases like mysql

14 Answers 14

Comments

5 Comments

1 Comment

1 Comment

1 Comment

Comments

Comments

Comments

Comments

import the module

connect

Comments

Comments

Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

14 Answers 14

Comments

5 Comments

1 Comment

1 Comment

1 Comment

Comments

Comments

Comments

Comments

import the module

connect

Comments

Comments

Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related