q_2017_2019 = workingDir +”/2019ProductionUpdate5.sql”
q_2017_2019= open(q_2017_2019, ‘r’)
sql_2019Prod = ” “.join(q_2017_2019.readlines())
q_2017_2019.close()
df_2019Prod =pd.read_sql_query(sql_2019Prod,conn)
Python: Filter Data
idx = df_2017Prod_weight[‘YEAR’]==’2017′
df_2017Prod_weight_1=df_2017Prod_weight[idx]
Python: Find the Most Recent Year’s Data
df_2019Prod_weight[‘Year_max’]=df_2019Prod_weight.groupby([‘COMMODITY_DESC’, ‘CLASS_DESC’,’UNIT_DESC’])[‘YEAR’].transform(max)
idx = df_2019Prod_weight[‘Year_max’]==df_2019Prod_weight[‘YEAR’]
df_2019Prod_weight_1=df_2019Prod_weight[idx]
sql list all table and schema names
SELECT *
FROM INFORMATION_SCHEMA.TABLES
order by TABLE_TYPE, TABLE_SCHEMA
Find optimal number of clusters
Visualizing High Dimensional Clusters
https://www.kaggle.com/minc33/visualizing-high-dimensional-clusters#Introduction:
Python Frequently Used Commands
There are some Python commands I almost apply in each of my script to adjust the default settings, to set up directories.
pd.set_option(‘display.max_columns’, 1000)
pd.set_option(‘display.max_row’, 1000)
pd.set_option(‘display.width’, None)
pd.set_option(‘display.max_colwidth’, -1)
WorkingDir = r”C:\Users\jing\***\Competition”
DataDir = WorkingDir+”\\data”
Python Plot
Scatter plot:
import matplotlib.pyplot as plt
x_plot=Y[:,0]
y_plot=Y[:,1]
# y_plot=pd.DataFrame(y_plot)
plt.plot(y_plot, x_plot, 'o', color='black');
2nd example:
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go
from plotly.offline import *
trace3d = go.Scatter3d(
x = G_array[:,100],
y = G_array[:,101],
z = G_array[:,201],
mode = "markers",
name = "Cluster 0",
marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
text = None)
fig = dict(data = trace3d)
iplot(fig)
plotly.offline.plot(fig, filename='3d_G.html')
Other resources:
Research Data
Here are some public datasets I used:
U.S. Population:
Agricultural Data:
https://quickstats.nass.usda.gov/
It has yield, production, price, etc, agriculture related data.
Water Data:
Python Pandas Create a Dataframe
I usually create dataframes by reading data from csv, excel and sql. You can find some examples here:
python reads data from SQL directly; python executes sql query
Python Pandas read csv common problems
Python Pandas read excel files (xlsx)
But sometimes, you may want to create a small dataframe in the script. So here are some examples:
df_pop = pd.DataFrame({'Year':['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019'],
'Population':['309321666', '311556874', '313830990', '315993715', '318301008', '320635163', '322941311', '324985539', '326687501', '328239523']})
You can skip the quotes to have them as int.
df_pop = pd.DataFrame({'Year':[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
'Population':[309321666, 311556874, 313830990, 315993715, 318301008, 320635163, 322941311, 324985539, 326687501, 328239523]})
You can also convert an array to a dataframe:
Y_df=pd.DataFrame(Y)
Other Pandas related blogs: