5
\$\begingroup\$

The idea is that I have a subset of the movements of material of a factory (over 700k rows), and I want to separate these movements by the reference of each product to plot the stock of each product over time.

But the dataframe has repeated days, and the stock at the end of every month (STOCK FINAL) is in another dataframe (inventario_df). I get the final_stock from the other dataframe because I wanted to see if the data from the two dataframes are the same (in the 1st output it is not the same, in the 2nd it is, so now I know that the company has some kind of problem with the reference of the 1st output).

example data:

MvtosMaterial.xlsx:

Referencia Fecha Stock
13017094 20170507 23857.00
13037094 20170507 10733.00
13127094 20170101 10.00
13127094 20170101 20.00
13127094 20170101 0.00
14017094 20170507 9395.00
14047094 20170507 88847.00
15017094 20170507 10157.00

Inventario.xlsx:

AÑO_INVENTARIO MES_INVENTARIO REFERENCIA STOCK FINAL
2017 1 13017094 17644
2017 1 13037094 5948
2017 1 13127094 10558
2017 1 14017094 15502
2017 1 14047094 44453
2017 1 15017094 10214
2017 5 13017094 17962
2017 5 13037094 3588
2017 5 14017094 13970
2017 5 14047094 88066
2017 5 15017094 8830

And the intended output is the groups separate by the reference, summing the quantities of stock by day, and the last day of each month with a final stock. This value is taken from the other dataframe (inventory_df), so one of the output groups would be:

 Referencia Fecha Stock Stock_final
101673 1304790 2015年01月05日 135746.0 0.0
101745 1304790 2015年01月07日 129353.0 0.0
101834 1304790 2015年01月08日 182706.0 0.0
...
103123 1304790 2015年01月29日 140634.0 0.0
103213 1304790 2015年01月30日 194103.0 0.0
103226 1304790 2015年01月31日 193383.0 253039.0
103306 1304790 2015年02月02日 187932.0 0.0
103382 1304790 2015年02月03日 182183.0 0.0
...

and another group:

 Referencia Fecha Stock Stock_final
51804 13017287 2016年05月26日 2000.0 2000.0
51805 13017287 2016年06月23日 1764.0 1764.0
51806 13017287 2016年10月19日 1586.0 0.0
51807 13017287 2016年10月24日 1200.0 1200.0
51808 13017287 2017年01月19日 1079.0 0.0
51809 13017287 2017年01月20日 938.0 0.0
...

code:

import pandas as pd
path = '../Data/'
mvtos_material_df = pd.read_excel(path + 'MvtosMaterial.xlsx')
inventario_df = pd.read_excel(path + 'Inventario.xlsx')
mvtos_material_df['Fecha'] = pd.to_datetime(mvtos_material_df['Fecha'], format='%Y%m%d')
inventario_df = inventario_df.rename(index = str, columns = {'MES_INVENTARIO': 'Month', 'AÑO_INVENTARIO': 'Year'})
inventario_df['Fecha'] = pd.to_datetime(inventario_df[['Year', 'Month']].assign(Day=1), format='%Y%m%d')
inventario_df.drop(['Month', 'Year'], axis = 1, inplace = True)
mvtos_dict = {}
stock_finales = {}
mvtos_material_df['Stock_final'] = 0.0
for name, group in mvtos_material_df.groupby('Referencia'): 
 mvtos_dict[str(name)] = pd.DataFrame()
 # for loop to fill moves_dict with the orders per day
 for k, v in group.groupby(pd.Grouper(key='Fecha', freq='M')):
 if not v.empty:
 # Comparing dates in inventory and movements to get the final stock
 a = (inventario_df[inventario_df['REFERENCIA'].values == v.tail(1)['Referencia'].values])
 a = (a[a['Fecha'].dt.month.values == v.tail(1)['Fecha'].dt.month.values])
 a = (a[a['Fecha'].dt.year.values == v.tail(1)['Fecha'].dt.year.values]) 
 # sum all days together and create the ones which are missing
 for i, j in v.groupby(pd.Grouper(key='Fecha', freq='D')): 
 temp = j.tail(1).copy() # We need to get the last row 
 mvtos_dict[str(name)] = mvtos_dict[str(name)].append(j.tail(1)) 
 if not a.empty:
 # Drop last row (last day of the month) and add the one with the final stock
 mvtos_dict[str(name)].drop(mvtos_dict[str(name)].index[len(mvtos_dict[str(name)])-1], inplace = True) 
 temp['Stock_final'] = float(a['STOCK FINAL']) 
 mvtos_dict[str(name)] = mvtos_dict[str(name)].append(temp)

In the code I have now, I have separated the groups in a dictionary so that I can access them later. Although I mainly used the dictionary because I didn't know how to do all this in the same dataframe.

On the second group, the first 2 rows are the last days of each month because they didn't have any other move between on mvtos_material_df.

Also, for plotting the graphs I use this code:

for ke, v in mvtos_dict.items():
 x = v['Fecha']
 y1 = v['Stock']
 y2 = v['Stock_final'].replace(0, np.nan)
 # I had to use this to fix a problem with the axis
 # but now seems is not needed
 #pd.plotting.deregister_matplotlib_converters()
 fig, ax = plt.subplots(figsize=(18, 8))
 ax.plot(x, y1, c='red',alpha=0.6, label='Movimientos de material')
 ax.scatter(x = x, y = y2, alpha=1, marker='x', label='Stock Final')
 ax.xaxis.grid(True)
 myFmt = mdates.DateFormatter("%Y-%m-%d")
 start, end = ax.get_xlim()
 ax.xaxis.set_ticks(np.arange(start, end, 30))
 ax.xaxis.set_major_formatter(myFmt)
 ## Rotate date labels automatically
 fig.autofmt_xdate()
 plt.legend(loc='best')
 ax.set_title(f'Gráfica del componente {ke}')
 plt.show()
200_success
145k22 gold badges190 silver badges478 bronze badges
asked Dec 17, 2018 at 7:41
\$\endgroup\$
8
  • 1
    \$\begingroup\$ Please provide a better explanation of what this code accomplishes, including example inputs and the intended output. \$\endgroup\$ Commented Dec 17, 2018 at 17:37
  • \$\begingroup\$ @200_success done, ty for taking the time to check it \$\endgroup\$ Commented Dec 18, 2018 at 8:01
  • \$\begingroup\$ I think what would help your question greatly is a better title. You should change your title to state what the code achieves (group references by something) and just state your current code with example data. In addition, asking about whether or not code is correct or for specific changes to the code are off-topic here. However, asking for general improvements is on-topic. Any reviewer will most likely give you a more vectorized version anyways (but they might also comment on other things). \$\endgroup\$ Commented Dec 18, 2018 at 10:01
  • \$\begingroup\$ @Graipher ty for the input, I'll try to rework the post to include all the code, the example data is better to put it here or with a link to a gist? Because I was thinking that I could share a big chunk of data instead of only 10 rows but here I think it will be too much \$\endgroup\$ Commented Dec 18, 2018 at 10:59
  • \$\begingroup\$ @set92 A few rows is usually enough (as long as all relevant parts are there, e.g. at least two groups and two entries per group/enough for the function to work. \$\endgroup\$ Commented Dec 18, 2018 at 11:08

1 Answer 1

4
\$\begingroup\$

DataFrame.append has been deprecated and is no longer available, so your code does not run.

If you use a pathlib.Path and make it a function parameter, you can use the / operator instead of string concatenation + to get your spreadsheet path.

Pandas pays attention to your locale, and generally you should set it explicitly at the beginning. My guess is es_ES.

Add functions with well-defined inputs and outputs, and don't leave any code in the global namespace.

A minimal refactor that addresses these issues and gets your code to run can look like

import locale
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.dates
from matplotlib import pyplot as plt
def load_materials(path: Path = Path.cwd(), name: str = 'MvtosMaterial.xlsx') -> pd.DataFrame:
 return pd.read_excel(path / name)
def load_inventory(path: Path = Path.cwd(), name: str = 'Inventario.xlsx') -> pd.DataFrame:
 return pd.read_excel(path / name)
def process(
 mvtos_material_df: pd.DataFrame,
 inventario_df: pd.DataFrame,
) -> dict[str, pd.DataFrame]:
 mvtos_material_df['Fecha'] = pd.to_datetime(mvtos_material_df['Fecha'], format='%Y%m%d')
 inventario_df = inventario_df.rename(index = str, columns = {'MES_INVENTARIO': 'Month', 'AÑO_INVENTARIO': 'Year'})
 inventario_df['Fecha'] = pd.to_datetime(inventario_df[['Year', 'Month']].assign(Day=1), format='%Y%m%d')
 inventario_df.drop(['Month', 'Year'], axis = 1, inplace = True)
 mvtos_dict = {}
 mvtos_material_df['Stock_final'] = 0.0
 for name, group in mvtos_material_df.groupby('Referencia'):
 mvtos_dict[str(name)] = pd.DataFrame()
 # for loop to fill moves_dict with the orders per day
 for k, v in group.groupby(pd.Grouper(key='Fecha', freq='M')):
 if not v.empty:
 # Comparing dates in inventory and movements to get the final stock
 a = (inventario_df[inventario_df['REFERENCIA'].values == v.tail(1)['Referencia'].values])
 a = (a[a['Fecha'].dt.month.values == v.tail(1)['Fecha'].dt.month.values])
 a = (a[a['Fecha'].dt.year.values == v.tail(1)['Fecha'].dt.year.values])
 # sum all days together and create the ones which are missing
 for i, j in v.groupby(pd.Grouper(key='Fecha', freq='D')):
 temp = j.tail(1).copy() # We need to get the last row
 mvtos_dict[str(name)] = pd.concat(
 (mvtos_dict[str(name)], j.tail(1)),
 axis='rows',
 )
 if not a.empty:
 # Drop last row (last day of the month) and add the one with the final stock
 mvtos_dict[str(name)].drop(mvtos_dict[str(name)].index[len(mvtos_dict[str(name)])-1], inplace = True)
 temp['Stock_final'] = float(a['STOCK FINAL'])
 mvtos_dict[str(name)] = pd.concat(
 (mvtos_dict[str(name)], temp)
 )
 return mvtos_dict
def plot(mvtos_dict: dict) -> None:
 for ke, v in mvtos_dict.items():
 x = v['Fecha']
 y1 = v['Stock']
 y2 = v['Stock_final'].replace(0, np.nan)
 # I had to use this to fix a problem with the axis
 # but now seems is not needed
 # pd.plotting.deregister_matplotlib_converters()
 fig, ax = plt.subplots(figsize=(18, 8))
 ax.plot(x, y1, c='red', alpha=0.6, label='Movimientos de material')
 ax.scatter(x=x, y=y2, alpha=1, marker='x', label='Stock Final')
 ax.xaxis.grid(True)
 myFmt = matplotlib.dates.DateFormatter("%Y-%m-%d")
 start, end = ax.get_xlim()
 ax.xaxis.set_ticks(np.arange(start, end, 30))
 ax.xaxis.set_major_formatter(myFmt)
 ## Rotate date labels automatically
 fig.autofmt_xdate()
 plt.legend(loc='best')
 ax.set_title(f'Gráfica del componente {ke}')
def main() -> None:
 # This is only a guess; there may be a more accurate locale in context.
 # Pandas pays attention to this (even if there is no effect for this particular dataset).
 locale.setlocale(category=locale.LC_ALL, locale='es_ES')
 # OP uses ../Data as path instead
 mvtos_material_df = load_materials()
 inventario_df = load_inventory()
 mvtos_dict = process(mvtos_material_df, inventario_df)
 plot(mvtos_dict)
 plt.show()
if __name__ == '__main__':
 main()

producing an mvtos_dict that looks like this:

for key, df in mvtos_dict.items():
 print(f'{key}:')
 print(df)
13017094:
 Referencia Fecha Stock Stock_final
0 13017094 2017年05月07日 23857 17962.0
13037094:
 Referencia Fecha Stock Stock_final
1 13037094 2017年05月07日 10733 3588.0
13127094:
 Referencia Fecha Stock Stock_final
4 13127094 2017年01月01日 0 10558.0
14017094:
 Referencia Fecha Stock Stock_final
5 14017094 2017年05月07日 9395 13970.0
14047094:
 Referencia Fecha Stock Stock_final
6 14047094 2017年05月07日 88847 88066.0
15017094:
 Referencia Fecha Stock Stock_final
7 15017094 2017年05月07日 10157 8830.0

That output does not match your reference output, because you showed reference output that uses inputs not contained in your reference input. As such, the rest of this review will be descriptive only since I can't verify it.

Rather than calling pd.to_datetime, use read_excel's parse_dates/date_format support.

The output format of a dictionary of dataframes of one row each is strange and awkward. Just construct one dataframe, potentially with additional index levels if needed.

freq='M' is deprecated and needs to be replaced probably with MS or ME.

Instead of v.tail(1) I have a minor preference for .iloc[-1] though they should be interchangeable.

answered Feb 17 at 16:06
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.