I am trying to download the CAGED files from an FTP server, but the script fails with the message "Corrupt input data", which suggests the files are corrupted. I am unsure whether the problem is in my code or in the FTP server that the script downloads the data from.
My question: has anyone had a similar problem with an FTP server?
This is the FTP link that I'm trying to access: ftp://ftp.mtps.gov.br/pdet/microdados/
The "dicionario_dados.xlsx" file used by the script is on the same FTP server, at: ftp://ftp.mtps.gov.br/pdet/microdados/NOVO%20CAGED/Legado/Movimenta%C3%A7%C3%B5es/Layout%20Novo%20Caged%20Movimenta%C3%A7%C3%A3o.xlsx
from os import remove
from py7zr import SevenZipFile
import pandas as pd
import wget
import numpy as np
import warnings

# Build one {code: label} lookup per sheet of the data dictionary
excel = pd.ExcelFile("data/dicionario_dados.xlsx")
get_dict = lambda x: pd.read_excel(excel, sheet_name=x)
data_dict = {
    sheet: {row[1]: row[2] for row in
            get_dict(sheet).itertuples()}
    for sheet in excel.sheet_names[1:]
}
url = lambda year, month: f"ftp://ftp.mtps.gov.br/pdet/microdados/NOVO CAGED/{year}/{year}{month:02d}/CAGEDMOV{year}{month:02d}.7z"
dfs = []
start_year = 2020
start_month = 4
dates = []
#dates = data["competênciamov"].unique()
for year in range(start_year, 2025):
    for month in range(start_month, 13):
        # Skip months that were already loaded
        if f"{year}-{month:02d}-01" in dates:
            continue
        try:
            print(f"{month:02d}/{year}")
            wget.download(url(year, month), 'caged.7z')
            archive = SevenZipFile('caged.7z', mode = 'r')
            print('Microdata downloaded successfully, ready for reading')
            for name, fd in archive.read(name for name in archive.getnames() if name.endswith(".txt")).items():
                caged_raw = pd.read_csv(fd, delimiter=";", decimal=",")
                # Keep only the rows for the state with code 25
                caged_raw = caged_raw.loc[caged_raw["uf"] == 25, :].reset_index(drop=True)
                # Keep the raw code in "<col>_cod" and replace the column with its label
                for col in caged_raw.columns:
                    if col in data_dict:
                        caged_raw[f"{col}_cod"] = caged_raw[col]
                        caged_raw[col] = caged_raw[col].apply(lambda x: data_dict[col][x]
                                                              if x in data_dict[col] else np.nan)
                dfs.append(caged_raw)
            archive.close()
            remove('caged.7z')
            print('Reading completed successfully')
        except Exception as e:
            print(f'Error processing {month:02d}/{year}: {e}')
            print('Microdata for the selected month is not yet available')
            # Leaves the month loop, so the next attempt is start_month of the following year
            break
Output of the code:
04/2020
Microdata downloaded successfully, ready for reading
Error processing 04/2020: Corrupt input data
Microdata for the selected month is not yet available
04/2021
Microdata downloaded successfully, ready for reading
Error processing 04/2021: Corrupt input data
Microdata for the selected month is not yet available
04/2022
Microdata downloaded successfully, ready for reading
Error processing 04/2022: Corrupt input data
Microdata for the selected month is not yet available
04/2023
Microdata downloaded successfully, ready for reading
Error processing 04/2023: Corrupt input data
Microdata for the selected month is not yet available
04/2024
Microdata downloaded successfully, ready for reading
Error processing 04/2024: Corrupt input data
Microdata for the selected month is not yet available
I tried modifying the code and checking the files on the FTP server.
- Try the individual steps manually: download, uncompress, open the file. Do you get the same files and the same result? – VPfB, Sep 28, 2024 at 11:35
- Do you get the problem only with your complicated code with four nested loops? Or also when downloading a single file? If the latter, we need a minimal reproducible example, not your complete program. – Martin Prikryl, Sep 28, 2024 at 16:24
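One way to follow the advice in the comments is to separate the download step from the decompression step and inspect the downloaded file on its own. The sketch below is only an illustration under several assumptions: it uses the standard-library `ftplib` instead of `wget`, assumes the server accepts anonymous logins and supports the `SIZE` command, and the helper names `looks_like_7z`, `download_binary` and `check_month` are made up for this example. It does not decompress anything; it only checks that the transfer was complete and that the file starts with the 7z signature.

```python
from ftplib import FTP
from pathlib import Path

# Every 7z archive begins with these six signature bytes.
SEVEN_ZIP_MAGIC = b"7z\xbc\xaf\x27\x1c"


def looks_like_7z(path):
    """Return True if the file at `path` starts with the 7z signature."""
    with open(path, "rb") as fh:
        return fh.read(len(SEVEN_ZIP_MAGIC)) == SEVEN_ZIP_MAGIC


def download_binary(host, remote_path, local_path):
    """Download one file in binary mode; return (remote_size, local_size).

    Assumes anonymous login works and that the server answers SIZE;
    remote_size is None if the server does not report a size.
    """
    with FTP(host) as ftp:
        ftp.login()              # anonymous login (assumption)
        ftp.voidcmd("TYPE I")    # switch to binary mode before asking for SIZE
        remote_size = ftp.size(remote_path)
        with open(local_path, "wb") as out:
            ftp.retrbinary(f"RETR {remote_path}", out.write)
    return remote_size, Path(local_path).stat().st_size


def check_month(year, month, host="ftp.mtps.gov.br"):
    """Download a single month and report basic sanity checks (hypothetical helper)."""
    # Same path layout as the url lambda in the question.
    remote = f"/pdet/microdados/NOVO CAGED/{year}/{year}{month:02d}/CAGEDMOV{year}{month:02d}.7z"
    local = f"CAGEDMOV{year}{month:02d}.7z"
    remote_size, local_size = download_binary(host, remote, local)
    print(f"remote size: {remote_size}, local size: {local_size}")
    print(f"starts with 7z signature: {looks_like_7z(local)}")
```

Calling `check_month(2020, 4)` for a single file gives a much smaller reproducible case than the full loop. If the sizes match and the signature is present, the download itself is probably intact and the decompression step becomes the more likely place to look; if they do not, the transfer is the first suspect.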