I am trying to download the CAGED microdata files from an FTP server, but reading each downloaded archive fails with the error "Corrupt input data". I'm unsure whether the problem is in my code or in the FTP server that the script downloads the data from.

Has anyone had a similar problem with an FTP server?

This is the FTP link that I'm trying to access: ftp://ftp.mtps.gov.br/pdet/microdados/

The path to download the "dicionario_dados.xlsx" file on the FTP server is: ftp://ftp.mtps.gov.br/pdet/microdados/NOVO%20CAGED/Legado/Movimenta%C3%A7%C3%B5es/Layout%20Novo%20Caged%20Movimenta%C3%A7%C3%A3o.xlsx

from os import remove
from py7zr import SevenZipFile
import pandas as pd
import wget
import numpy as np
import warnings

excel = pd.ExcelFile("data/dicionario_dados.xlsx")
get_dict = lambda x: pd.read_excel(excel, sheet_name=x)
data_dict = {
    sheet: {row[1]: row[2] for row in
            get_dict(sheet).itertuples()}
    for sheet in excel.sheet_names[1:]
}
url = lambda year, month: f"ftp://ftp.mtps.gov.br/pdet/microdados/NOVO CAGED/{year}/{year}{month:02d}/CAGEDMOV{year}{month:02d}.7z"

dfs = []
start_year = 2020
start_month = 4
dates = []
#dates = data["competênciamov"].unique()
for year in range(start_year, 2025):
    for month in range(start_month, 13):
        if f"{year}-{month:02d}-01" in dates:
            continue
        try:
            print(f"{month:02d}/{year}")
            wget.download(url(year, month), 'caged.7z')
            archive = SevenZipFile('caged.7z', mode = 'r')
            print('Microdata downloaded successfully, ready for reading')
            for name, fd in archive.read(name for name in archive.getnames() if name.endswith(".txt")).items():
                caged_raw = pd.read_csv(fd, delimiter=";", decimal=",")
                caged_raw = caged_raw.loc[caged_raw["uf"] == 25, :].reset_index(drop=True)
                for col in caged_raw.columns:
                    if col in data_dict:
                        caged_raw[f"{col}_cod"] = caged_raw[col]
                        caged_raw[col] = caged_raw[col].apply(lambda x: data_dict[col][x]
                                                              if x in data_dict[col] else np.nan)
                dfs.append(caged_raw)
            archive.close()
            remove('caged.7z')
            print('Reading completed successfully')
        except Exception as e:
            print(f'Error processing {month:02d}/{year}: {e}')
            print('Microdata for the selected month is not yet available')
            break

Output of the code:

04/2020
Microdata downloaded successfully, ready for reading
Error processing 04/2020: Corrupt input data
Microdata for the selected month is not yet available
04/2021
Microdata downloaded successfully, ready for reading
Error processing 04/2021: Corrupt input data
Microdata for the selected month is not yet available
04/2022
Microdata downloaded successfully, ready for reading
Error processing 04/2022: Corrupt input data
Microdata for the selected month is not yet available
04/2023
Microdata downloaded successfully, ready for reading
Error processing 04/2023: Corrupt input data
Microdata for the selected month is not yet available
04/2024
Microdata downloaded successfully, ready for reading
Error processing 04/2024: Corrupt input data
Microdata for the selected month is not yet available

I tried modifying the code and checking the files on the FTP server.
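One way to narrow down whether the download or the reading step is at fault is to inspect the file that was saved to disk before opening it. The sketch below is only an illustration, not part of the original script: the helper names `looks_like_7z` and `describe_download` are hypothetical, and it assumes the archive was saved locally as `caged.7z`. It checks that the file starts with the six-byte 7z signature and reports its size. A matching signature does not prove the archive is intact (a truncated download would still pass), but a missing one suggests the server returned something other than a 7z file.

```python
from pathlib import Path

# The six-byte signature that every 7z archive starts with.
SEVENZ_MAGIC = b"7z\xbc\xaf\x27\x1c"


def looks_like_7z(path):
    """Return True if the file at `path` begins with the 7z signature."""
    with open(path, "rb") as fh:
        return fh.read(len(SEVENZ_MAGIC)) == SEVENZ_MAGIC


def describe_download(path):
    """Summarise the size and signature check for a downloaded file."""
    size = Path(path).stat().st_size
    status = "found" if looks_like_7z(path) else "missing"
    return f"{path}: {size} bytes, 7z signature {status}"


# Example usage (assumes 'caged.7z' was already downloaded):
# print(describe_download("caged.7z"))
```

Comparing the reported size with the size listed on the FTP server would additionally show whether the transfer was cut short.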

  • Try the individual steps manually: download, uncompress, open the file. Do you get the same files and the same result? Commented Sep 28, 2024 at 11:35
  • Do you get the problem only with your complicated code with four nested loops? Or also when downloading a single file? If the latter, we need a minimal reproducible example, not your complete program. Commented Sep 28, 2024 at 16:24
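Following the comments, a single-file test could look like the sketch below. It is a minimal example under stated assumptions, not a confirmed fix: the helper names `caged_url` and `download_one` are hypothetical, the standard library's `urlretrieve` is swapped in for `wget.download`, and the space in "NOVO CAGED" is percent-encoded the same way as in the dictionary link above. Whether the server behaves differently with this client has not been verified.

```python
from urllib.parse import quote
from urllib.request import urlretrieve

# Base path taken from the question.
BASE = "ftp://ftp.mtps.gov.br/pdet/microdados"


def caged_url(year, month):
    """Build the URL of one monthly archive, percent-encoding the folder name."""
    stamp = f"{year}{month:02d}"
    folder = quote("NOVO CAGED")  # becomes 'NOVO%20CAGED'
    return f"{BASE}/{folder}/{year}/{stamp}/CAGEDMOV{stamp}.7z"


def download_one(year, month, target="caged.7z"):
    """Fetch a single archive over FTP and return the local path."""
    path, _headers = urlretrieve(caged_url(year, month), target)
    return path


# Example usage (requires network access):
# print(download_one(2020, 4))
```

If this one file opens correctly in a desktop 7z tool but still fails in the script, the problem is more likely in the reading step than on the server.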
