Consolidate and clean-up Fetch CSV file interactions #282

@TimidRobot

Description

Problem

  1. Only the GCS fetch script benefits from CSV file initialization. The arXiv script shouldn't have copied this pattern. It should be removed:
  • def initialize_data_file(file_path, headers):
        """Initialize CSV file with headers if it doesn't exist."""
        if not os.path.isfile(file_path):
            with open(file_path, "w", encoding="utf-8", newline="\n") as file_obj:
                writer = csv.DictWriter(
                    file_obj, fieldnames=headers, dialect="unix"
                )
                writer.writeheader()
  2. The various fetch scripts duplicate a lot of code when they save their data:
    • def rows_to_csv(args, fieldnames, rows, file_path):
          if not args.enable_save:
              return args
          with open(file_path, "w", encoding="utf-8", newline="\n") as file_handle:
              writer = csv.DictWriter(
                  file_handle, fieldnames=fieldnames, dialect="unix"
              )
              writer.writeheader()
              for row in rows:
                  writer.writerow(row)
    • with open(file_path, "a", encoding="utf-8", newline="\n") as file_obj:
          writer = csv.DictWriter(
              file_obj, fieldnames=fieldnames, dialect="unix"
          )
          writer.writerow(row)
    • with open(FILE1_COUNT, "w", encoding="utf-8", newline="\n") as file_obj:
          writer = csv.DictWriter(
              file_obj, fieldnames=HEADER1_COUNT, dialect="unix"
          )
          writer.writeheader()
          for row in tool_data:
              writer.writerow(row)
    • def write_data(args, data):
          if not args.enable_save:
              return
          os.makedirs(PATHS["data_phase"], exist_ok=True)
          with open(FILE_PATH, "w", encoding="utf-8", newline="") as file_obj:
              writer = csv.DictWriter(
                  file_obj,
                  fieldnames=OPENVERSE_FIELDS,
                  dialect="unix",
              )
              writer.writeheader()
              for row in data:
                  writer.writerow(row)
    • def write_data(args, data_metrics, data_units):
          if not args.enable_save:
              return args
          # Create data directory for this phase
          os.makedirs(PATHS["data_phase"], exist_ok=True)
          with open(FILE_1_METRICS, "w", encoding="utf-8", newline="\n") as file_obj:
              writer = csv.DictWriter(
                  file_obj, fieldnames=HEADER_1_METRICS, dialect="unix"
              )
              writer.writeheader()
              for row in data_metrics:
                  writer.writerow(row)
          with open(FILE_2_UNITS, "w", encoding="utf-8", newline="\n") as file_obj:
              writer = csv.DictWriter(
                  file_obj, fieldnames=HEADER_2_UNITS, dialect="unix"
              )
              writer.writeheader()
              for row in data_units:
                  writer.writerow(row)
          return args
    • def write_data(args, tool_data):
          if not args.enable_save:
              return args
          LOGGER.info("Saving fetched data")
          os.makedirs(PATHS["data_phase"], exist_ok=True)
          with open(FILE_LANGUAGES, "w", encoding="utf-8", newline="\n") as file_obj:
              writer = csv.DictWriter(
                  file_obj, fieldnames=HEADER_LANGUAGES, dialect="unix"
              )
              writer.writeheader()
              for row in tool_data:
                  writer.writerow(row)
          return args

Description

  1. Add rows_to_csv() function to shared library (shared.py)
    • New function should check args.enable_save
    • New function should "Create data directory for this phase"
    • New function shouldn't return args
      • None of the current functions that return args modify args, so there's no reason to return it
    • The GCS fetch script only writes a single row, but it can pass a list containing a single row
    • Update fetch scripts to use new function
    • Test fetch scripts to verify they behave as intended
  2. Rename data_to_csv() function to dataframe_to_csv()
    • Update process scripts to use new name
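
As a rough illustration of item 1, the shared function might look something like the sketch below. It reuses the signature of the existing rows_to_csv() shown under Problem. Deriving the data directory from the parent of file_path is an assumption made for this sketch; the current scripts call os.makedirs() on PATHS["data_phase"] instead.

```python
import csv
import os


def rows_to_csv(args, fieldnames, rows, file_path):
    """Write rows (an iterable of dicts) to a CSV file with a header row."""
    if not args.enable_save:
        return
    # Create data directory for this phase
    # (assumption: it is the parent directory of file_path; the current
    # scripts use PATHS["data_phase"] for this)
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(file_path, "w", encoding="utf-8", newline="\n") as file_obj:
        writer = csv.DictWriter(
            file_obj, fieldnames=fieldnames, dialect="unix"
        )
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
```

A script holding a single row could call it as rows_to_csv(args, fieldnames, [row], file_path). Note that this sketch always opens the file in "w" mode, so each call overwrites the file. Whether the GCS script's current append ("a") behavior needs to be preserved is not settled here.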

Additional context
