
Summary

I'm trying to save data from CSV files to a SQLite database using SQLAlchemy in a Jupyter Notebook. When trying to flush or merge the session data for the junction tables, I receive an IntegrityError saying that a row is duplicated. I don't believe this is true: I checked the data source and found no duplicates, and I checked each row before adding it and also found none. So I am confused about why this error is happening.

Problem

First, the CSVs are imported into dataframes like:

import numpy as np
import pandas as pd

recipe_df = pd.read_csv("csv/recipes.csv").replace({np.nan: None})
material_df = pd.read_csv("csv/materials.csv").replace({np.nan: None})
crystal_df = pd.read_csv("csv/crystals.csv").replace({np.nan: None})
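
The snippets below assume a standard declarative setup roughly like this (a sketch; the database path and session names are illustrative, not my exact configuration):

from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()
engine = create_engine("sqlite:///crafting.db")  # illustrative path
Session = sessionmaker(bind=engine)
session = Session()
# Base.metadata.create_all(engine) runs after the classes below are defined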

and the ORM table classes look like:

from sqlalchemy import Boolean, Column, ForeignKey, Integer, String
from sqlalchemy.orm import relationship

class Recipe(Base):
    __tablename__ = "recipe"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String)

    # ...
    # Relationships
    materials = relationship("Material", secondary="recipe_to_material", back_populates="recipes")
    crystals = relationship("Crystal", secondary="recipe_to_crystal", back_populates="recipes")

class Material(Base):
    __tablename__ = "material"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String)
    icon = Column(String)
    is_recipe = Column(Boolean)
    # Relationships
    recipes = relationship("Recipe", secondary="recipe_to_material", back_populates="materials")

class Crystal(Base):
    __tablename__ = "crystal"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String)
    icon = Column(String)
    # Relationships
    recipes = relationship("Recipe", secondary="recipe_to_crystal", back_populates="crystals")

class RecipeToMaterial(Base):
    __tablename__ = "recipe_to_material"
    recipe_id = Column(Integer, ForeignKey("recipe.id"), primary_key=True)
    material_id = Column(Integer, ForeignKey("material.id"), primary_key=True)
    amount = Column(Integer)

class RecipeToCrystal(Base):
    __tablename__ = "recipe_to_crystal"
    recipe_id = Column(Integer, ForeignKey("recipe.id"), primary_key=True)
    crystal_id = Column(Integer, ForeignKey("crystal.id"), primary_key=True)
    amount = Column(Integer)

class MaterialHasRecipe(Base):
    __tablename__ = "material_has_recipe"
    material_id = Column(Integer, ForeignKey("material.id"), primary_key=True)
    recipe_id = Column(Integer, ForeignKey("recipe.id"), primary_key=True)

There is no trouble exporting to the recipe, material, and crystal tables; trouble arises, however, when doing the same for the junction tables. Essentially, recipe_df contains the data for the recipe table alongside the names of the materials and crystals each recipe requires. The junction-table rows should be built by taking the id of the recipe and the id of the material/crystal whose name matches each column of that recipe, like:

def add_recipe_crystals():
    recipe_ids: list[int] = []
    crystal_ids: list[int] = []
    amounts: list[int] = []
    # Get all recipe and crystal ids
    for _, row in recipe_df.iterrows():
        recipe_id = session.query(Recipe.id).filter(Recipe.name == row["name"]).first()[0]
        for i in range(1, 3):
            crystal_name = row[f"crys_{i}_name"]
            crystal_id = session.query(Crystal.id).filter(Crystal.name == crystal_name).first()
            # If crystal i doesn't exist, just skip this one
            if crystal_id is None:
                continue
            crystal_id = crystal_id[0]

            recipe_ids.append(recipe_id)
            crystal_ids.append(crystal_id)
            amounts.append(int(row[f"crys_{i}_count"]))
    recipe_to_crystal_df = pd.DataFrame({
        "recipe_id": recipe_ids,
        "crystal_id": crystal_ids,
        "amount": amounts,
    })
    # Create RecipeToCrystal instances
    for i, row in recipe_to_crystal_df.iterrows():
        rec_2_crys = RecipeToCrystal(
            recipe_id=row["recipe_id"],
            crystal_id=row["crystal_id"],
            amount=row["amount"]
        )
        session.add(rec_2_crys)

This appears to work perfectly fine until it comes time to flush or merge the session. When that happens, I get an error like:

IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: recipe_to_crystal.recipe_id, recipe_to_crystal.crystal_id
[SQL: INSERT INTO recipe_to_crystal (recipe_id, crystal_id, amount) VALUES (?, ?, ?)]
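
To see which (recipe_id, crystal_id) pairs a flush like this complains about, the staged data can be checked for duplicate keys before inserting. A minimal sketch with pandas, assuming recipe_to_crystal_df is made accessible outside add_recipe_crystals() (e.g. returned from it); this was not part of my original code:

# Sketch: print staged (recipe_id, crystal_id) pairs that occur more than once
dupes = recipe_to_crystal_df[
    recipe_to_crystal_df.duplicated(subset=["recipe_id", "crystal_id"], keep=False)
]
print(dupes.sort_values(["recipe_id", "crystal_id"]))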

Attempted Fixes

Thinking this might be an autoflush problem, I wrapped the functions that populate the junction tables in a function add_junction_tables() and wrapped that in a session.no_autoflush context manager like:

with session.no_autoflush:
    add_junction_tables()
    session.commit()

I also checked recipe_df for rows where the crystals had the same name like:

count = 0
for i, row in recipe_df.iterrows():
    crys_1 = row["crys_1_name"]
    crys_2 = row["crys_2_name"]
    if crys_2 is None:
        continue
    if crys_1 in crys_2:
        count += 1
        print(f"In recipe {row['name']}, the crystals {crys_1} and {crys_2} are the same")
if count == 0:
    print("No duplicates found")

This printed "No duplicates found".

I also made a checking function like:

def recipe_crystal_exists(rec_id: int, crys_id: int):
    existing_entry = session.query(RecipeToCrystal).filter(
        RecipeToCrystal.recipe_id == rec_id,
        RecipeToCrystal.crystal_id == crys_id
    ).first()
    if existing_entry:
        return True

    return False

and added a check in add_recipe_crystals() before adding the ids to their respective lists like:

if recipe_crystal_exists(recipe_id, crystal_id):
    raise ValueError(f"Recipe {recipe_id} and crystal {crystal_id} already exist in RecipeToCrystal")

This error was never raised.

I also swapped out session.add() for session.merge(), expunged the session, and tried deleting the database altogether and starting over.
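
For reference, the merge swap looked roughly like this (a sketch, not my exact code):

# Sketch: merge() instead of add() in the final loop of add_recipe_crystals()
rec_2_crys = RecipeToCrystal(
    recipe_id=row["recipe_id"],
    crystal_id=row["crystal_id"],
    amount=row["amount"]
)
# merge() looks the primary key up in the session/database and updates the
# existing row instead of inserting a new one when the key already exists
session.merge(rec_2_crys)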

Conclusion

The error still persists, however, and I am left stumped. The weirdest thing is that add_recipe_materials(), which is functionally identical and runs before add_recipe_crystals(), does not have this problem. Otherwise, I would expect the error to be raised for it first.

What is causing this error, and is there any way to see which rows specifically are the duplicates, to better diagnose the problem? Any and all help would be appreciated.

Comments

  • It seems you are testing for duplicates by checking whether crys_1_name and crys_2_name are the same within a single row. Instead, you should check whether there are duplicates among the rows: if two rows have the same name and the same crys_1_name and/or crys_2_name, you would see the duplicate error as expected. – SyedNakibHossain
  • @SyedNakibHossain Ah, you're right. There was exactly one repeated row. It seems that while cleaning the data and making the CSV, I made an error when extracting what were supposed to be two separate items by the names they include. The latter's name was a composite of the former's, so the extraction just picked up the first name twice.
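
For anyone hitting the same issue, the among-rows check the comment describes can be as simple as this sketch (rows sharing a recipe name resolve to the same recipe_id, so their crystal pairs collide in the junction table):

# Sketch: find rows of recipe_df that share a recipe name
dupes = recipe_df[recipe_df.duplicated(subset=["name"], keep=False)]
print(dupes[["name", "crys_1_name", "crys_2_name"]])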
