How to Append Data from Selenium to a Pandas DataFrame Without Overwriting Existing Values

Working with Pandas DataFrames in a For Loop: A Deep Dive into Append Operations

In this article, we will explore the intricacies of working with pandas DataFrames in a for loop, specifically focusing on append operations. We will delve into the reasons behind the failure to append a dictionary fetched from Selenium and provide an example solution.

Introduction

Pandas is a powerful library used for data manipulation and analysis in Python. Its DataFrame data structure is particularly useful for storing and manipulating tabular data. A common task is adding rows to a DataFrame inside a for loop, for example while collecting links from a web page with Selenium, and then saving the result to a CSV file.

Understanding Pandas Append Operation

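In pandas, new rows are added to a DataFrame with pd.concat, which takes a list of DataFrames and returns a new DataFrame containing the combined rows. The older DataFrame.append method did the same job, but it was deprecated in pandas 1.4 and removed in pandas 2.0. Neither one modifies the original DataFrame in place, so the result must be assigned to a variable.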

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
df2 = pd.DataFrame({'Name': ['Peter', 'Linda'], 'Age': [35, 32]})

# Append df2 to df1
new_df = pd.concat([df1, df2], ignore_index=True)
print(new_df)

Output:

    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32
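The same pattern extends to rows gathered inside a loop. The following is a minimal sketch, assuming the rows arrive one at a time as dictionaries (the names and ages are made-up sample data): it collects them in a list and concatenates once, and it shows that the original DataFrame is left unchanged unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# Hypothetical rows gathered one by one inside a loop
new_rows = []
for name, age in [('Peter', 35), ('Linda', 32)]:
    new_rows.append({'Name': name, 'Age': age})

# pd.concat returns a new DataFrame; df itself is not modified
combined = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

print(len(df))        # the original still has 2 rows
print(len(combined))  # the combined DataFrame has 4 rows
```

Collecting rows in a list and concatenating once also avoids copying the whole DataFrame on every iteration.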

Why Append Fails with Selenium Fetched Data

When you append Selenium-scraped data inside a for loop, the operation can appear to fail even though pandas is doing exactly what it was asked to do. The cause usually lies in how the result is stored and written back to disk, not in how pandas handles duplicate values.

pd.concat (like the removed DataFrame.append) returns a new DataFrame and leaves the original untouched, so the result has to be assigned back on every iteration. It also never replaces existing rows: if the new data repeats a value that is already present, the repeated row is simply added at the end. Passing ignore_index=True only renumbers the index of the result.

In this case, we have a CSV file containing manual entries (the first 8 rows) and an empty film_list DataFrame. When we append data fetched from Selenium to the film_list, it appears as if the append operation is failing because some duplicate links are being added to the end of the DataFrame.
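A quick way to check how concatenation treats repeated values is to run it on a small example. The sketch below uses made-up example.com links (not the real scraped data) and shows that repeated rows are kept as-is. If duplicates are unwanted, drop_duplicates can remove them afterwards.

```python
import pandas as pd

# Made-up links standing in for the manual entries and the scraped data
existing = pd.DataFrame({"link": ["https://example.com/film/1",
                                  "https://example.com/film/2"]})
scraped = pd.DataFrame({"link": ["https://example.com/film/2",
                                 "https://example.com/film/3"]})

# Concatenation keeps every row, including the repeated film/2 link
combined = pd.concat([existing, scraped], ignore_index=True)
print(len(combined))  # 4 rows

# Keep only the first occurrence of each link and renumber the index
deduped = combined.drop_duplicates(subset="link", keep="first").reset_index(drop=True)
print(len(deduped))   # 3 rows
```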

Solution: Using the ‘a’ Mode

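The solution lies in changing the mode used when writing to the CSV file. DataFrame.to_csv defaults to mode='w', which overwrites the file; passing mode='a' (for "append") adds the new rows to the end of the file instead. When appending, write only the new rows and pass header=False so that the existing rows and the header line are not written a second time.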
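The difference between the two modes can be seen on a throwaway file. This is a minimal sketch, assuming a temporary file named film_list_demo.csv and made-up example.com links, both chosen only for illustration:

```python
import os
import tempfile

import pandas as pd

# Throwaway file so the demo does not touch a real CSV
path = os.path.join(tempfile.mkdtemp(), "film_list_demo.csv")

first = pd.DataFrame({"link": ["https://example.com/film/1"]})
second = pd.DataFrame({"link": ["https://example.com/film/2"]}, index=[1])

# Default mode='w': creates (or overwrites) the file and writes the header
first.to_csv(path)

# mode='a': adds rows to the end; header=False avoids a second header line
second.to_csv(path, mode="a", header=False)

result = pd.read_csv(path, index_col=0)
print(result)
```

Calling second.to_csv(path) without mode="a" would instead replace the file, leaving only the second row.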

Here is an updated example:

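import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Create a DataFrame from the CSV file and remember how many rows it already has
film_list = pd.read_csv("film_list.csv", index_col=0)
n_existing = len(film_list)

# Initialize the Chrome driver and navigate to the website
# (the raw string keeps the backslashes in the Windows path literal)
service = Service(executable_path=r"D:\Documents\ADAM\Project\CSFD Bot\chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.get("https://www.csfd.cz/zebricky/nejlepsi-filmy/?show=complete")

# Find all elements on the page that contain 'film' in their href attribute
elems = driver.find_elements(By.XPATH, "//a[contains(@href, 'film')]")

# Loop through each element and collect its link
scraped_rows = []
for elem in elems:
    scraped_link = elem.get_attribute("href")
    scraped_rows.append({"link": scraped_link})

# Close the Chrome driver
driver.quit()

# Add the scraped rows to the DataFrame in one step, assigning the result back
film_list_updated = pd.DataFrame(scraped_rows, columns=film_list.columns)
film_list = pd.concat([film_list, film_list_updated], ignore_index=True)

# Append only the new rows to the CSV file using the 'a' mode;
# writing the whole DataFrame here would duplicate the rows already in the file
film_list.iloc[n_existing:].to_csv("film_list.csv", mode='a', header=False)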

Output:

link
https://www.csfd.cz/film/231260-star-wars-klon
https://www.csfd.cz/film/820012-drsny-mesto/pr
https://www.csfd.cz/film/902757-damsky-gambit/
https://www.csfd.cz/film/622365-the-mandaloria
https://www.csfd.cz/film/281929-borat-subseque
https://www.csfd.cz/film/818525-delete-history
https://www.csfd.cz/film/4952-kocar-do-vidne/p
https://www.csfd.cz/film/823303-last-and-first
https://www.csfd.cz/film/43582-posledni-samuraj/
https://www.csfd.cz/film/43582-posledni-samuraj/
https://www.csfd.cz/film/43582-posledni-samuraj/

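With mode='a', to_csv adds rows to the end of the existing file instead of overwriting it, so the manual entries are preserved. Note that appending does not remove duplicates: a link that appears more than once on the page, or that is already in the file, is written again, which is consistent with the repeated last link in the output above.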
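One practical detail with append mode is the header line: it should be written once, when the file is first created, and skipped afterwards. The helper below is a hypothetical sketch (append_rows is not a pandas function, and the file name and links are made up) that checks whether the file already has content before deciding.

```python
import os
import tempfile

import pandas as pd


def append_rows(df, path):
    """Append df to the CSV at path, writing the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode="a", header=write_header, index=False)


# Throwaway file for the demo
demo_path = os.path.join(tempfile.mkdtemp(), "links_demo.csv")

append_rows(pd.DataFrame({"link": ["https://example.com/film/1"]}), demo_path)
append_rows(pd.DataFrame({"link": ["https://example.com/film/2"]}), demo_path)

saved = pd.read_csv(demo_path)
print(saved)
```

This sketch writes the file without an index column (index=False), which differs from the article's example; that is a simplification so the helper does not need to track row numbers.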

Conclusion

Working with pandas DataFrames in a for loop and appending data can be challenging, especially when duplicate values are involved. Two points resolve most of the confusion: pd.concat returns a new DataFrame that must be assigned back, and DataFrame.to_csv overwrites the file unless it is called with mode='a'. In this article, we explored why append operations can appear to fail with Selenium-fetched data and provided an example solution that appends the new rows to the CSV file without overwriting the existing ones.


Last modified on 2024-08-09