Scraping Tabular Data with Python: A Step-by-Step Guide to Writing to CSV

Writing tabular data to a CSV file from a webpage

In this article, we will explore how to scrape tabular data from a webpage using Python and write it to a CSV file. We will delve into the details of how read_html returns multiple DataFrames and how to concatenate them.

Scraping Tabular Data from a Webpage

When scraping tabular data from a webpage, we often encounter multiple tables with different structures. In our previous article, we discussed how to parse all tables on a page using BeautifulSoup and pandas. However, when writing the data to a CSV file, we encountered an issue where only the last table’s data was being written.
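
The usual culprit is calling to_csv inside the scraping loop, which overwrites the file on every iteration so that only the final table survives. A minimal sketch of the bug and two common fixes, using stand-in DataFrames in place of scraped tables:

```python
import pandas as pd

# Stand-ins for tables scraped from a page
tables = [
    pd.DataFrame({'a': [1, 2]}),
    pd.DataFrame({'a': [3, 4]}),
]

# Buggy pattern: each iteration overwrites out.csv,
# so only the last table's rows remain afterwards.
for df in tables:
    df.to_csv('out.csv', index=False)

# Fix 1: append mode, writing the header only once.
for i, df in enumerate(tables):
    df.to_csv('out.csv', mode='w' if i == 0 else 'a',
              header=(i == 0), index=False)

# Fix 2 (usually cleaner): collect everything, write once.
pd.concat(tables, ignore_index=True).to_csv('out.csv', index=False)
```

Both fixes leave out.csv holding all four rows; the rest of this article follows the collect-then-write-once approach.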

Understanding read_html

The pd.read_html function returns a list of DataFrames, each representing a table in the webpage. This is useful because it allows us to process each table separately and then combine the results.

For example, let’s consider the following HTML snippet:

<table>
    <tr><td>Column 1</td><td>Column 2</td></tr>
    <tr><td>Row 1,1</td><td>Row 1,2</td></tr>
    <tr><td>Row 2,1</td><td>Row 2,2</td></tr>
</table>

<table>
    <tr><td>Column 3</td><td>Column 4</td></tr>
    <tr><td>Row 3,1</td><td>Row 3,2</td></tr>
    <tr><td>Row 4,1</td><td>Row 4,2</td></tr>
</table>

Using pd.read_html, we can extract a DataFrame for each table. read_html accepts a URL, a file path, or literal HTML; recent pandas versions expect literal HTML to be wrapped in io.StringIO:

from io import StringIO

import pandas as pd

# html is the two-table snippet above, stored as a string
dfs = pd.read_html(StringIO(html), header=0)
print(len(dfs))
print(dfs[0])

Output:

2
  Column 1 Column 2
0  Row 1,1  Row 1,2
1  Row 2,1  Row 2,2

As you can see, read_html returns a list of two DataFrames, one per table. We can now process each DataFrame separately and then combine the results.

Concatenating DataFrames

To concatenate multiple DataFrames, we use the pd.concat function. This function takes a list of DataFrames as input and returns a new DataFrame that contains all the data from the original DataFrames.

For example:

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'Column 1': [1, 2], 'Column 2': [3, 4]})
df2 = pd.DataFrame({'Column 1': [5, 6], 'Column 2': [7, 8]})

# Concatenate the DataFrames, renumbering the rows so the
# index runs 0..3 instead of repeating 0..1 from each frame
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)

Output:

   Column 1  Column 2
0         1         3
1         2         4
2         5         7
3         6         8

As you can see, pd.concat has combined the data from both DataFrames into a single DataFrame.
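
When the scraped tables do not share the same structure, pd.concat still works: it aligns columns by name and fills the gaps with NaN. A small sketch (the column names here are illustrative):

```python
import pandas as pd

# Two tables with only partially overlapping columns
left = pd.DataFrame({'Column 1': [1, 2], 'Column 2': [3, 4]})
right = pd.DataFrame({'Column 1': [5, 6], 'Column 3': [7, 8]})

# concat aligns on column names; cells a table lacked become NaN
combined = pd.concat([left, right], ignore_index=True)
print(combined)
```

Here Column 2 is NaN for the rows that came from right, and Column 3 is NaN for the rows that came from left; the shared Column 1 is filled throughout.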

Writing to CSV

Once we have concatenated all the DataFrames, we can write the result to a CSV file using the to_csv function:

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'Column 1': [1, 2], 'Column 2': [3, 4]})
df2 = pd.DataFrame({'Column 1': [5, 6], 'Column 2': [7, 8]})

# Concatenate the DataFrames
df_concat = pd.concat([df1, df2])

# Write to CSV
df_concat.to_csv('result.csv', index=False)

Output:

Column 1,Column 2
1,3
2,4
5,7
6,8

As you can see, the result has been written to a CSV file named result.csv.
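
A quick way to confirm the export worked is to read the file back with pd.read_csv and compare it to the original frame; a sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'Column 1': [1, 2, 5, 6], 'Column 2': [3, 4, 7, 8]})
df.to_csv('result.csv', index=False)

# Round-trip check: read the file back and compare
restored = pd.read_csv('result.csv')
print(restored.equals(df))  # True
```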

Example Use Case: Scraping NBA Schedules

Let’s consider an example where we want to scrape the schedules of NBA teams from ESPN. We can use the following code to achieve this:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Get the HTML content of the webpage
url = 'http://www.espn.com/nba/schedule/_/date/20171001'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

# Find all tables on the page
tables = soup.find_all('table')

# Create a list to store the DataFrames
dfs = []

# Loop through each table and extract the DataFrame, wrapping
# the markup in StringIO as recent pandas versions require
for table in tables:
    df = pd.read_html(StringIO(str(table)))[0]
    dfs.append(df)

# Concatenate the DataFrames
df_concat = pd.concat(dfs)

# Write to CSV
df_concat.to_csv('nba_schedule.csv', index=False)

Running this script writes every table on the schedule page to a file named nba_schedule.csv. The exact columns depend on ESPN's markup at the time of scraping, so inspect the first few lines of the file rather than assuming a fixed layout; if the tables use different headers, pd.concat will align them by column name and fill the gaps with NaN.

Conclusion

In this article, we have explored how to scrape tabular data from a webpage using Python and write it to a CSV file. We used read_html to extract a list of DataFrames, combined them with pd.concat, and wrote the result to a file with to_csv.

By following these steps, you can scrape tabular data from any webpage that presents it in HTML tables and write it to a CSV file.


Last modified on 2023-08-23