Updating Columns Based on Another Column and Deleting Data from the Other
In this article, we’ll explore how to update column values based on another column in pandas. We’ll focus on two scenarios: updating one column with values from another while simultaneously deleting data from the other where conditions are met.
Background
Pandas is a powerful library for data manipulation and analysis in Python. It provides various tools for handling datasets, including data cleaning, filtering, grouping, merging, reshaping, and pivoting data. In this article, we’ll delve into two specific scenarios:
- Updating column values based on another column’s conditions.
- Deleting data from one column where a certain condition is met.
Scenario 1: Updating Column Values Based on Another Column
Suppose you have a DataFrame (df) with three columns: fname, lname, and role. The mobile and Pmobile columns contain phone numbers, which are either numeric or formatted as strings. You want to update the values in the mobile column based on the corresponding values in the Pmobile column.
The condition for updating is that a value should start with '08', '8', or '+' followed by 353’. Once updated, you also want to delete any matching rows from the Pmobile` column.
Example Code
Here’s an example code snippet using pandas:
import pandas as pd
# Create a sample DataFrame
data = {'fname': ['Anky', 'Anky', 'Tom', 'Harry', 'Harry', 'Harry'],
'lname': ['sur1', 'sur1', 'sur2', 'sur3', 'sur3', 'sur3'],
'role': ['', 'abc', 'def', 'ghi', '', 'ijk'],
'mobile': [None, None, None, None, None, None],
'Pmobile': ['08511663451212', '+4471123456', '0851166346', '', None, None]}
df = pd.DataFrame(data)
# Replace empty strings with NaN
df.replace('', np.nan, inplace=True)
print("Original DataFrame:")
print(df)
# Update mobile values based on Pmobile conditions
df.loc[df['Pmobile'].str.startswith(('08','8','+353'), na=False), ['mobile', 'Pmobile']] = df[['mobile', 'Pmobile']].shift(-1, axis=1)
df
# Delete matching rows from Pmobile column
df['Pmobile'] = df['Pmobile'].apply(lambda x: np.nan if x else None)
print("\nUpdated DataFrame:")
print(df)
Explanation
In this example:
- We create a sample DataFrame (
df) withfname,lname, androlecolumns. - We replace empty strings in the
mobilecolumn withNaN. - We update the
mobilevalues based on conditions from thePmobilecolumn using the.loc[]method. The.str.startswith()function checks if a value starts with'08','8', or'+' followed by353’. When a match is found, we shift the corresponding columns (mobileandPmobile) to the left by one position using the.shift(-1, axis=1)` method. - Finally, we delete any matching rows from the
Pmobilecolumn by replacing non-NaNvalues withNone.
Scenario 2: Deleting Data from One Column Based on Conditions
Suppose you have a DataFrame (df) with two columns: name and age. You want to delete all rows where the value in the age column is less than or equal to 18.
Here’s an example code snippet using pandas:
import pandas as pd
# Create a sample DataFrame
data = {'name': ['John', 'Mary', 'Bob', 'Alice', 'Charlie'],
'age': [20, 22, 15, 25, 17]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Delete rows where age is less than or equal to 18
df = df[df['age'] > 18]
print("\nUpdated DataFrame:")
print(df)
Explanation
In this example:
- We create a sample DataFrame (
df) withnameandagecolumns. - We print the original DataFrame.
- We delete all rows where the value in the
agecolumn is less than or equal to 18 using boolean indexing. The resulting updated DataFrame contains only rows whereageis greater than 18.
By applying these techniques, you can effectively update column values based on conditions and delete data from one column while maintaining the integrity of your dataset.
Conclusion
Updating column values based on another column’s conditions and deleting data from one column are common tasks in data manipulation. By utilizing pandas’ powerful features such as boolean indexing, the .loc[] method, and the .str.startswith() function, you can efficiently manage your datasets and extract insights from the data.
Last modified on 2023-06-05