182. Duplicate Emails

March 8, 2025

Description

Pandas Schema

data = [[1, 'a@b.com'], [2, 'c@d.com'], [3, 'a@b.com']]
person = pd.DataFrame(data, columns=['id', 'email']).astype({'id':'Int64', 'email':'object'})

Table: Person

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| email       | varchar |
+-------------+---------+

id is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.

Write a solution to report all the duplicate emails. Note that it’s guaranteed that the email field is not NULL.
Return the result table in any order.
The result format is in the following example.

Example 1:

Input:

 
  Person table:
  +----+---------+
  | id | email   |
  +----+---------+
  | 1  | a@b.com |
  | 2  | c@d.com |
  | 3  | a@b.com |
  +----+---------+

Output:

  +---------+
  | Email   |
  +---------+
  | a@b.com |
  +---------+

Explanation: a@b.com is repeated two times.

Submitted Code

import pandas as pd

def duplicate_emails(person: pd.DataFrame) -> pd.DataFrame:
    # email 행 이름 변경
    person.rename(columns={"email": "Email"}, inplace=True)

    # 고유한 email 제거 및 중복 email은 첫 번째만 남기고 제거
    email_filtered = person[person.duplicated(subset=["Email"], keep=False)].drop_duplicates(subset=["Email"], keep="first")

    # id 열 제거
    email_filtered.drop(columns=["id"], inplace=True)
    
    return email_filtered

Runtime: 431 ms | Beats 62.94%
Memory: 66.17 MB | Beats 88.33%

먼저 .duplicated()에 keep=False로 중복되지 않은 이메일만 False로 만들어서 없앤 후, .drop_duplicates()에 keep="first"로 중복 이메일의 첫 번째만 남기고 모두 지우는 방식으로 해결했다.

Other Solutions

1st

import pandas as pd

def duplicate_emails(person: pd.DataFrame) -> pd.DataFrame:
    """
    This function identifies duplicate email addresses in a given DataFrame.
    
    Parameters:
    - person: A pandas DataFrame containing at least one column named 'email'.

    Returns:
    - A DataFrame containing unique duplicate email addresses.
    """
    
    # Use `duplicated` to find rows where the 'email' value is duplicated.
    # `duplicated('email')` returns True for rows where 'email' has already appeared earlier in the DataFrame.
    # `loc` filters only those rows where `duplicated` evaluates to True.
    duplicate_rows = person.loc[person.duplicated('email') == True]
    
    # Select the 'email' column and drop duplicate email entries to ensure unique duplicates are listed.
    result = duplicate_rows[['email']].drop_duplicates(keep='first')
    
    return result

2nd

# 1
SELECT email from Person  
group by email              -- 이메일을 기준으로 그룹화
having count(email) > 1;    -- 그룹화된 데이터 중 이메일이 두 번 이상 나타나는 경우만 선택

# 2.
SELECT DISTINCT(p1.email) from Person p1, Person p2 -- 테이블을 자기 자신과 조인 / DISTINCT로 중복 제거
where p1.id <> p2.id AND p1.email = p2.email;       -- 자기 자신과 비교시 id가 다르게 조건 추가 후
                                                    -- 이메일이 같은 두 사람 찾기

# 3. 
SELECT DISTINCT(p1.email) from 
Person p1 JOIN Person p2 ON
p1.email = p2.email AND p1.id <> p2.id;

Share on

X Facebook Reddit LinkedIn

Jooyeun Seo

182. Duplicate Emails

Description

Submitted Code

Other Solutions

1st

2nd

Share on

Leave a comment

You may also enjoy

500. Keyboard Row

496. Next Greater Element I

495. Teemo Attacking

492. Construct the Rectangle