Description

Pandas Schema

data = [[1, 'S8', 1000], [2, 'G4', 800], [3, 'iPhone', 1400]]
product = pd.DataFrame(data, columns=['product_id', 'product_name', 'unit_price']).astype({'product_id':'Int64', 'product_name':'object', 'unit_price':'Int64'})
data = [[1, 1, 1, '2019-01-21', 2, 2000], [1, 2, 2, '2019-02-17', 1, 800], [2, 2, 3, '2019-06-02', 1, 800], [3, 3, 4, '2019-05-13', 2, 2800]]
sales = pd.DataFrame(data, columns=['seller_id', 'product_id', 'buyer_id', 'sale_date', 'quantity', 'price']).astype({'seller_id':'Int64', 'product_id':'Int64', 'buyer_id':'Int64', 'sale_date':'datetime64[ns]', 'quantity':'Int64', 'price':'Int64'})


Table: Product

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| product_id   | int     |
| product_name | varchar |
| unit_price   | int     |
+--------------+---------+
product_id is the primary key (column with unique values) of this table.
Each row of this table indicates the name and the price of each product.

Table: Sales

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| seller_id   | int     |
| product_id  | int     |
| buyer_id    | int     |
| sale_date   | date    |
| quantity    | int     |
| price       | int     |
+-------------+---------+
This table can have duplicate rows.
product_id is a foreign key (reference column) to the Product table.
Each row of this table contains some information about one sale.

Write a solution to report the products that were only sold in the first quarter of 2019. That is, between 2019-01-01 and 2019-03-31 inclusive.

Return the result table in any order.

The result format is in the following example.

Example 1:

  • Input:
      Product table:
      +------------+--------------+------------+
      | product_id | product_name | unit_price |
      +------------+--------------+------------+
      | 1          | S8           | 1000       |
      | 2          | G4           | 800        |
      | 3          | iPhone       | 1400       |
      +------------+--------------+------------+
      Sales table:
      +-----------+------------+----------+------------+----------+-------+
      | seller_id | product_id | buyer_id | sale_date  | quantity | price |
      +-----------+------------+----------+------------+----------+-------+
      | 1         | 1          | 1        | 2019-01-21 | 2        | 2000  |
      | 1         | 2          | 2        | 2019-02-17 | 1        | 800   |
      | 2         | 2          | 3        | 2019-06-02 | 1        | 800   |
      | 3         | 3          | 4        | 2019-05-13 | 2        | 2800  |
      +-----------+------------+----------+------------+----------+-------+
      
  • Output:
      +-------------+--------------+
      | product_id  | product_name |
      +-------------+--------------+
      | 1           | S8           |
      +-------------+--------------+
      
  • Explanation:
    The product with id 1 was only sold in the spring of 2019.
    The product with id 2 was sold in the spring of 2019 but was also sold after the spring of 2019.
    The product with id 3 was sold after spring 2019.
    We return only product 1 as it is the product that was only sold in the spring of 2019.

Submitted Code

import pandas as pd

def sales_analysis(product: pd.DataFrame, sales: pd.DataFrame) -> pd.DataFrame:
    # 1. product와 sales를 product_id 열 기준으로 병합
    merged = product.merge(sales, on='product_id')

    # 2. sale_date를 datetime 타입으로 변환
    merged['sale_date'] = pd.to_datetime(merged['sale_date'])

    # 3. 상품별 최초 판매일 / 마지막 판매일 계산
    period = merged.groupby('product_id')['sale_date'].agg(min_date='min', max_date='max')

    # 4. 판매 기간이 전부 2019년 1분기 안에 있는 상품만 선택
    valid_products = period[
        (period['min_date'] >= '2019-01-01') & (period['max_date'] <= '2019-03-31')
    ].index

    # 5. 상품 테이블에서 해당 상품만 리포트
    result = product.loc[ product['product_id'].isin(valid_products), ['product_id', 'product_name'] ]

    return result

Runtime: 348 ms | Beats 79.47%
Memory: 70.42 MB | Beats 18.98%

두 테이블을 먼저 merge하는 방법은 효율이 떨어졌다. Sales 테이블에서 먼저 product_id를 기준으로 groupby()하고 sale_date가 조건에 맞는 상품만 찾은 뒤, 해당 상품만 Product에서 찾는 방법이 훨씬 빠르다.

Other Solutions

1st

def sales_analysis(product: pd.DataFrame, sales: pd.DataFrame) -> pd.DataFrame:
    # Group the 'sales' DataFrame by 'product_id' and calculate the minimum and maximum sale dates for each product
    sales = sales.groupby('product_id')['sale_date'].agg(['min', 'max']).reset_index()
    
    # Filter the sales data to include only records with sale dates between January 1, 2019, and March 31, 2019
    sales = sales[(sales['min'] >= '2019-01-01') & (sales['max'] <= '2019-03-31')]
    
    # Merge the filtered sales data with the 'product' DataFrame based on 'product_id', keeping only 'product_id' and 'product_name' columns
    result = pd.merge(sales, product, on='product_id', how='inner')[['product_id', 'product_name']]
    
    # Return the resulting DataFrame
    return result

날짜를 datetime으로 바꾸지 않아도 리트코드의 테스트 데이터가 YYYY-MM-DD으로 깨끗하게 통일되어 있기 때문에 문제없이 통과되는 것 같다.

2nd

SELECT p.product_id ,p.product_name
FROM product p
JOIN sales s
    ON p.product_id=s.product_id
GROUP BY p.product_id,p.product_name
HAVING MIN(s.sale_date)>= '2019-01-01' AND MAX(s.sale_date)<= '2019-03-31'

GROUP BY + HAVING을 사용한 쿼리다.

Leave a comment