Pandas Practice: A Winning Example from a Data Science Competition
This example analyzes daily sales data as a time series; by working through it, you can learn data analysis and visualization with pandas.
How to use matplotlib and seaborn was covered in an earlier post; through this Kaggle example, you can learn how the pandas library is used in practice.
The data source is at the link below. It is not large, so just download it, unzip it into your Jupyter Notebook folder, and run the example.
https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales/data
The example we will work through is a notebook titled "The Power of Pandas"; the link is below.
https://www.kaggle.com/code/gehallak/the-power-of-pandas
Once you have opened both links above, unzip the data and upload the files to your Jupyter Notebook home page or to a folder you create.
I have kept the notebook file with its executed outputs as-is, so follow along with it as you practice.
In this Notebook, we will show how far we can go with Pandas. In this time series prediction competition, we won't use ARIMA or LSTM. Only Pandas and some basic arithmetic.
1 | Looking at the data, understanding the competition.
First, we load train and test DataFrames
import pandas as pd
train = pd.read_csv('./sales_train.csv')
test = pd.read_csv('./test.csv')
We will have a look at train first:
print('Number of rows and columns:',train.shape)
print('')
print ('Columns :')
print(train.dtypes)
print('')
print ('First rows:')
print(train.head(5))
Number of rows and columns: (2935849, 6)
Columns :
date object
date_block_num int64
shop_id int64
item_id int64
item_price float64
item_cnt_day float64
dtype: object
First rows:
date date_block_num shop_id item_id item_price item_cnt_day
0 02.01.2013 0 59 22154 999.00 1.0
1 03.01.2013 0 25 2552 899.00 1.0
2 05.01.2013 0 25 2552 899.00 -1.0
3 06.01.2013 0 25 2554 1709.05 1.0
4 15.01.2013 0 25 2555 1099.00 1.0
There are only 6 columns, but a lot of rows, almost 3 million. Each row represents a triplet: date, shop_id, item_id. Each row indicates, for a certain day, a certain shop, and a certain item, how many units were sold and at what price.
print(train['shop_id'].nunique())
print(train['item_id'].nunique())
60
21807
There are 60 shops and 21807 items in the training set.
print(train['date'].nunique())
print(train['date_block_num'].nunique())
1034
34
There are 1034 days in the training set. They are grouped into months indexed by date_block_num. There are 34 months (0 to 33).
The target of the competition is to predict, for a list of pairs (shop, item), the number of items sold during the 35th month (November 2015).
print(test.shape)
print(test.dtypes)
(214200, 3)
ID int64
shop_id int64
item_id int64
dtype: object
The test file indicates all the pairs (shop, item) for which we need to predict the number of items sold during the 35th month. First, let's compare the pairs (shop, item) in the test file and in the training file.
print(test['shop_id'].nunique())
print(test['item_id'].nunique())
42
5100
We can see that there are fewer shops and fewer items in the test file than in the train file. Let's try to understand why.
2 | Evolution of items and shops
As the target of the competition is a prediction of monthly sales, we regroup the sales by date_block_num and sum over all the days within a month.
train_monthly=train.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day':'sum'})
train_monthly.columns=['item_cnt_month']
train_monthly=train_monthly.reset_index()
print(train_monthly.shape)
print(train_monthly.head(3))
(1609124, 4)
date_block_num shop_id item_id item_cnt_month
0 0 0 32 6.0
1 0 0 33 3.0
2 0 0 35 1.0
We might expect the number of rows to be about 30 times smaller after regrouping into months, but it only dropped from 3 million to 1.6 million. This probably indicates that many pairs (shop, item) appear only once a month.
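We can sanity-check this directly. The snippet below is my addition, not part of the original notebook; given the shapes above, the mean must be about 2935849 / 1609124 ≈ 1.8 daily rows per monthly triplet.
# Sanity check (my addition): daily rows contributed by each monthly triplet
rows_per_triplet = train.groupby(['date_block_num', 'shop_id', 'item_id']).size()
print('mean daily rows per triplet: {:.2f}'.format(rows_per_triplet.mean()))
print('share of triplets with a single daily row: {:.1%}'.format((rows_per_triplet == 1).mean()))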
We will first analyse the evolution of the number of shops and items during the 34 months in the training file.
import matplotlib.pyplot as plt
shop_item_history=train_monthly.groupby(['date_block_num']).agg({'shop_id':'nunique','item_id':'nunique'})
plt.rcParams["figure.figsize"] = (20,5)
fig, axes = plt.subplots(nrows=1, ncols=2)
shop_item_history.plot(y='shop_id',label='number of shops',ax=axes[0])
shop_item_history.plot(y='item_id',label='number of items',ax=axes[1])
plt.show()
We can see that the number of shops and the number of items have been dropping during the 34 months of the training data.
We also notice that there are 21807 items in total in the training set, but the graph shows that there are never more than about 8500 active at any given moment. This indicates a lot of rotation in the items sold, maybe due to new versions coming out all the time, as the company sells software.
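The 8500 figure is read off the graph; a quick check (my addition, not in the original notebook) confirms the maximum number of distinct items active in any single month:
print('max items active in one month:', shop_item_history['item_id'].max())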
print(shop_item_history.tail())
print('There are {} shops and {} items in the test data'.format(test['shop_id'].nunique(),test['item_id'].nunique()) )
shop_id item_id
date_block_num
29 43 5216
30 43 5323
31 42 5108
32 43 5085
33 44 5413
There are 42 shops and 5100 items in the test data
The train data ends with 44 shops and 5413 items, while the test data has 42 shops and 5100 items. We can suspect that the business is not doing well: shops are closing and items are being removed.
We need to check whether new items were introduced or new shops opened in the test month. This would be a problem, because we would have nothing in the train data to help us predict them.
shops_train = list(train_monthly['shop_id'].unique())
shops_test = list(test['shop_id'].unique())
items_train = list(train_monthly['item_id'].unique())
items_test = list(test['item_id'].unique())
print('All shops in the test file are in the train file : {}'.format(set(shops_test).issubset(set(shops_train))))
print('All items in the test file are in the train file : {}'.format(set(items_test).issubset(set(items_train))))
All shops in the test file are in the train file : True
All items in the test file are in the train file : False
Some items have been newly introduced in the test file. We need to deal with them. Let's see how many new items we have.
new_items = set(items_test).difference(items_train)
print ('There are {} new items out of {} in the test file. This is {:.1f}%'.format(len(new_items),len(items_test),len(new_items)/len(items_test)*100.0))
There are 363 new items out of 5100 in the test file. This is 7.1%
3 | (Item, shop) pairs
Let's now check the pairs (shop_id, item_id). In the test file, there are 42 shops and 5100 items. This makes 42 x 5100 = 214200 possible pairs, and there are 214200 lines in the test file, so nothing is missing. The target of the competition is to predict the sales of all the items for all the shops in the test file. But an important question is, how many of these pairs are present in the train file? We have already determined that 100% of the shops in the test are in the train and that 92.9% of the items in the test are in the train. But what about the pairs?
train_mean = train_monthly.groupby(['shop_id','item_id']).agg({'item_cnt_month':'mean'}).reset_index()
print ('There are {} pairs (shop_id,item_id) in the train file'.format(len(train_mean)))
There are 424124 pairs (shop_id,item_id) in the train file
shop_item_train = train_mean[['shop_id','item_id']].apply(tuple, axis=1).tolist()
shop_item_test = test[['shop_id','item_id']].apply(tuple, axis=1).tolist()
new_shop_item = set(shop_item_test).difference(shop_item_train)
print ('There are {} new pairs (shop,item) out of {} in the test file. This is {:.1f}%'.format(len(new_shop_item),len(shop_item_test),len(new_shop_item)/len(shop_item_test)*100.0))
There are 102796 new pairs (shop,item) out of 214200 in the test file. This is 48.0%
This is bad news. 48% of the pairs we have to predict never occurred in the train file. When a pair (shop, item) is in the test file but not in the train file, does it mean that the item was never sold in that shop during the training period? Theoretically, yes. Can we then safely predict zero for a pair we have never seen? The submission file will then contain a lot of zeroes (at least 48%). This is probably not the best strategy, but we will try it first and see how it goes. We can refine it later.
Now enough of this bad news; let's come back to something easier. How did the total number of items sold across all the shops evolve during the training period?
4 | Business evolution
train_total_monthly=train_monthly.groupby(['date_block_num']).agg({'item_cnt_month':'sum'}).reset_index()
train_total_monthly.plot(y='item_cnt_month')
plt.show()
This curve is screaming non-stationarity: there is seasonality (a 12-month cycle) and a downtrend. Because the training time series is so short (only 34 time steps), we will try to deal with the trend and the seasonality manually, without using ARIMA or LSTM.
5 | Prediction
We will fit a straight line to the curve.
from scipy.optimize import curve_fit
def f(x, A, B):  # 'straight line' y = f(x)
    return A*x + B
popt, pcov = curve_fit(f, train_total_monthly['date_block_num'],train_total_monthly['item_cnt_month']) # data x, y to fit
print ('The fitting line equation is: item_cnt_month = {:.0f} x date_block_num + {:.0f}'.format(popt[0], popt[1]))
The fitting line equation is: item_cnt_month = -1944 x date_block_num + 139368
plt.plot(train_total_monthly['date_block_num'], train_total_monthly['item_cnt_month'],label='item_cnt_month')
# Plot another line on the same chart/graph
plt.plot(train_total_monthly['date_block_num'], f(train_total_monthly['date_block_num'],popt[0],popt[1]),label='fitting line')
plt.legend(loc="upper right")
plt.show()
print('Based on extrapolation, we expect a total sales count of {:.0f} for the test month'.format(f(34,popt[0],popt[1])))
Based on extrapolation, we expect a total sales count of 73288 for the test month
Now, what about the seasonality? The month to predict is November. Let's look at the November/October ratio in our training data. date_block_num = 0 is Jan 2013, so Oct 2013 is 9 and Nov 2013 is 10; Oct 2014 is 21 and Nov 2014 is 22; Oct 2015 is 33 and Nov 2015 is 34. Month 34 is the one we have to predict.
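To make this index arithmetic explicit, here is a small helper (my addition, assuming only that block 0 is January 2013, as stated above):
def block_to_year_month(block):
    # date_block_num 0 corresponds to January 2013
    return 2013 + block // 12, block % 12 + 1

for block in (9, 10, 21, 22, 33, 34):
    print(block, '->', block_to_year_month(block))  # e.g. 9 -> (2013, 10), 34 -> (2015, 11)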
print ('Nov13 / Oct13 is {:.4f}'.format(train_total_monthly.iloc[10]['item_cnt_month']/train_total_monthly.iloc[9]['item_cnt_month']))
print ('Nov14 / Oct14 is {:.4f}'.format(train_total_monthly.iloc[22]['item_cnt_month']/train_total_monthly.iloc[21]['item_cnt_month']))
Nov13 / Oct13 is 1.0194
Nov14 / Oct14 is 1.0970
In November, the sales are higher than in October. On average, November/October = 1.0582. This ratio captures the trend and seasonality together. We can see that the seasonality is stronger than the trend here, as November is higher than October even though the general trend is down. December is even stronger, but here we just have to focus on November.
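The 1.0582 figure is simply the mean of the two ratios printed above; making the arithmetic explicit (my addition):
ratio_13 = train_total_monthly.iloc[10]['item_cnt_month'] / train_total_monthly.iloc[9]['item_cnt_month']
ratio_14 = train_total_monthly.iloc[22]['item_cnt_month'] / train_total_monthly.iloc[21]['item_cnt_month']
nov_oct_ratio = (ratio_13 + ratio_14) / 2  # (1.0194 + 1.0970) / 2 = 1.0582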
print('Based on extrapolation and seasonality, we expect a total sales count of {:.0f} for the test month'.format(1.0582*train_total_monthly.iloc[33]['item_cnt_month']))
Based on extrapolation and seasonality, we expect a total sales count of 75191 for the test month
So now, what should we do with all of this? We can try a very simple idea. We generated above the DataFrame train_mean: for each pair (shop, item) in the train set, it holds the average number of items sold per month during the training period.
print(train_mean.head(5))
print ('')
print('train_mean number of rows and columns:', train_mean.shape)
shop_id item_id item_cnt_month
0 0 30 31.0
1 0 31 11.0
2 0 32 8.0
3 0 33 3.0
4 0 35 7.5
train_mean number of rows and columns: (424124, 3)
We want to add the column item_cnt_month from train_mean to the test DataFrame wherever we find matching pairs. So we will merge test and train_mean on the pair (shop, item) but retain all the rows in test, as this is what we have to submit.
test=pd.merge(test,train_mean,on=['shop_id','item_id'],how='left')
print (test.head(10))
ID shop_id item_id item_cnt_month
0 0 5 5037 1.444444
1 1 5 5320 NaN
2 2 5 5233 2.000000
3 3 5 5232 1.000000
4 4 5 5268 NaN
5 5 5 5039 1.833333
6 6 5 5041 2.500000
7 7 5 5046 2.000000
8 8 5 5319 5.727273
9 9 5 5003 NaN
We can see a lot of NaN: exactly 48%, as we already calculated. Let's check this.
print('there are {} NaN. That is {:.1f}% of the rows.'.format(test['item_cnt_month'].isnull().sum(),test['item_cnt_month'].isnull().sum()/len(test)*100.0))
there are 102796 NaN. That is 48.0% of the rows.
Yahoo! It is always good to check that there are no mistakes. We are going to fill the NaN with zeroes and clip to (0, 20) as instructed.
test.fillna(0,inplace=True)
test['item_cnt_month'] = test['item_cnt_month'].clip(0, 20)
Now we want to see the total number of items sold in the test DataFrame.
print('There are {:.0f} items sold in total in the test DataFrame.'.format(test['item_cnt_month'].sum()))
There are 202192 items sold in total in the test DataFrame.
Using this method, we get 202k items sold during Nov 2015, while the average monthly count during the training period is roughly 110k. There is probably a selection process going on: items that were not selling have been removed, so the test set ends up containing the pairs (shop, item) with the highest sales.
202k is way too high, as we are expecting a total of 75k based on trend and seasonality. So let's scale these values down proportionally.
test['item_cnt_month']= test['item_cnt_month'] * 75191.0 / 202192.0
6 | Submission
Let's try to submit these numbers!
submission=test[['ID','item_cnt_month']]
submission.to_csv('submission.csv', index=False)
RMSE = 1.1413. Not bad, with only a few pandas manipulations and some basic arithmetic!
To further improve the score, we would need to look at all the new pairs (shop, item) that appear in the test file but not in the train file. We predicted zero for them, but we should somehow connect them to the train file using the item categories.
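One possible direction, sketched below (my addition, not implemented in the notebook): the competition data also includes items.csv, which maps each item_id to an item_category_id, so an unseen item could inherit the average monthly sales of its category within the same shop instead of a flat zero.
# Sketch of a category-level back-off for unseen items (assumes items.csv from
# the competition data is in the working directory)
items = pd.read_csv('./items.csv')
cat_monthly = train_monthly.merge(items[['item_id', 'item_category_id']], on='item_id')
cat_mean = (cat_monthly.groupby(['shop_id', 'item_category_id'])['item_cnt_month']
            .mean().rename('cat_cnt_month').reset_index())
# cat_mean could then be merged into test for the rows where item_cnt_month is NaN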
Further reading:
- Data Science Best Practices with Pandas, Part 1: https://www.kaggle.com/code/faressayah/data-science-best-practices-with-pandas-part-1
- Data Science Best Practices with Pandas, Part 2: https://www.kaggle.com/code/faressayah/data-science-best-practices-with-pandas-part-2