Pandas Practice: A Winning Example from a Data Science Competition
This example analyzes daily sales data as a time series; by working through it, you can learn data analysis and visualization with pandas.
How to use matplotlib and seaborn was covered in an earlier post; through this Kaggle example, you can learn how the pandas library is used in practice.
The data source is at the link below. It is not large, so just download it, unzip it into your Jupyter Notebook folder, and run the example.
https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales/data
The example we will work through is a notebook titled "The Power of Pandas"; the link is below.
https://www.kaggle.com/code/gehallak/the-power-of-pandas
Once you have opened both links above, unzip the data and upload the files to your Jupyter Notebook home page or to a folder you create.
I have kept the notebook file with its executed outputs as-is, so follow along with it as you practice.
In this Notebook, we will show how far we can go with Pandas. In this time series prediction competition, we won't use ARIMA or LSTM. Only Pandas and some basic arithmetic.
1 | Looking at the data, understanding the competition.
First, we load train and test DataFrames
import pandas as pd
train = pd.read_csv('./sales_train.csv')
test = pd.read_csv('./test.csv')
We will have a look at train first:
print('Number of rows and columns:',train.shape)
print('')
print ('Columns :')
print(train.dtypes)
print('')
print ('First rows:')
print(train.head(5))
Number of rows and columns: (2935849, 6)
Columns :
date object
date_block_num int64
shop_id int64
item_id int64
item_price float64
item_cnt_day float64
dtype: object
First rows:
date date_block_num shop_id item_id item_price item_cnt_day
0 02.01.2013 0 59 22154 999.00 1.0
1 03.01.2013 0 25 2552 899.00 1.0
2 05.01.2013 0 25 2552 899.00 -1.0
3 06.01.2013 0 25 2554 1709.05 1.0
4 15.01.2013 0 25 2555 1099.00 1.0
There are only 6 columns, but a lot of rows, almost 3 million. Each row represents a triplet: date, shop_id, item_id. Each row indicates, for a certain day, a certain shop, and a certain item, how many units were sold and at what price.
print(train['shop_id'].nunique())
print(train['item_id'].nunique())
60
21807
There are 60 shops and 21807 items in the training set.
print(train['date'].nunique())
print(train['date_block_num'].nunique())
1034
34
There are 1034 days in the training set. They are grouped into months indexed by date_block_num. There are 34 months (0 to 33).
The target of the competition is to predict, for a list of pairs (shop, item), the number of items sold during the 35th month (November 2015).
print(test.shape)
print(test.dtypes)
(214200, 3)
ID int64
shop_id int64
item_id int64
dtype: object
The test file indicates all the pairs (shop, item) for which we need to predict the number of items sold during the 35th month. First, let's compare the pairs (shop, item) in the test file and in the training file.
print(test['shop_id'].nunique())
print(test['item_id'].nunique())
42
5100
We can see that there are fewer shops and fewer items in the test file than in the train file. Let's try to understand why.
2 | Evolution of items and shops
As the target of the competition is a prediction of monthly sales, we regroup the sales by date_block_num and sum over all the days within a month.
train_monthly=train.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day':'sum'})
train_monthly.columns=['item_cnt_month']
train_monthly=train_monthly.reset_index()
print(train_monthly.shape)
print(train_monthly.head(3))
(1609124, 4)
date_block_num shop_id item_id item_cnt_month
0 0 0 32 6.0
1 0 0 33 3.0
2 0 0 35 1.0
We might expect the number of rows to be about 30 times smaller after regrouping into months, but it only dropped from 3 million to 1.6 million. This probably indicates that many pairs (shop, item) appear only once a month.
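We can sanity-check this directly. The snippet below is my addition, not part of the original notebook; given the shapes above, the mean must be about 2935849 / 1609124 ≈ 1.8 daily rows per monthly triplet.
# Sanity check (my addition): daily rows contributed by each monthly triplet
rows_per_triplet = train.groupby(['date_block_num', 'shop_id', 'item_id']).size()
print('mean daily rows per triplet: {:.2f}'.format(rows_per_triplet.mean()))
print('share of triplets with a single daily row: {:.1%}'.format((rows_per_triplet == 1).mean()))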
We will first analyse the evolution of the number of shops and items during the 34 months in the training file.
import matplotlib.pyplot as plt
shop_item_history=train_monthly.groupby(['date_block_num']).agg({'shop_id':'nunique','item_id':'nunique'})
plt.rcParams["figure.figsize"] = (20,5)
fig, axes = plt.subplots(nrows=1, ncols=2)
shop_item_history.plot(y='shop_id',label='number of shops',ax=axes[0])
shop_item_history.plot(y='item_id',label='number of items',ax=axes[1])
plt.show()
We can see that the number of shops and the number of items have been dropping during the 34 months of the training data.
We also notice that there are 21807 items in total in the training set, but the graph shows that there are never more than about 8500 active at any given moment. This indicates a lot of rotation in the items sold, maybe due to new versions coming out all the time, as the company sells software.
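The 8500 figure is read off the graph; a quick check (my addition, not in the original notebook) confirms the maximum number of distinct items active in any single month:
print('max items active in one month:', shop_item_history['item_id'].max())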
print(shop_item_history.tail())
print('There are {} shops and {} items in the test data'.format(test['shop_id'].nunique(),test['item_id'].nunique()) )
shop_id item_id
date_block_num
29 43 5216
30 43 5323
31 42 5108
32 43 5085
33 44 5413
There are 42 shops and 5100 items in the test data
The train data ends with 44 shops and 5413 items, while the test data has 42 shops and 5100 items. We can suspect that the business is not doing well: shops are closing and items are being removed.
We need to check whether new items were introduced or new shops opened in the test month. This would be a problem, because we would have nothing in the train data to help us predict them.
shops_train = list(train_monthly['shop_id'].unique())
shops_test = list(test['shop_id'].unique())
items_train = list(train_monthly['item_id'].unique())
items_test = list(test['item_id'].unique())
print('All shops in the test file are in the train file : {}'.format(set(shops_test).issubset(set(shops_train))))
print('All items in the test file are in the train file : {}'.format(set(items_test).issubset(set(items_train))))
All shops in the test file are in the train file : True
All items in the test file are in the train file : False
Some items have been newly introduced in the test file. We need to deal with them. Let's see how many new items we have.
new_items = set(items_test).difference(items_train)
print ('There are {} new items out of {} in the test file. This is {:.1f}%'.format(len(new_items),len(items_test),len(new_items)/len(items_test)*100.0))
There are 363 new items out of 5100 in the test file. This is 7.1%
3 | (Item, shop) pairs
Let's now check the pairs (shop_id, item_id). In the test file, there are 42 shops and 5100 items. This makes 42 x 5100 = 214200 possible pairs, and there are 214200 lines in the test file, so nothing is missing. The target of the competition is to predict the sales of all the items for all the shops in the test file. But an important question is, how many of these pairs are present in the train file? We have already determined that 100% of the shops in the test are in the train and that 92.9% of the items in the test are in the train. But what about the pairs?
train_mean = train_monthly.groupby(['shop_id','item_id']).agg({'item_cnt_month':'mean'}).reset_index()
print ('There are {} pairs (shop_id,item_id) in the train file'.format(len(train_mean)))
There are 424124 pairs (shop_id,item_id) in the train file
shop_item_train = train_mean[['shop_id','item_id']].apply(tuple, axis=1).tolist()
shop_item_test = test[['shop_id','item_id']].apply(tuple, axis=1).tolist()
new_shop_item = set(shop_item_test).difference(shop_item_train)
print ('There are {} new pairs (shop,item) out of {} in the test file. This is {:.1f}%'.format(len(new_shop_item),len(shop_item_test),len(new_shop_item)/len(shop_item_test)*100.0))
There are 102796 new pairs (shop,item) out of 214200 in the test file. This is 48.0%
This is bad news. 48% of the pairs we have to predict never occurred in the train file. When a pair (shop, item) is in the test file but not in the train file, does it mean that the item was never sold in that shop during the training period? Theoretically, yes. Can we then safely predict zero for a pair we have never seen? The submission file will then contain a lot of zeroes (at least 48%). This is probably not the best strategy, but we will try it first and see how it goes. We can refine it later.
Now enough of this bad news; let's come back to something easier. How did the total number of items sold across all the shops evolve during the training period?
4 | Business evolution
train_total_monthly=train_monthly.groupby(['date_block_num']).agg({'item_cnt_month':'sum'}).reset_index()
train_total_monthly.plot(y='item_cnt_month')
plt.show()
This curve is screaming non-stationarity: there is seasonality (a 12-month cycle) and a downtrend. Because the training time series is so short (only 34 time steps), we will try to deal with the trend and the seasonality manually, without using ARIMA or LSTM.
5 | Prediction
We will fit a straight line to the curve.
from scipy.optimize import curve_fit
def f(x, A, B):  # 'straight line' y = f(x)
    return A*x + B
popt, pcov = curve_fit(f, train_total_monthly['date_block_num'],train_total_monthly['item_cnt_month']) # data x, y to fit
print ('The fitting line equation is: item_cnt_month = {:.0f} x date_block_num + {:.0f}'.format(popt[0], popt[1]))
The fitting line equation is: item_cnt_month = -1944 x date_block_num + 139368
plt.plot(train_total_monthly['date_block_num'], train_total_monthly['item_cnt_month'],label='item_cnt_month')
# Plot another line on the same chart/graph
plt.plot(train_total_monthly['date_block_num'], f(train_total_monthly['date_block_num'],popt[0],popt[1]),label='fitting line')
plt.legend(loc="upper right")
plt.show()
print('Based on extrapolation, we expect a total sales count of {:.0f} for the test month'.format(f(34,popt[0],popt[1])))
Based on extrapolation, we expect a total sales count of 73288 for the test month
Now, what about the seasonality? The month to predict is November. Let's look at the November/October ratio in our training data. date_block_num = 0 is Jan 2013, so Oct 2013 is 9 and Nov 2013 is 10; Oct 2014 is 21 and Nov 2014 is 22; Oct 2015 is 33 and Nov 2015 is 34. Month 34 is the one we have to predict.
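To make this index arithmetic explicit, here is a small helper (my addition, assuming only that block 0 is January 2013, as stated above):
def block_to_year_month(block):
    # date_block_num 0 corresponds to January 2013
    return 2013 + block // 12, block % 12 + 1

for block in (9, 10, 21, 22, 33, 34):
    print(block, '->', block_to_year_month(block))  # e.g. 9 -> (2013, 10), 34 -> (2015, 11)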
print ('Nov13 / Oct13 is {:.4f}'.format(train_total_monthly.iloc[10]['item_cnt_month']/train_total_monthly.iloc[9]['item_cnt_month']))
print ('Nov14 / Oct14 is {:.4f}'.format(train_total_monthly.iloc[22]['item_cnt_month']/train_total_monthly.iloc[21]['item_cnt_month']))
Nov13 / Oct13 is 1.0194
Nov14 / Oct14 is 1.0970
In November, the sales are higher than in October. On average, November/October = 1.0582. This ratio captures the trend and seasonality together. We can see that the seasonality is stronger than the trend here, as November is higher than October even though the general trend is down. December is even stronger, but here we just have to focus on November.
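The 1.0582 figure is simply the mean of the two ratios printed above; making the arithmetic explicit (my addition):
ratio_13 = train_total_monthly.iloc[10]['item_cnt_month'] / train_total_monthly.iloc[9]['item_cnt_month']
ratio_14 = train_total_monthly.iloc[22]['item_cnt_month'] / train_total_monthly.iloc[21]['item_cnt_month']
nov_oct_ratio = (ratio_13 + ratio_14) / 2  # (1.0194 + 1.0970) / 2 = 1.0582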
print('Based on extrapolation and seasonality, we expect a total sales count of {:.0f} for the test month'.format(1.0582*train_total_monthly.iloc[33]['item_cnt_month']))
Based on extrapolation and seasonality, we expect a total sales count of 75191 for the test month
So now, what should we do with all of this? We can try a very simple idea. We generated above the DataFrame train_mean: for each pair (shop, item) in the train set, it holds the average number of items sold per month during the training period.
print(train_mean.head(5))
print ('')
print('train_mean number of rows and columns:', train_mean.shape)
shop_id item_id item_cnt_month
0 0 30 31.0
1 0 31 11.0
2 0 32 8.0
3 0 33 3.0
4 0 35 7.5
train_mean number of rows and columns: (424124, 3)
We want to add the column item_cnt_month from train_mean to the test DataFrame wherever we find matching pairs. So we will merge test and train_mean on the pair (shop, item) but retain all the rows in test, as this is what we have to submit.
test=pd.merge(test,train_mean,on=['shop_id','item_id'],how='left')
print (test.head(10))
ID shop_id item_id item_cnt_month
0 0 5 5037 1.444444
1 1 5 5320 NaN
2 2 5 5233 2.000000
3 3 5 5232 1.000000
4 4 5 5268 NaN
5 5 5 5039 1.833333
6 6 5 5041 2.500000
7 7 5 5046 2.000000
8 8 5 5319 5.727273
9 9 5 5003 NaN
We can see a lot of NaN: exactly 48%, as we already calculated. Let's check this.
print('there are {} NaN. That is {:.1f}% of the rows.'.format(test['item_cnt_month'].isnull().sum(),test['item_cnt_month'].isnull().sum()/len(test)*100.0))
there are 102796 NaN. That is 48.0% of the rows.
Yahoo! It is always good to check that there are no mistakes. We are going to fill the NaN with zeroes and clip to (0, 20) as instructed.
test.fillna(0,inplace=True)
test['item_cnt_month'] = test['item_cnt_month'].clip(0, 20)
Now we want to see the total number of items sold in the test DataFrame.
print('There are {:.0f} items sold in total in the test DataFrame.'.format(test['item_cnt_month'].sum()))
There are 202192 items sold in total in the test DataFrame.
Using this method, we get 202k items sold during Nov 2015, while the average monthly count during the training period is roughly 110k. There is probably a selection process going on: items that were not selling have been removed, so the test set ends up containing the pairs (shop, item) with the highest sales.
202k is way too high, as we are expecting a total of 75k based on trend and seasonality. So let's scale these values down proportionally.
test['item_cnt_month']= test['item_cnt_month'] * 75191.0 / 202192.0
6 | Submission
Let's try to submit these numbers!
submission=test[['ID','item_cnt_month']]
submission.to_csv('submission.csv', index=False)
RMSE = 1.1413. Not bad, with only a few pandas manipulations and some basic arithmetic!
To further improve the score, we would need to look at all the new pairs (shop, item) that appear in the test file but not in the train file. We predicted zero for them, but we should somehow connect them to the train file using the item categories.
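One possible direction, sketched below (my addition, not implemented in the notebook): the competition data also includes items.csv, which maps each item_id to an item_category_id, so an unseen item could inherit the average monthly sales of its category within the same shop instead of a flat zero.
# Sketch of a category-level back-off for unseen items (assumes items.csv from
# the competition data is in the working directory)
items = pd.read_csv('./items.csv')
cat_monthly = train_monthly.merge(items[['item_id', 'item_category_id']], on='item_id')
cat_mean = (cat_monthly.groupby(['shop_id', 'item_category_id'])['item_cnt_month']
            .mean().rename('cat_cnt_month').reset_index())
# cat_mean could then be merged into test for the rows where item_cnt_month is NaN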
Further reading:
- Data Science Best Practices with Pandas, Part 1: https://www.kaggle.com/code/faressayah/data-science-best-practices-with-pandas-part-1
- Data Science Best Practices with Pandas, Part 2: https://www.kaggle.com/code/faressayah/data-science-best-practices-with-pandas-part-2