LGBMRanker model: "Found input variables with inconsistent numbers of samples"
Question:
X and y each have 44980 rows; the groups object I pass has length 3.
The data is a time series. X contains the features plus the item being ranked:
Date | Item | Feature |
---|---|---|
9/27 | 1 | 1 |
9/27 | 2 | 1 |
9/27 | 3 | 1 |
9/28 | 1 | 0 |
9/28 | 2 | 0 |
9/28 | 3 | 0 |
y contains the rank of the item:
Date | Rank |
---|---|
9/27 | 3 |
9/27 | 2 |
9/27 | 1 |
9/28 | 2 |
9/28 | 3 |
9/28 | 1 |
But when I run LGBMRanker on this data, I get the following error:
Traceback (most recent call last):
  File "code.py", line 62, in <module>
    score = cross_val_score(model, X=X, y=y,
  File "sklearn\model_selection\_validation.py", line 515, in cross_val_score
    cv_results = cross_validate(
  File "sklearn\model_selection\_validation.py", line 252, in cross_validate
    X, y, groups = indexable(X, y, groups)
  File "sklearn\utils\validation.py", line 429, in indexable
    check_consistent_length(*result)
  File "sklearn\utils\validation.py", line 383, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [44980, 44980, 3]
Code:
import pandas as pd
from lightgbm import LGBMRanker
from sklearn.metrics import make_scorer, ndcg_score
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

paths_dict = {'1': '../../1.csv',
              '2': '../../2.csv',
              '3': '../../3.csv',}

def load_paths(paths_dict):
    df = pd.DataFrame()
    for key, value in paths_dict.items():
        df[key] = pd.read_csv(value, index_col=0, parse_dates=True)['Close']
    df = df.iloc[::-1]
    return df

df = load_paths(paths_dict)
df = df.stack().reset_index()
df.columns = ['Date', 'Item', 'Target']
df['Item'] = df['Item'].astype('int')
df['Target'] = df.groupby('Date')['Target'].rank('dense', ascending=False).astype(int)
df.set_index('Date', inplace=True)
y = df['Target']
X = df.drop(['Target'], axis=1)
model = LGBMRanker(n_jobs=-1)
score = cross_val_score(model, X=X, y=y,
                        groups=X.groupby('Item'),
                        cv=TimeSeriesSplit(n_splits=24),
                        scoring=make_scorer(ndcg_score))
Answers:
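scikit-learn's cross-validation helpers don't support ranking models. `cross_val_score` expects `groups` to be an array with one label per sample, so it has to be the same length as X and y. `X.groupby('Item')` is a GroupBy object of length 3, hence the `[44980, 44980, 3]` in the error. Even with a valid `groups` array, `cross_val_score` would not hand `LGBMRanker.fit` the per-fold query sizes it requires through its `group` argument, so the cross-validation has to be done manually.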
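A manual loop along these lines is one way to do it. This is only a sketch, and it rests on assumptions that are not in the question: each date is treated as one query (so `group` holds the number of rows per date, not per item), the rows of a date sit next to each other, and the labels are relevance grades where larger means better, so a rank with 1 as the best item would need flipping first (for example `y.max() - y`). The helper names `group_sizes`, `date_based_folds` and `cross_validate_ranker` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import ndcg_score
from sklearn.model_selection import TimeSeriesSplit


def group_sizes(dates):
    """Lengths of consecutive runs of equal dates (one run = one query).

    Assumption: all rows of a given date are contiguous.
    """
    dates = np.asarray(dates)
    change = np.flatnonzero(dates[1:] != dates[:-1]) + 1
    bounds = np.concatenate(([0], change, [len(dates)]))
    return np.diff(bounds)


def date_based_folds(dates, n_splits=5):
    """Yield (train_idx, test_idx) row positions without splitting a date in two."""
    dates = np.asarray(dates)
    unique_dates = np.unique(dates)  # sorted, one entry per query
    for train_d, test_d in TimeSeriesSplit(n_splits=n_splits).split(unique_dates):
        yield (np.flatnonzero(np.isin(dates, unique_dates[train_d])),
               np.flatnonzero(np.isin(dates, unique_dates[test_d])))


def cross_validate_ranker(make_model, X, relevance, dates, n_splits=5):
    """Manual time-series CV; returns the mean per-date NDCG of each fold."""
    dates = np.asarray(dates)
    relevance = np.asarray(relevance)
    fold_scores = []
    for train_idx, test_idx in date_based_folds(dates, n_splits):
        model = make_model()  # fresh, unfitted model for every fold
        # `group` is a list of query sizes, not one label per row.
        model.fit(X.iloc[train_idx], relevance[train_idx],
                  group=group_sizes(dates[train_idx]))
        pred = np.asarray(model.predict(X.iloc[test_idx]))
        test_dates = dates[test_idx]
        test_rel = relevance[test_idx]
        # ndcg_score wants one row per query, so score each date separately.
        per_date = [
            ndcg_score([test_rel[test_dates == d]], [pred[test_dates == d]])
            for d in np.unique(test_dates)
        ]
        fold_scores.append(float(np.mean(per_date)))
    return np.array(fold_scores)
```

With LightGBM installed, `make_model` could be `lambda: LGBMRanker(n_jobs=-1)`, and the call would look like `cross_validate_ranker(make_model, X, relevance, X.index)`. NDCG is computed per date because `ndcg_score` expects one row per query, which is also why wrapping it in `make_scorer` does not work on the flat 1-D predictions.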