ValueError: Shape of passed values is (2270, 2), indices imply (2270, 4)

Question:

I am following a YouTube video (https://www.youtube.com/watch?v=VtRLrQ3Ev-U) about training a model with a dataset (https://colab.research.google.com/drive/1UxmeNX_MaIO0ni26cg9H6mtJcRFafWiR?usp=sharing#scrollTo=sOqdGfTza-Gl).

In this code:

X = df[df.columns[3:4]].values
y = df[df.columns[-1]].values

I used the positions "3:4" because that is the only column with numeric values (in my own dataset).
After this, I execute the following code:

over = RandomOverSampler()
X, y = over.fit_resample(X, y)
data = np.hstack((X, np.reshape(y, (-1,1))))
transformed_df = pd.DataFrame(data, columns=df.columns)

But when I execute it in Colab, I receive this:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-222-eaff3b92a197> in <module>()
      1 data = np.hstack((X, np.reshape(y, (-1,1))))
----> 2 transformed_df = pd.DataFrame(data, columns=df.columns)

2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index, columns)
    391         passed = values.shape
    392         implied = (len(index), len(columns))
--> 393         raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
    394 
    395 

ValueError: Shape of passed values is (2270, 2), indices imply (2270, 4)

Can somebody help me fix this?

Thanks in advance!

Adding all the information!

Purpose of the AI: I give it a text and, based on 1000 lines of my dataset, it should automatically detect whether I want to review it or not, and classify it with a category if possible.

Dataset example:
Category|Sector|News|review

Malware|Other sector|Joker malware infects over 500 000 Huawei Android devices – More than 500,000 Huawei users have downloaded from the company’s official Android store applications infected with Joker malware that subscribes to premium mobile services.Researchers found ten seemingly harmless apps in AppGallery that contained code for connecting to malicious command and control server to receive configurations and additional components.Masked by functional appsA report from antivirus maker Doctor Web notes that the malicious apps retained their advertised functionality but downloaded components that subscribed users to premium mobile services.To keep users in the dark the infected apps requested access to notifications, which allowed them to intercept confirmation codes delivered over SMS by the subscription service.According to the researchers, the malware could subscribe a user to a maximum of five services, although the threat actor could modify this limitation at any time.The list of malicious applications included virtual keyboards, a camera app, a launcher, an online messenger, a sticker collection, coloring programs, and a game.Most of them came from one developer (Shanxi Kuailaipai Network Technology Co., Ltd.) and two from a different one. These ten apps were downloaded by more than 538,000 Huawei users, Doctor Web says.Doctor Web informed Huawei of these apps and the company removed them from AppGallery. While new users can no longer download them, those that already have the apps running on their devices need to run a manual cleanup. 
The table below lists the name name of the application and its package:Application name Package name Super Keyboard com.nova.superkeyboard Happy Colour com.colour.syuhgbvcff Fun Color com.funcolor.toucheffects New 2021 Keyboard com.newyear.onekeyboard Camera MX – Photo Video Camera com.sdkfj.uhbnji.dsfeff BeautyPlus Camera com.beautyplus.excetwa.camera Color RollingIcon com.hwcolor.jinbao.rollingicon Funney Meme Emoji com.meme.rouijhhkl Happy Tapping com.tap.tap.duedd All-in-One Messenger com.messenger.sjdoifoThe researchers say that the same modules downloaded by the infected apps in AppGallery were also present in other apps on Google Play, used by other versions of Joker malware. The full list of indicators of compromise is available here.Once active, the malware communicates to its remote server to get the configuration file, which contains a list of tasks, websites for premium services, JavaScript that mimics user interaction.Joker malware’s history goes as far back as 2017 and constantly found its way in apps distributed through Google Play store. In October 2019, Tatyana Shishkova, Android malware analyst at Kaspersky, tweeted about more than 70 compromised apps that had made it into the official store.And the reports about the malware in Google Play kept coming. In early 2020, Google announced that since 2017, it had removed about 1,700 apps infected with Joker.Last February, Joker was still present in the store and it continued to slip past Google’s defenses even in July last year.|0

Full code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from imblearn.over_sampling import RandomOverSampler

df = pd.read_csv('news_parsed_end.csv', delimiter='|', error_bad_lines=False)

X = df[df.columns[3:4]].values
y = df[df.columns[-1]].values

over = RandomOverSampler()
X, y = over.fit_resample(X, y)
data = np.hstack((X, np.reshape(y, (-1,1))))
transformed_df = pd.DataFrame(data, columns=df.columns)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.6, random_state=1000)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.2, random_state=1000)

model = tf.keras.Sequential([
                             tf.keras.layers.Dense(16, activation='relu'), # if x <= 0 --> 0, x > 0 --> x
                             tf.keras.layers.Dense(16, activation='relu'),
                             tf.keras.layers.Dense(1, activation="sigmoid")
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

model.evaluate(X_train, y_train)
model.evaluate(X_valid, y_valid)

model.fit(X_train, y_train, batch_size=16, epochs=200, validation_data=(X_valid, y_valid))

model.evaluate(X_test, y_test)

Solution:

It worked thanks to @hpaulj !

I added the following:

data = np.hstack((X, np.reshape(y, (-1,1))))
data = data.reshape(1135,4)
data.shape

And now I can do this:

transformed_df = pd.DataFrame(data, columns=df.columns)
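
A small check of what that reshape does to the rows may be useful here. The sketch below uses a toy (4, 2) array in place of the real (2270, 2) data; the values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stacked (X, y) array: 4 samples, 2 columns.
# The numbers are invented; only the shapes mirror the question.
data = np.array([[10, 0],
                 [11, 1],
                 [12, 0],
                 [13, 1]])

# Same idea as data.reshape(1135, 4): halve the rows, double the columns.
reshaped = data.reshape(2, 4)

frame = pd.DataFrame(reshaped, columns=["Category", "Sector", "News", "review"])
print(frame)
```

The constructor no longer raises, but each row of the result appears to hold two original samples side by side, so the four column labels no longer describe the values under them. If one sample per row is wanted, passing only as many column names as the stacked array has columns is probably the safer route.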
Asked By: MagiCs ito


Answers:

Here’s an example of what I think is happening, though I don’t know if you understand enough numpy and pandas to apply it to your case. Often people take some tutorial (or worse yet, a video) and try to use their own data without much understanding of what’s going on.

Anyway, let’s make a 4-column frame:

In [133]: arr = np.arange(12).reshape(3,4); df = pd.DataFrame(arr, columns=['a','b','c','d'])

In [134]: df
Out[134]: 
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Now use hstack to combine two columns (df[['b','d']] would have worked just as well):

In [136]: x = np.hstack([df[['b']],df[['d']]])

In [137]: x
Out[137]: 
array([[ 1,  3],
       [ 5,  7],
       [ 9, 11]])

The key is that it is a 2-column array, shape (3,2).

If I try to make a frame from that as you do:

In [139]: pd.DataFrame(x, columns=df.columns)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [139], in <cell line: 1>()
----> 1 pd.DataFrame(x, columns=df.columns)

File ~\anaconda3\lib\site-packages\pandas\core\frame.py:694, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    684         mgr = dict_to_mgr(
    685             # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
    686             # attribute "name"
   (...)
    691             typ=manager,
    692         )
    693     else:
--> 694         mgr = ndarray_to_mgr(
    695             data,
    696             index,
    697             columns,
    698             dtype=dtype,
    699             copy=copy,
    700             typ=manager,
    701         )
    703 # For data is list-like, or Iterable (will consume into list)
    704 elif is_list_like(data):

File ~\anaconda3\lib\site-packages\pandas\core\internals\construction.py:351, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    346 # _prep_ndarray ensures that values.ndim == 2 at this point
    347 index, columns = _get_axes(
    348     values.shape[0], values.shape[1], index=index, columns=columns
    349 )
--> 351 _check_values_indices_shape_match(values, index, columns)
    353 if typ == "array":
    355     if issubclass(values.dtype.type, str):

File ~\anaconda3\lib\site-packages\pandas\core\internals\construction.py:422, in _check_values_indices_shape_match(values, index, columns)
    420 passed = values.shape
    421 implied = (len(index), len(columns))
--> 422 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (3, 2), indices imply (3, 4)

Note the same sort of error. There’s a mismatch between the x shape, (3,2), and the shape implied by the 4 labels in df.columns, (3,4).
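
The mismatch can be seen before calling the constructor by comparing the two numbers directly. This is only a minimal sketch reusing the toy frame from above; the variable names `n_data_cols`, `n_labels` and `shapes_match` are made up for the illustration.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, columns=["a", "b", "c", "d"])

# Two single-column selections stacked side by side, as in the example above.
x = np.hstack([df[["b"]].to_numpy(), df[["d"]].to_numpy()])

n_data_cols = x.shape[1]      # columns actually present in the array
n_labels = len(df.columns)    # column labels that would be passed to DataFrame
shapes_match = n_data_cols == n_labels

print(x.shape, n_labels, shapes_match)
```

When the two counts differ, `pd.DataFrame(x, columns=df.columns)` raises the ValueError shown in the traceback.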

If instead I select a subset of the columns, the same numbers as used for the hstack, it works:

In [140]: pd.DataFrame(x, columns=df.columns[[1,3]])
Out[140]: 
   b   d
0  1   3
1  5   7
2  9  11

2 columns, (n,2) data. It’s all about the array shapes. You won’t get far with pandas or numpy if you don’t pay attention to shapes.
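
Applied to the code in the question, the same idea might look like the sketch below. It assumes a 4-column frame with the header from the question (Category, Sector, News, review) and uses invented rows; the oversampling step is left out because it does not change the number of columns. Note that in such a frame `df.columns[3:4]` and `df.columns[-1]` both pick the last column, so the `_feature` suffix is a hypothetical naming choice to keep the two labels distinct.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's frame; the values are invented for illustration.
df = pd.DataFrame({
    "Category": ["Malware", "Phishing", "Malware"],
    "Sector": ["Other sector", "Finance", "Other sector"],
    "News": ["text a", "text b", "text c"],
    "review": [0, 1, 0],
})

feature_cols = list(df.columns[3:4])   # ['review'], same slice as in the question
label_col = df.columns[-1]             # 'review' again in a 4-column frame

X = df[feature_cols].values            # shape (3, 1)
y = df[label_col].values               # shape (3,)
data = np.hstack((X, np.reshape(y, (-1, 1))))   # shape (3, 2)

# One label per stacked column, instead of all four labels from df.columns.
names = [f"{c}_feature" for c in feature_cols] + [label_col]
transformed_df = pd.DataFrame(data, columns=names)
print(transformed_df.shape)
```

The number of names now follows the number of stacked columns, so the constructor has nothing to complain about, and each row still corresponds to one sample.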

Answered By: hpaulj