Vectorizing to find current number of cars in Python – loop free

Question:

I wish to do this without a loop (for speed and learning how to).

To find out how many cars there currently are in the market, imagine you have sales numbers for all years from 1923 up until today.

This is the case for 5 different countries.

For all countries I also have one shared decay vector, since the cars at some point stop working. The vector contains the share of cars that break down and are removed from the market at the given number of years after production.

It could look like this:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(100,1000,size=(100, 5)), columns=list('ABCDE'))

break_vector = np.random.dirichlet(np.ones(100),size=1)

The break_vector sums to 1, since we can assume that none of these cars survive more than 100 years on the road.

What we want to calculate is the number of REMAINING cars in each country year by year, so the calculation is basically for each market to:

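Take the number of cars sold in 1923 from df and multiply it by break_vector to get how many of those cars break down in each later year.
In 1924, we need to do the same for the cars sold in 1924, AND also account for the cars from 1923: the sum of the first two places in break_vector times the number of cars sold in 1923 is the number of 1923 cars that have broken down by then, and the rest are still on the road. And so on.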
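To make the rule above concrete, here is a minimal sketch with made-up numbers for three years. It assumes that the remaining cars are each year's sales times the share that has not yet broken down, which is what the loop further down appears to compute. The names still_running and sold_in are only illustrative, and the calculation is deliberately written as a plain loop.

```python
import numpy as np

sales = np.array([100, 200, 300])          # made-up sales for three years
break_vector = np.array([0.5, 0.3, 0.2])   # made-up share breaking down at age 0, 1, 2

# Share of a yearly cohort that is still on the road at age 0, 1, 2
still_running = 1.0 - np.cumsum(break_vector)

remaining = np.zeros(len(sales))
for year in range(len(sales)):
    for sold_in in range(year + 1):
        age = year - sold_in
        remaining[year] += sales[sold_in] * still_running[age]

print(remaining)  # approximately [ 50. 120. 190.]
```

For example, in the second year 20 of the first year's 100 cars are left (0.5 + 0.3 of them have broken down) plus 100 of the 200 new cars, which gives 120.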

Then we end up with a single dataframe that contains the number of cars still driving around in each country for each year: the countries are the columns and the index is the year.

What I want is a dataframe that contains the information on how many cars are driving around in country A in any given year.

I have done it in a loop, but how would I do it using vectors, and hopefully get code that is both easier to debug and faster?

I tried to do it in a loop, and it works (that is, it does what I described above). But at a larger scale, it would be great to see how this could be done with vectors/matrices, and how much faster it would be with, for example, 50 countries.

import pandas as pd
import numpy as np
# Create the data
df = pd.DataFrame(np.random.randint(100,1000,size=(100, 5)), columns=list('ABCDE'))
break_vector = np.random.dirichlet(np.ones(100),size=1)
# Empty dataframe to store results
all_markets = pd.DataFrame()
# The countries to include
countries = list('ABCDE')
# Loop over countries, sales years and ages
for country in countries:
    # d[row, col] = cars sold in year `col` that break down in year `row`
    # (200 rows are enough: the largest row index is 99 + 99 = 198)
    zero_data = np.zeros(shape=(200,100))
    d = pd.DataFrame(zero_data)
    sold = df[country].cumsum()
    for sales in range(0,100):
        sales_year=df.iloc[sales,countries.index(country)]
        for breaks in range(0,100):
            breakdowns = sales_year*break_vector[0,breaks]
            d.iloc[breaks+sales,sales]=breakdowns
    # Remaining in market: cumulative sales minus cumulative breakdowns (row sums)
    result1 = sold-d.sum(axis=1).cumsum()[0:100]
    all_markets[country] = result1

Asked By: jkl841


Answers:

The point of my solution is that when you multiply break_vector (called survival in the code below) by the reversed yearly sales, you get a matrix whose diagonal sums are the cars that break down in each year, where offset=0 is the most recent year and offset=99 is the oldest one:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randint(100,1000,size=(100, 5)), columns=list('ABCDE'))

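age = np.arange(0,100)
# Despite its name, this plays the role of break_vector (share breaking down per age), as a column
survival = np.random.dirichlet(np.ones(100),1).T
# Entry [i, j] = share breaking down at age i times the sales of year 99-j
broken_cars_matrix = survival * np.flip(df.A.values)
# The diagonal with offset 99-t sums to the cars that break down in year t
broken_cars = np.cumsum([np.trace(broken_cars_matrix, i) for i in reversed(age)])
remaining = np.cumsum(df.A) - broken_cars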
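As far as I can tell, the diagonal sums above are the same thing as a discrete convolution of the sales series with break_vector, so np.convolve truncated to the first 100 entries should give the same yearly breakdowns without the list comprehension. The following is a minimal sketch under that assumption. The seeded generator and the names sales, broken_by_trace and broken_by_conv are not from the original code and are only there to make the comparison reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)               # seeded only so the sketch is reproducible
sales = rng.integers(100, 1000, size=100)    # stand-in for df.A.values
break_vector = rng.dirichlet(np.ones(100))   # 1-D, sums to 1
n = len(sales)

# Diagonal-sum approach: offset n-1-t holds the breakdowns of year t
broken_cars_matrix = break_vector[:, None] * np.flip(sales)
broken_by_trace = np.array(
    [np.trace(broken_cars_matrix, k) for k in reversed(range(n))]
)

# Convolution approach: entry t is sum over s <= t of sales[s] * break_vector[t - s]
broken_by_conv = np.convolve(sales, break_vector)[:n]

# Cars still on the road: cumulative sales minus cumulative breakdowns
remaining = np.cumsum(sales) - np.cumsum(broken_by_conv)

print(np.allclose(broken_by_trace, broken_by_conv))
```

If the two arrays agree, the convolution can replace the trace loop for a single country.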

You can visualize the solution using:

plt.plot(df.A)
plt.plot(np.cumsum(df.A))
plt.plot(broken_cars)
plt.plot(remaining)
plt.show()

I would do the iteration per country with a loop, but it could also be vectorized using 3D matrices in the same way.
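One possible way to handle all countries at once, sketched here as an alternative to the 3D approach rather than an implementation of it, is to build a single lower-triangular matrix of "share still on the road" values and multiply it by the whole sales table. The names still_running and survival_matrix are hypothetical, the seeded generator is an assumption for reproducibility, and the sketch assumes the same break_vector applies to every country.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.integers(100, 1000, size=(100, 5)), columns=list("ABCDE"))
break_vector = rng.dirichlet(np.ones(100))  # 1-D, sums to 1

n = len(df)
# Share of a yearly cohort that is still on the road at age 0..n-1
still_running = 1.0 - np.cumsum(break_vector)

# age[t, s] = t - s: age in year t of the cars sold in year s (negative = not sold yet)
age = np.arange(n)[:, None] - np.arange(n)[None, :]
survival_matrix = np.where(age >= 0, still_running[np.clip(age, 0, n - 1)], 0.0)

# One matrix product gives the remaining cars for every year and every country
remaining = pd.DataFrame(
    survival_matrix @ df.to_numpy(), index=df.index, columns=df.columns
)
print(remaining.head())
```

The matrix is n by n regardless of the number of countries, so going from 5 to 50 countries only widens the right-hand side of the product.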

Answered By: Ziur Olpa