Pandas scraped data not working in pandas


I think you need to change:

df1.WE=np.where(df3.AL.isin(df1.EW),df1.WE,np.nan)

to

df1.WE=np.where(df1.EW.isin(df2.AL),df1.WE,np.nan)

The problem is that the DataFrames have different lengths with the real data. So the comparison has to start from df1 and test it against the other data: comparing this way returns a mask with the same length as df1, and no error.
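A minimal sketch of why the corrected direction works, using small made-up frames (only the column names EW, WE and AL come from the question; the values are hypothetical). Because the mask is computed from df1.EW, it always has as many entries as df1, however many rows df2 has:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df1 = pd.DataFrame({"EW": [1.40, 2.10, 1.53, 3.30, 2.00],
                    "WE": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"AL": [1.40, 1.53, 9.99]})

# The mask is built from df1.EW, so it has len(df1) entries
# even though df2 has a different number of rows.
mask = df1.EW.isin(df2.AL)
print(len(mask) == len(df1))

# Keep WE where EW also occurs in df2.AL, otherwise set NaN.
df1["WE"] = np.where(mask, df1.WE, np.nan)
print(df1)
```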

With your data:

df1 = pd.read_csv('df1.csv', names=['a','b','c'])
print (df1.head())
                                           a     b  \
0             Ponte Preta U20 v Cruzeiro U20  2.10
1  Fluminense RJ U20 v Defensor Sporting U20  2.00
2              Gremio RS U20 v Palmeiras U20  3.30
3                       Barcelona v Sporting  1.33
4                        Bayern Munich v PSG  2.40

                                                   c
0  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
1  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
2  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
3  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
4  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...

df2 = pd.read_csv('df2.csv', names=['a','b','c', 'd', 'e'])
print (df2.head())
                 a                    b                  c     d  \
0          In-Play      CSKA Moscow U19        Man Utd U19  1.14
1          In-Play  Atletico Madrid U19        Chelsea U19  1.01
2          In-Play         Juventus U19     Olympiakos U19  1.40
3  Starting in 22'       Paris St-G U19  Bayern Munich U19  2.24
4      Today 21:00         Man City U19       Shakhtar U19  2.66

                                                   e
0  https://www.betfair.com.au/exchange/plus/footb...
1  https://www.betfair.com.au/exchange/plus/footb...
2  https://www.betfair.com.au/exchange/plus/footb...
3  https://www.betfair.com.au/exchange/plus/footb...
4  https://www.betfair.com.au/exchange/plus/footb...

Compare the numeric columns, here b and d:

df1.b=np.where(df1.b.isin(df2.d),df1.b,np.nan)
#the first 5 values are NaN
print (df1.head())
                                           a   b  \
0             Ponte Preta U20 v Cruzeiro U20 NaN
1  Fluminense RJ U20 v Defensor Sporting U20 NaN
2              Gremio RS U20 v Palmeiras U20 NaN
3                       Barcelona v Sporting NaN
4                        Bayern Munich v PSG NaN

                                                   c
0  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
1  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
2  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
3  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
4  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...

#check whether any non-NaN values remain in column b
print (df1[df1.b.notnull()])
                                       a      b  \
23                Swindon v Forest Green   1.40
50       Sportivo Barracas v Canuelas FC  13.00
80                              FC Nitra   1.53
81                                   0-0   1.40
83       Cape Town City v Maritzburg Utd   1.53
84         Mamelodi Sundowns v Baroka FC   3.75
90  Dorking Wanderers v Tonbridge Angels   1.53
95             Coalville Town v Stamford   1.40

                                                    c
23  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
50  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
80  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
81  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
83  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
84  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
90  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...
95  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...

Another problem with your test data is that both frames have the same number of rows (4), so the length mismatch never shows up and you get no error.
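To illustrate why equal-sized test data hides the bug, here is a small sketch with invented values (the frame names df_a and df_b are hypothetical). The mask is built from the wrong frame, yet np.where runs without complaint because both frames happen to have 4 rows:

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frames, mimicking equal-sized test data.
df_a = pd.DataFrame({"EW": [1, 2, 3, 4], "WE": [10.0, 20.0, 30.0, 40.0]})
df_b = pd.DataFrame({"AL": [4, 9, 9, 9]})

# This mask describes the rows of df_b, not the rows of df_a.
mask_from_b = df_b.AL.isin(df_a.EW)

# No error is raised, because the lengths happen to match.
result = np.where(mask_from_b, df_a.WE, np.nan)
print(result)
```

Here the first WE value is kept although df_a.EW[0] (1) never appears in df_b.AL, so the absence of an error does not mean the result is what was intended.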


On a side note, I'd recommend using pandas functions with pandas:

df1.loc[~df1.EW.isin(df2.AL), 'WE'] = np.nan
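As a quick check that the two spellings agree, here is a sketch on invented data (values are hypothetical; WE is stored as floats so that assigning NaN does not change the column dtype):

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
df1 = pd.DataFrame({"EW": [1, 2, 3, 4], "WE": [10.0, 20.0, 30.0, 40.0]})
df2 = pd.DataFrame({"AL": [2, 4]})

# numpy version: rebuild the whole column from a mask.
via_where = df1.copy()
via_where["WE"] = np.where(via_where.EW.isin(df2.AL), via_where.WE, np.nan)

# pandas version: overwrite only the rows that do NOT match.
via_loc = df1.copy()
via_loc.loc[~via_loc.EW.isin(df2.AL), "WE"] = np.nan

print(via_loc)
print(via_where.equals(via_loc))
```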


Ok, let's get back to the drawing board. The .loc one-liner above is cleaner, but it does exactly the same thing you're doing with numpy. Let's split your code apart.

1) I highly recommend that you use Jupyter notebooks to play with the data and understand what is going on at each line. Take a look here, for example: https://gist.github.com/Casyfill/f432966ebabd93f4271e27a1e2e76579

So, your df1 has 100 rows and 3 columns. Your df2 has 42 rows and 5 columns.

Now, you create df3 as an empty dataframe (0 rows) with 12 columns (by the way, perhaps you should use more explanatory column names). This step is totally fine, although you don't have to define all columns beforehand.

Let's go to the second line: df3['DAT'] = df2['AA']

Here you basically copy the column from the second dataframe. As df3 didn't have any rows before, this is a totally legitimate operation. By doing that, you create 42 rows in your df3. Again, this line by itself is fine.
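A small sketch of this step, assuming a stand-in df2 with 42 rows and a hypothetical three-column subset of df3's columns (the real df3 has 12):

```python
import pandas as pd

# Stand-in for df2: 42 rows and a column named AA, as in the question.
df2 = pd.DataFrame({"AA": range(42)})

# Empty frame with predefined columns (hypothetical subset of the 12).
df3 = pd.DataFrame(columns=["DAT", "AL", "WE"])
rows_before = len(df3)

# Assigning a Series to an empty frame gives the frame that Series' index.
df3["DAT"] = df2["AA"]
rows_after = len(df3)

print(rows_before, rows_after)
# The columns that were never assigned are filled with NaN.
print(df3["AL"].isna().all())
```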

Now, the last line. Here the logic is the following: first, for each row in df3, we check whether the value of df3.AL is in the df1.EW column. Note that we never assigned df3.AL, so the whole column contains only NaNs, and therefore this check by itself does not make any sense.
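A sketch of what that check returns for a never-filled column (values are invented, and df1.EW is assumed to contain no NaN itself):

```python
import numpy as np
import pandas as pd

# Hypothetical values; df1.EW is assumed to hold no missing entries.
df1 = pd.DataFrame({"EW": [1.40, 2.10, 1.53]})
# A column that was declared but never filled: all NaN.
df3 = pd.DataFrame({"AL": [np.nan] * 5})

mask = df3.AL.isin(df1.EW)
# Every entry is False, so the mask carries no information.
print(mask.any())
```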

Next, let's assume there is something in df3.AL. As we check everything row-wise, the test returns a pd.Series (think: one column) of booleans with 42 rows. Now we try to use this column as a "mask" that decides whether each value of df1.WE is kept or set to NaN. But you can't do that, because df1 has 100 rows, not 42! Hence the error.
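The mismatch can be reproduced with stand-in frames of the same sizes (100 and 42 rows; the values themselves are made up):

```python
import numpy as np
import pandas as pd

# Stand-ins with the row counts from the question; values are invented.
df1 = pd.DataFrame({"EW": np.arange(100.0), "WE": np.arange(100.0)})
df3 = pd.DataFrame({"AL": np.arange(42.0)})

mask = df3.AL.isin(df1.EW)   # 42 booleans, one per row of df3

error = None
try:
    # A 42-element condition cannot be broadcast against 100 values.
    np.where(mask, df1.WE, np.nan)
except ValueError as exc:
    error = exc
    print("np.where failed:", exc)
```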

So you need to redefine what you actually want to do here; from the code alone it's not clear.