
Python. Pandas. BigData. Messy TSV file. How to wrangle the data?


Suppose you had TSV data such as this:

status=A    protocol=B  region_name=C   datetime=D  user_ip=E   user_agent=F    user_id=G
user_id=G   status=A    region_name=C   user_ip=E   datetime=D  user_agent=F    protocol=B
protocol=B  datetime=D  status=A    user_ip=E   user_agent=F    user_id=G

The order of the fields may be scrambled, and there may be missing values. However, you don't have to drop rows just because the fields don't appear in a certain order. You can use the field names embedded in the row data itself to place the values in the correct columns. For example,

import pandas as pd

df = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python')
df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
df.columns = ['field', 'value']
df = df.set_index('field', append=True)
df.index = df.index.droplevel(level=1)
df = df['value'].unstack(level=1)
print(df)

yields

field datetime protocol region_name status user_agent user_id user_ip
index                                                                
0            D        B           C      A          F       G       E
1            D        B           C      A          F       G       E
2            D        B        None      A          F       G       E

To handle a large TSV file, you could process rows in chunks, and then concatenate the processed chunks into one DataFrame at the end:

import pandas as pd

chunksize = 10 ** 5  # the number of rows to be processed per iteration (example value)
dfs = []
reader = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python',
                       iterator=True, chunksize=chunksize)
for df in reader:
    df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
    df.columns = ['field', 'value']
    df.index.names = ['index', 'num']
    df = df.set_index('field', append=True)
    df.index = df.index.droplevel(level='num')
    df = df['value'].unstack(level=1)
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
print(df)
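If even the concatenated result would not fit in memory, a further variation is to write each processed chunk straight to disk instead of collecting the DataFrames in a list. This is only a sketch: the output path data/clean.tsv and the fixed fields list (used so every chunk is written with the same columns) are assumptions, not part of the code above.

import pandas as pd

fields = ['datetime', 'protocol', 'region_name', 'status',
          'user_agent', 'user_id', 'user_ip']  # assumed full set of field names
chunksize = 10 ** 5  # example value; tune to the available memory

reader = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python',
                       iterator=True, chunksize=chunksize)
for i, df in enumerate(reader):
    df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
    df.columns = ['field', 'value']
    df.index.names = ['index', 'num']
    df = df.set_index('field', append=True)
    df.index = df.index.droplevel(level='num')
    # reindex so every chunk has the same columns in the same order
    df = df['value'].unstack(level=1).reindex(columns=fields)
    # append to the output file; write the header only once
    df.to_csv('data/clean.tsv', sep='\t', index=False,
              mode='w' if i == 0 else 'a', header=(i == 0))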

Explanation: Given df:

In [527]: df = pd.DataFrame({0: ['status=A', 'user_id=G', 'protocol=B'],
   .....:                    1: ['protocol=B', 'status=A', 'datetime=D'],
   .....:                    2: ['region_name=C', 'region_name=C', 'status=A'],
   .....:                    3: ['datetime=D', 'user_ip=E', 'user_ip=E'],
   .....:                    4: ['user_ip=E', 'datetime=D', 'user_agent=F'],
   .....:                    5: ['user_agent=F', 'user_agent=F', 'user_id=G'],
   .....:                    6: ['user_id=G', 'protocol=B', None]}); df
Out[527]: 
            0           1              2           3             4             5           6
0    status=A  protocol=B  region_name=C  datetime=D     user_ip=E  user_agent=F   user_id=G
1   user_id=G    status=A  region_name=C   user_ip=E    datetime=D  user_agent=F  protocol=B
2  protocol=B  datetime=D       status=A   user_ip=E  user_agent=F     user_id=G        None

you can coalesce all the values into a single column

In [449]: df.stack()
Out[449]: 
0  0         status=A
   1       protocol=B
   2    region_name=C
   3       datetime=D
   4        user_ip=E
   5     user_agent=F
   6        user_id=G
1  0        user_id=G
   1         status=A
   2    region_name=C
   3        user_ip=E
   4       datetime=D
   5     user_agent=F
   6       protocol=B
2  0       protocol=B
   1       datetime=D
   2         status=A
   3        user_ip=E
   4     user_agent=F
   5        user_id=G
dtype: object

and then apply .str.extract(r'([^=]*)=(.*)') to separate the field name from the value:

In [450]: df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0); df
Out[450]: 
               0  1
0 0       status  A
  1     protocol  B
  2  region_name  C
  3     datetime  D
  4      user_ip  E
  5   user_agent  F
  6      user_id  G
1 0      user_id  G
  1       status  A
  2  region_name  C
  3      user_ip  E
  4     datetime  D
  5   user_agent  F
  6     protocol  B
2 0     protocol  B
  1     datetime  D
  2       status  A
  3      user_ip  E
  4   user_agent  F
  5      user_id  G

To make it easier to reference parts of the DataFrame, let's give the columns and index levels descriptive names:

In [530]: df.columns = ['field', 'value']; df.index.names = ['index', 'num']; df
Out[530]: 
                 field value
index num                   
0     0         status     A
      1       protocol     B
...

Now if we move the field column into the index:

In [531]: df = df.set_index('field', append=True); df
Out[531]: 
                      value
index num field            
0     0   status          A
      1   protocol        B
      2   region_name     C
      3   datetime        D
...

and drop the num index level:

In [532]: df.index = df.index.droplevel(level='num'); df
Out[532]: 
                  value
index field            
0     status          A
      protocol        B
      region_name     C
      datetime        D
...

then we can obtain a DataFrame of the desired form by moving the field index level into the column index:

In [533]: df = df['value'].unstack(level=1); df
Out[533]: 
field datetime protocol region_name status user_agent user_id user_ip
index                                                                
0            D        B           C      A          F       G       E
1            D        B           C      A          F       G       E
2            D        B        None      A          F       G       E


You can use Pandas' vectorized string operations, specifically str.contains:

import numpy as np

# boolean index of rows to keep
is_valid = np.ones(Clean_Data.shape[0], dtype=bool)
for column in Clean_Data.columns:
    # check whether each row's cell contains its column name
    is_valid &= Clean_Data[column].str.contains(column)
# drop rows where not all elements contain their respective column names
Clean_Data.drop(np.where(~is_valid)[0], inplace=True)
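To see how that filter behaves, here is a small, self-contained illustration; the Clean_Data frame below is a made-up stand-in for your data, where the column labels are the field names and each cell still holds the raw field=value string:

import numpy as np
import pandas as pd

# hypothetical stand-in for Clean_Data
Clean_Data = pd.DataFrame({
    'status':   ['status=A', 'user_id=G'],
    'protocol': ['protocol=B', 'status=A'],
})

is_valid = np.ones(Clean_Data.shape[0], dtype=bool)
for column in Clean_Data.columns:
    is_valid &= Clean_Data[column].str.contains(column)

# only the first row survives: in the second row the cells do not
# match their column names
Clean_Data.drop(np.where(~is_valid)[0], inplace=True)
print(Clean_Data)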


I can't add comments, so I'll post this as an answer (in reality it's a comment in response to your comment about memory usage and runtime).

One thing you need to consider for large files (100 GB) is that you are not going to read them into memory. You can set the chunk size in pandas ("Large data" work flows using pandas, How to read a 6 GB csv file with pandas), or read the file line by line with a generator and the csv module (Reading a huge .csv in python).
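A minimal sketch of the generator approach with the csv module (the file path and the dict-per-row output are assumptions, not something prescribed by the linked answers):

import csv

def parse_rows(path):
    """Yield one dict per line, using the field names embedded in each cell."""
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            record = {}
            for cell in row:
                if '=' in cell:
                    field, value = cell.split('=', 1)
                    record[field] = value
            yield record

# the file is consumed one line at a time, so only a single record
# is ever held in memory
for record in parse_rows('data/data.tsv'):
    pass  # process the record, e.g. write it to a database or a clean file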

Incorporating @unutbu's comment about using a regex to sort the entries into columns: since the field names are so clearly demarcated in each cell, a pattern like r'(.*)=(.*)' is all that is required (although some error correction may be needed), and that should be all you need. Also, as they note, dropping entire rows because of some missing data is not a typical, or recommended, approach.
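To make that concrete, here is a small illustrative check of how such a pattern splits a cell into field name and value; the sample cells are invented, and the pattern used is the slightly stricter r'([^=]*)=(.*)' from the answer above, which splits at the first '=' rather than the last:

import re

pattern = re.compile(r'([^=]*)=(.*)')  # splits at the first '='

for cell in ['status=A', 'user_agent=F', 'malformed-cell']:
    match = pattern.match(cell)
    if match:
        field, value = match.groups()
        print(field, '->', value)
    else:
        # this is where the "error correction" would go, for cells
        # that do not follow the field=value format
        print('could not parse:', cell)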