Compare two CSV files and search for similar items

python csv compare

The answer by srgerg is terribly inefficient, as it operates in quadratic time; here is a linear time solution instead, using Python 2.6-compatible syntax:

import csvwith open('masterlist.csv', 'rb') as master:    master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))with open('hosts.csv', 'rb') as hosts:    with open('results.csv', 'wb') as results:            reader = csv.reader(hosts)        writer = csv.writer(results)        writer.writerow(next(reader, []) + ['RESULTS'])        for row in reader:            index = master_indices.get(row[3])            if index is not None:                message = 'FOUND in master list (row {})'.format(index)            else:                message = 'NOT FOUND in master list'            writer.writerow(row + [message])

This produces a dictionary, mapping signatures from masterlist.csv to a line number first. Lookups in a dictionary take constant time, making the second loop over hosts.csv rows independant from the number of rows in masterlist.csv. Not to mention code that's a lot simpler.

For those using Python 3, the above only needs to have the open() calls adjusted to open in text mode (remove the b from the file mode), and you want to add new line='' so the CSV reader can take control of line separators. You may want to state the encoding to use explicitly rather than rely on your system default (use encoding=...). The master_indices mapping can be built with a dictionary comprehension ({r[1]: i for i, r in enumerate(csv.reader(master))}).

python csv compare

Edit: While my solution works correctly, check out Martijn's answer below for a more efficient solution.

You can find the documentation for the python CSV module here.

What you're looking for is something like this:

import csvf1 = file('hosts.csv', 'r')f2 = file('masterlist.csv', 'r')f3 = file('results.csv', 'w')c1 = csv.reader(f1)c2 = csv.reader(f2)c3 = csv.writer(f3)masterlist = list(c2)for hosts_row in c1:    row = 1    found = False    for master_row in masterlist:        results_row = hosts_row        if hosts_row[3] == master_row[1]:            results_row.append('FOUND in master list (row ' + str(row) + ')')            found = True            break        row = row + 1    if not found:        results_row.append('NOT FOUND in master list')    c3.writerow(results_row)f1.close()f2.close()f3.close()

python csv compare

Python's CSV and collections module, specifically OrderedDict, are really helpful here. You want to use OrderedDict to preserve the order of the keys, etc. You don't have to, but it's useful!

import csvfrom collections import OrderedDictsignature_row_map = OrderedDict()with open('hosts.csv') as file_object:    for line in csv.DictReader(file_object, delimiter='\t'):        signature_row_map[line['Signature']] = {'line': line, 'found_at': None}with open('masterlist.csv') as file_object:    for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):        if line['Signature'] in signature_row_map:            signature_row_map[line['Signature']]['found_at'] = iwith open('newhosts.csv', 'w') as file_object:    fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']    writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')    writer.writer.writerow(fieldnames)    for signature_info in signature_row_map.itervalues():        result = '{0} FOUND in masterlist {1}'        # explicit check for sentinel        if signature_info['found_at'] is not None:            result = result.format('', '(row %s)' % signature_info['found_at'])        else:            result = result.format('NOT', '')        payload = signature_info['line']        payload['RESULTS'] = result        writer.writerow(payload)

Here's the output using your test CSV files:

Path    Filename        Size    Signature       RESULTSC:\     a.txt   14kb    012345  NOT FOUND in masterlist D:\     b.txt   99kb    678910   FOUND in masterlist (row 1)C:\     c.txt   44kb    111213   FOUND in masterlist (row 2)

Please excuse the misalignment, they are tab separated :)

CodeHunter

Compare two CSV files and search for similar items

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last