How to keep null values when writing to csv How to keep null values when writing to csv postgresql postgresql

How to keep null values when writing to csv


You have two options here: change the csv.writing quoting option in Python, or tell PostgreSQL to accept quoted strings as possible NULLs (requires PostgreSQL 9.4 or newer)

Python csv.writer() and quoting

On the Python side, you are telling the csv.writer() object to add quotes, because you configured it to use csv.QUOTE_NONNUMERIC:

Instructs writer objects to quote all non-numeric fields.

None values are non-numeric, so result in "" being written.

Switch to using csv.QUOTE_MINIMAL or csv.QUOTE_NONE:

csv.QUOTE_MINIMAL
Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.

csv.QUOTE_NONE
Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character.

Since all you are writing is longitude and latitude values, you don't need any quoting here, there are no delimiters or quotecharacters present in your data.

With either option, the CSV output for None values is simple an empty string:

>>> import csv>>> from io import StringIO>>> def test_csv_writing(rows, quoting):...     outfile = StringIO()...     csv_writer = csv.writer(outfile, delimiter=',', quoting=quoting)...     csv_writer.writerows(rows)...     return outfile.getvalue()...>>> rows = [...     [42.313270000, -71.116240000],...     [42.377010000, -71.064770000],...     [None, None],... ]>>> print(test_csv_writing(rows, csv.QUOTE_NONNUMERIC))42.31327,-71.1162442.37701,-71.06477"","">>> print(test_csv_writing(rows, csv.QUOTE_MINIMAL))42.31327,-71.1162442.37701,-71.06477,>>> print(test_csv_writing(rows, csv.QUOTE_NONE))42.31327,-71.1162442.37701,-71.06477,

PostgreSQL 9.4 COPY FROM, NULL values and FORCE_NULL

As of PostgreSQL 9.4, you can also force PostgreSQL to accept quoted empty strings as NULLs, when you use the FORCE_NULL option. From the COPY FROM documentation:

FORCE_NULL

Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.

Naming the columns in a FORCE_NULL option lets PostgreSQL accept both the empty column and "" as NULL values for those columns, e.g.:

COPY position (    lon,     lat) FROM "filename"WITH (    FORMAT csv,    NULL '',    DELIMITER ',',    FORCE_NULL(lon, lat));

at which point it doesn't matter anymore what quoting options you used on the Python side.

Other options to consider

For simple data transformation tasks from other databases, don't use Python

If you already querying databases to collate data to go into PostgreSQL, consider directly inserting into Postgres. If the data comes from other sources, using the foreign data wrapper (fdw) module lets you cut out the middle-man and directly pull data into PostgreSQL from other sources.

Numpy data? Consider using COPY FROM as binary, directly from Python

Numpy data can more efficiently be inserted via binary COPY FROM; the linked answer augments a numpy structured array with the required extra metadata and byte ordering, then efficiently creates a binary copy of the data and inserts it into PostgreSQL using COPY FROM STDIN WITH BINARY and the psycopg2.copy_expert() method. This neatly avoids number -> text -> number conversions.

Persisting data to handle large datasets in a pipeline?

Don't re-invent the data pipeline wheels. Consider using existing projects such as Apache Spark, which have already solved the efficiency problems. Spark lets you treat data as a structured stream, and includes the infrastructure to run data analysis steps in parallel, and you can treat distributed, structured data as Pandas dataframes.

Another option might be to look at Dask to help share datasets between distributed tasks to process large amounts of data.

Even if converting an already running project to Spark might be a step too far, at least consider using Apache Arrow, the data exchange platform Spark builds on top of. The pyarrow project would let you exchange data via Parquet files, or exchange data over IPC.

The Pandas and Numpy teams are quite heavily invested in supporting the needs of Arrow and Dask (there is considerable overlap in core members between these projects) and are actively working to make Python data exchange as efficient as possible, including extending Python's pickle module to allow for out-of-band data streams to avoid unnecessary memory copying when sharing data.


your code

for row in self.cursor:    csv_writer.writerow(row)

uses writer as-is, but you don't have to do that. You can filter the values to change some particular values with a generator comprehension and a ternary expression

for row in self.cursor:    csv_writer.writerow("null" if x is None else x for x in row)


You are asking for csv.QUOTE_NONNUMERIC. This will turn everything that is not a number into a string. You should consider using csv.QUOTE_MINIMAL as it might be more what you are after:

Test Code:

import csvtest_data = (None, 0, '', 'data')for name, quotes in (('test1.csv', csv.QUOTE_NONNUMERIC),                     ('test2.csv', csv.QUOTE_MINIMAL)):    with open(name, mode='w') as outfile:        csv_writer = csv.writer(outfile, delimiter=',', quoting=quotes)        csv_writer.writerow(test_data))

Results:

test1.csv:

"",0,"","data"

test2.csv:

,0,,data