Robust CSV file import with DictReader and chardet

Wondering how to import a CSV file in Python? I’ve got you covered! Read on and learn how to do it with DictReader and chardet.

Python may seem easy, but it is a language of great opportunities, and mastering it takes a lot of practice.

It has a lot of insanely useful libraries, and csv (the module that provides the DictReader class) is definitely one of them.

This will be an introductory post so you don’t have to worry about your knowledge of Python. If you’re looking for tips on how to start learning Python, we’ve got something for you as well.

Import CSV file in Python: The absolute basics

You might ask: if it’s that easy, then why should I even read this?

Well… it’s easy, but it might also be a bit confusing because of the number of options available.

Moreover, validating columns and detecting whether the file is a valid CSV file are not built-in functionalities of the csv library, so after the introduction I will describe those as well.

As I mentioned earlier, parsing the file is pretty simple:

import csv                                   # import the csv module
with open('example.csv') as csv_file:        # open example.csv as csv_file and iterate over rows
    reader = csv.DictReader(csv_file, delimiter=';')
    for row in reader:
        print row.get('Col 1') + ' ' + row.get('Col 2') # print out values of Col 1 and Col 2

… and that’s all!

Yup, it’s that easy. Well, at least the basics. We open the file, read all lines, get columns, profit!
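The snippet above uses Python 2 syntax. For readers on Python 3, here is a minimal sketch of the same loop. It assumes the same ';' delimiter and column names, and it reads a made-up in-memory sample (SAMPLE and read_pairs are hypothetical names, not part of the csv module) so that it runs without example.csv on disk:

```python
import csv
import io

# Hypothetical sample standing in for example.csv (made-up data).
SAMPLE = "Col 1;Col 2\nfoo;bar\nbaz;qux\n"

def read_pairs(csv_file):
    """Return 'Col 1' and 'Col 2' of every row, joined by a space."""
    # Without fieldnames, DictReader uses the first line as the dictionary keys.
    reader = csv.DictReader(csv_file, delimiter=';')
    return [row['Col 1'] + ' ' + row['Col 2'] for row in reader]

for line in read_pairs(io.StringIO(SAMPLE)):
    print(line)
```

With a real file you would pass the file object instead, e.g. opened with open('example.csv', newline=''), which is what the csv documentation recommends on Python 3.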

What you might want to do besides that is:

  • check whether that really is a CSV file or not
  • validate that only the columns you want are there
  • check the encoding

Validation

csv has a class called “Sniffer” that, given a portion of the file, tries to deduce its dialect (delimiter, quoting and so on). When it can’t, it raises csv.Error, and that’s probably what you’re looking for. Keep in mind that this is a heuristic, so a successful sniff suggests, rather than proves, that the file is valid CSV:

import csv
with open('example.csv') as csv_file:
    try:
        csv.Sniffer().sniff(csv_file.read(1024))  # take a 1024B (max) portion of the file and try to get the Dialect
        csv_file.seek(0)
    except csv.Error:
        print 'I did not expect the Spanish Inquisition!'
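
The Sniffer can also tell you which delimiter it found. Below is a small Python 3 sketch of that idea; sniff_delimiter and its candidates argument are hypothetical names of mine, while the delimiters parameter of sniff() comes from the csv module. The sample data is made up:

```python
import csv

def sniff_delimiter(sample, candidates=';,\t'):
    """Return the detected delimiter, or None if the Sniffer gives up."""
    try:
        # Only characters listed in candidates can be chosen as the delimiter.
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    except csv.Error:
        return None
    return dialect.delimiter

print(sniff_delimiter("Col 1;Col 2\nfoo;bar\nbaz;qux\n"))
```

The returned value could then be passed to DictReader as delimiter instead of hard-coding ';'.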

If you want to check whether the columns provided are what you expect, it gets a tiny bit trickier. First, pass the fieldnames parameter to the DictReader instance (the names will be used as keys for the dictionary):

import csv
with open('example.csv') as csv_file:
    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=['Col 1', 'Col 2'])
    for row in reader:
        print row.get('Col 1') + ' ' + row.get('Col 2')

Running that example, you’ll notice that the first row, which contains the header, is printed as well. That’s because once fieldnames is given, DictReader treats the first line as ordinary data. In general we don’t want that to happen.

Let’s change this:

import csv
with open('example.csv') as csv_file:
    is_first_row = True

    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=['Col 1', 'Col 2'])
    for row in reader:
        if is_first_row:
            is_first_row = False
            continue   # skip the row

        print row.get('Col 1') + ' ' + row.get('Col 2')

Now that we skipped the header, let’s get back to it to check whether it has the column set we want.

Before we do it, I’ll define a small helper function:

from itertools import chain
def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))

Don’t worry, it’s not as complicated as it seems to be.

We go through each element of nested_list: if the item is already a list we keep it, otherwise we wrap it in a one-element list. chain then joins all those lists into a single one, which gives us a nice flattened list (e.g. for [1,2,[3,4]] we’ll get [1,2,3,4]). Note that only one level of nesting is flattened, which is all we need here.
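
To see why the helper is needed, here is a short Python 3 sketch with a made-up header line that has two more columns than we declared. The variable names are mine; the behaviour shown (surplus values collected in a list under the key None, DictReader’s default restkey) is from the csv module:

```python
import csv
import io
from itertools import chain

def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))

# Made-up header line with two columns more than we declared.
header_only = io.StringIO("Col 1;Col 2;Col 3;Col 4\n")
reader = csv.DictReader(header_only, delimiter=';', fieldnames=['Col 1', 'Col 2'])
first_row = next(reader)

# Surplus values end up in a list stored under the key None.
extra_columns = first_row[None]
all_columns = flatten_list(first_row.values())
print(extra_columns)
print(all_columns)
```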

Now we can improve our import yet again:

import csv
from itertools import chain

def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))

with open('example.csv') as csv_file:
    is_first_row = True
    valid_columns = ['Col 1', 'Col 2']

    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=valid_columns)
    for row in reader:
        if is_first_row:
            current_columns = flatten_list(row.values())
            # ^-- if there are columns we don't want, row.values() will return ['Col 1', 'Col 2', ['Col 3', 'Col 4', (...)]]
            if set(valid_columns) != set(current_columns):
                # ^-- compare sets, because when comparing lists the order matters as well
                print 'This is not the file I expected! I quit!'
                break
            is_first_row = False
            continue

        print row.get('Col 1') + ' ' + row.get('Col 2')
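
As an alternative to the flag-and-flatten approach above (a different technique, not the one used in this post), you can leave out fieldnames and compare the header that DictReader reads on its own. A minimal Python 3 sketch, where has_expected_header is a hypothetical helper name:

```python
import csv
import io

def has_expected_header(csv_file, valid_columns, delimiter=';'):
    """Check the header without passing fieldnames to DictReader."""
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    # Accessing fieldnames reads the first line; it stays None for an empty file.
    found = reader.fieldnames or []
    return set(found) == set(valid_columns)

print(has_expected_header(io.StringIO("Col 2;Col 1\nfoo;bar\n"), ['Col 1', 'Col 2']))
```

The trade-off is that the dictionary keys then come from the file itself rather than from your own list, so the check has to happen before you rely on them.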

Encoding detection

As you probably know already, text file content can be represented using different encodings, for example UTF-8, windows-1250, iso-8859-2 etc. In some cases we want to detect that encoding and decode strings so that we can parse them the way we want.

To do that, we’ll use the chardet library (it isn’t available by default, so you need to use either pip or easy_install to get it).
Doing it is (again) pretty easy: all you need to do is read the file content and pass it to chardet.detect():

import chardet

csv_file_raw = csv_file.read()   # csv_file is a file object opened as in the earlier examples
encoding = chardet.detect(csv_file_raw)['encoding']

if not encoding:
    print 'No encoding found for the file! Is it valid?'
elif 'UTF' in encoding.upper():
    encoding = encoding.replace('-', '').lower()   # e.g. 'UTF-8' becomes 'utf8'

The last two lines (removing ‘-’ if the encoding is UTF-* and then making it lowercase) matter if you want to decode a string using that information, because the helper below compares the name against 'utf8'. To do that, I’ll define the last helper function:
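
If you would rather not normalize the name by hand, the standard library can do it. The Python 3 sketch below uses codecs.lookup(), which is a different technique from the replace-and-lowercase approach in this post; normalize_encoding_name is a hypothetical helper name. Note that the canonical name it returns for UTF-8 is 'utf-8', not 'utf8', so any comparison would have to use that spelling:

```python
import codecs

def normalize_encoding_name(name):
    """Return Python's canonical codec name, or None if the codec is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None

print(normalize_encoding_name('UTF-8'))
print(normalize_encoding_name('windows-1250'))
```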

def string_to_utf8(string, source_encoding):
    if source_encoding == 'utf8':
        return string
    else:
        return string.decode(source_encoding).encode("utf8")

Here we assume the target encoding is utf8 (UTF-8), and if the string isn’t in that format, we simply decode it and encode again using UTF-8.
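
On Python 3 the same idea applies to bytes objects, since str no longer has a decode() method. A minimal sketch, assuming the raw content was read in binary mode; bytes_to_utf8 is a hypothetical name and the sample text is made up:

```python
def bytes_to_utf8(raw, source_encoding):
    """Re-encode raw bytes from source_encoding to UTF-8."""
    return raw.decode(source_encoding).encode('utf-8')

# Made-up sample: Polish text encoded as windows-1250.
sample = 'zażółć'.encode('windows-1250')
converted = bytes_to_utf8(sample, 'windows-1250')
print(converted.decode('utf-8'))
```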

Now let’s put together all things we discussed here:

import csv
import chardet
from itertools import chain

def string_to_utf8(string, source_encoding):
    if source_encoding == 'utf8':
        return string
    else:
        return string.decode(source_encoding).encode("utf8")

def string_list_to_utf8(string_list, source_encoding):
    return [string_to_utf8(element, source_encoding) for element in string_list]

def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))

with open('example.csv') as csv_file:
    csv_file_content = csv_file.read()
    csv_file.seek(0)  # rewind, otherwise DictReader has nothing left to read
    encoding = chardet.detect(csv_file_content)['encoding']

    if not encoding:
        raise SystemExit('No encoding found for the file! Is it valid?')

    if 'UTF' in encoding.upper():
        encoding = encoding.replace('-', '').lower()

    is_first_row = True
    valid_columns = ['Col 1', 'Col 2']

    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=valid_columns)
    for row in reader:
        if is_first_row:
            current_columns = string_list_to_utf8(flatten_list(row.values()), encoding)
            if set(valid_columns) != set(current_columns):
                print 'This is not the file I expected! I quit!'
                break
            is_first_row = False
            continue

        print string_to_utf8(row.get('Col 1'), encoding) + ' ' + string_to_utf8(row.get('Col 2'), encoding)

As you can see, I added one more helper function (now really the last one) that converts the column names we want to validate to utf8, to rule out the possibility of our code crashing on the string comparison (if the valid columns are unicode strings).
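
On Python 3 the decoding step can be left to open() itself. The sketch below is only an illustration under that assumption: read_valid_rows is a hypothetical function, the encoding argument stands for whatever chardet.detect() reported, and the demo file is made up so the example runs on its own:

```python
import csv
import os
import tempfile

def read_valid_rows(path, encoding, valid_columns, delimiter=';'):
    """Return rows as dicts, or None if the header does not match valid_columns."""
    # encoding would typically come from chardet.detect(raw_bytes)['encoding'].
    with open(path, encoding=encoding, newline='') as csv_file:
        reader = csv.DictReader(csv_file, delimiter=delimiter)
        if set(reader.fieldnames or []) != set(valid_columns):
            return None
        return [dict(row) for row in reader]

# Made-up demo file written in windows-1250 to exercise the function.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='windows-1250', newline='') as tmp:
    tmp.write('Col 1;Col 2\nzażółć;gęś\n')
rows = read_valid_rows(tmp.name, 'windows-1250', ['Col 1', 'Col 2'])
os.remove(tmp.name)
print(rows)
```

Because the file is decoded while it is read, the rows already contain text strings and no per-value conversion helper is needed.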

And that’s really all there is to importing a CSV file in Python! Isn’t that awesome?

Summary

As you can see, Python is a great language that enables you to solve problems faster and at a lower initial cost.

I hope my code helps you in one way or another as much as it helped me while working on one of our projects. Have a nice day!


About the author

Krzysztof Marciniak


Backend Developer
Programmer, sysadmin-like creature, wannabe-gamedev guy that sometimes does simple 3d modelling. 5th year student of Computing Science at the Faculty of Computing at Poznań University of Technology, currently majoring in Networking and Distributed Systems.
