Importing and visualizing csv data with Ruby on Rails and Chartkick

Importing csv data is a common and fairly standard task in Rails. Although it might look simple, there are quite a few things to think about when working with csv files. This is a short tutorial covering one approach of importing csv files with Ruby on Rails, and we’ll add some visualization with Chartkick as well. Let’s get started!

1. Setup a rails project to work with.

First, let’s setup a simple rails app with a Postgres database. (assuming you have Postgres installed). And run the command to create the database.

rails new csv_importer -d postgresql
rake db:create

Further setup includes the Chartkick gem and pry-rails, the latter is optional but very useful when playing around with the rails console. Add the gems to your gemfile and run bundle install.

gem 'chartkick'

group :development, :test do
  gem 'pry-rails'
end

2. The dataset

Now we have to find some data to work with. We could use a simple, clean csv that we generate ourselves, but where’s the fun in that? Let’s find something a bit more real world like. In this tutorial we will use the IMDB 5000 movie dataset from Kaggle. Kaggle has lot’s of interesting stuff for everyone that wants to experiment with data, including various free datasets.

We could add the file to our Rails root folder, although in production it’s more likely that you will access and parse the file from some url. If you want to follow along with this tutorial, upload the csv to some cloud storage provider (s3, dropbox or others) and make sure the file is publicly accessible.

3. Generate the Movie model and migration

Now that we have our dataset, we have to start thinking about how we want to store the data. For now at least we know that we need a movies table, as each row in the csv file represents a movie.

Generate a movie model and we’ll have a look the migration and required columns later.

rails g model Movie

When it comes to deciding what types to use, the first thing to do is look at the information we already have. Have a look at the preview section in the documentation that comes with the dataset on Kaggle. There’s a toggle included that you can use to display the intended data type for each of the columns. This will be our starting point for adding the columns to our migration.

Database types and how to use them is quite a large topic which you should definitely learn about if you haven’t already. This stackoverflow post might be a good place to start.

You can use the following command in the Rails console to show the data types that your database supports;

ActiveRecord::Base.connection.native_database_types.keys

If you’re using Postgresql, the list will be considerably longer than Mysql or Sqlite. But to keep things simple, we’ll stick with the few that are mentioned in the column descriptions in the Kaggle documentation (String, Numeric and DateTime).

So what else is important here? One thing we have to consider is making sure that our imported rows are unique. We don’t want duplicate rows in our database when updating or importing the csv more than once. Another reason is (as I found out) is that there are quite a few duplicate records in this dataset that we need to filter out.

The problem is that there is no real unique identifier for any particular row. You could say the movie title should be unique, but that doesn’t take into account remakes of movies with the same title. One possible solution is to validate the uniqueness of the director name and movie title combination. It’s a reasonable assumption that at least those should be unique.

We can add this constraint by adding a unique index for the director name and movie title. Later on we will add a validation in the code as well.

Keep in mind that there are more things here that could be important such as other indexes, constraints, precision and scale for working with currencies. But for now, this is sufficient.

class CreateMovies < ActiveRecord::Migration[5.0]
  def change
    create_table :movies do |t|
      t.string :color 
      t.string :director_name 
      t.decimal :num_critic_for_reviews 
      t.datetime :duration 
      t.decimal :director_facebook_likes 
      t.decimal :actor_3_facebook_likes 
      t.string :actor_2_name 
      t.decimal :actor_1_facebook_likes 
      t.decimal :gross
      t.string :genres 
      t.string :actor_1_name 
      t.string :movie_title 
      t.decimal :num_voted_users 
      t.decimal :cast_total_facebook_likes 
      t.string :actor_3_name 
      t.decimal :facenumber_in_poster 
      t.string :plot_keywords 
      t.string :movie_imdb_link 
      t.decimal :num_user_for_reviews 
      t.string :language 
      t.string :country 
      t.string :content_rating 
      t.decimal :budget
      t.decimal :title_year 
      t.decimal :actor_2_facebook_likes 
      t.decimal :imdb_score
      t.decimal :aspect_ratio 
      t.decimal :movie_facebook_likes 
      t.timestamps
    end
    add_index :movies, [:director_name, :movie_title], unique: true
  end
end

Run db migrate to create the movies table

rails db:migrate

4. Create a rake task to import the data

Now that we have our data and database setup we can start with our import script. Let’s generate a rake task to import the data.

rails g task csv_import movie_data

Make sure that you have a clear overview of the requirements of your import task. For example; where will the file come from?, file size, frequency of imports, memory usage, logging and more.

In this basic example task, we open and read the file from a url with openURI. This results in a string with the csv contents. We can use the Ruby CSV library to parse the results and convert each row to a hash and saving them to the database.

We want to show some useful output as well, so we’ll keep track of the total amount of rows imported and the number of duplicate records. If you’re running this in production on a frequent basis, it’s best to add a separate logger for this task. You can find more best practices concerning rake tasks in this excellent blog post

require 'csv'

namespace :csv_import do
  desc "Import movie data"
  task movie_data: :environment do
    url = 'https://path-to-your-file.csv'
    csv_string = open(url).read
    total_count = 0
    duplicate_count = 0
      CSV.parse(csv_string, headers: true, header_converters: :symbol) do |row|
        movie_hash = row.to_hash
        movie = Movie.create(movie_hash)
        if movie.persisted?
          total_count += 1
        else
          puts "duplicate: #{movie.director_name} - #{movie.movie_title}"
          duplicate_count += 1
        end
      end
    puts "Imported #{total_count} rows, #{duplicate_count} duplicate rows where not added"
  end
end

If you would run the task at this moment, Postgres will raise an uniqueness error. This is expected, as it probably encountered a duplicate record within the dataset. But we want to make sure that our tasks continues to run and save the records that are valid. We can do this by adding a validation to the Movie model.

class Movie < ApplicationRecord
  validates :movie_title, uniqueness: { scope: :director_name }
end

Because we use Movie.create without the ! (bang) operator in our rake task, the validation will silently fail, so our task continues to run. We use the persisted? method to check if the record is saved or not.

Now we can run the task again with:

rake csv_import:movie_data

Now the task continues to run, and will display the records that are already in the database (based on the unique director and movie title combination).

Have a look at the data in the Rails console by retrieving some records. It should include around 4919 rows. As you can see from our rake task output, there are quite a few duplicate rows in this dataset.

5. Visualizing data with chartkick

One the easiest ways to visualize data with with Rails is by using Chartkick. A very useful ruby gem that allows us to generate charts with only a few lines of Ruby. Under the hood, Chartkick can use Chart.js, Google Charts or Highcharts. We’ve already added the Chartkick gem to our gemfile so the only remaining setup is to include Chartkick in application.js.

To setup Chartkick with Chart.js, add chartkick and Chart.bundle to your application.js file.

//= require jquery
//= require jquery_ujs
//= require turbolinks
//= require Chart.bundle
//= require chartkick
//= require_tree .

We’ll generate two charts, displaying the following data;

  1. Our first chart will group movies by content rating e.g. pg-13, R. And display the number of movies in each group.

  2. Our second chart will display the amount of Facebook likes from each of the 30 highest rated movies (by imdb score).

Start with adding the following scopes to the movie model. The group finder method will group the movies by content rating, count the number of movies in each group and returns a Hash.

The second scope returns the 30 highest rated movies based on imdb_score and will return an ActiveRecord relation collection of objects.

class Movie < ApplicationRecord
  validates :movie_title, uniqueness: { scope: :director_name }
  
  scope :content_rating, -> { group(:content_rating).count }
  scope :highest_rated, -> { order(imdb_score: :desc).limit(30) }
end

You can experiment with these scopes in the Rails console by using Movie.content_rating and Movie.highest_rated.

Next, we need a controller and index view to display the charts.

rails g controller movies index

You can pass data to chartkick in a Hash or Array. Now our first scope (content_rating) already returns a Hash. So we can directly pass that to the view.

Our second scope returns an ActiveRecord relation, which we need to turn into a hash ourselves. A way to do this, is by creating a method that creates an empty hash, merges the selected data for each movie object and returns a hash that we can pass to chartkick. In this example, I’ve included the movie title and imdb score on the y-axis and the number of facebook likes on the x-axis.

class MoviesController < ApplicationController
  def index
    @content_rating = Movie.content_rating
    @movies_with_fb_likes = convert_to_hash(Movie.highest_rated)
  end

  private

  def convert_to_hash(relation)
    chart_data = {}
    relation.each do |movie|
      chart_data.merge!("#{movie.movie_title} #{movie.imdb_score}" => movie.movie_facebook_likes)
    end
    chart_data
  end
end

And finally, to generate the charts with the data, add the following lines to the index view.

<%= column_chart @content_rating, width: "600px" %>
<%= bar_chart @movies_with_fb_likes, height: "600px", width: "600px" %>

That’s it! Only one line of ruby in our view. Need a line chart instead? Just change it to;

<%= line_chart @content_rating, width: "600px" %>

Check out the Chartkick docs and do some experimenting with different charts and options!

Other useful sources

  1. http://dalibornasevic.com/posts/68-processing-large-csv-files-with-ruby

  2. https://edelpero.svbtle.com/everything-you-always-wanted-to-know-about-writing-good-rake-tasks-but-were-afraid-to-ask


Liked this post? Feedback or Questions? Let me know by liking or commenting on Medium