Are you looking for a gentle introduction to object oriented programming in Python? Well, then you are in the right place! In this article we tell you everything you need to understand what object oriented programming is, why it is useful, and how to implement basic object oriented programming in Python.
This article was created as part of a larger case study on developing data science models. That being said, it is also a great standalone resource if you are looking for a gentle introduction to object oriented programming in Python.
What is object oriented programming?
Object oriented programming is a paradigm where, rather than treating all functions and pieces of data as separate entities, associated pieces of code and data are bundled up into one object. These objects can have pieces of data that they keep track of internally as well as functions that have access to this internal data.
Classes and instances
The most fundamental concept in object oriented programming is that of a class. A class is a code blueprint that defines an object, including the behaviors it can perform and the data it holds. Once you have defined a class, you can create multiple instances of that class. An instance of a class is the actual object that is created using the code blueprint.
For example, you might have a class called Dog that defines how to create a Dog object. You can then use that blueprint to create multiple instances of your Dog class that represent different dogs. Depending on what your needs are, the instances of your Dog class can exist completely independently or they can interact with each other.
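For readers who would like to see this in code right away, here is a minimal sketch of one class and two instances. The syntax is covered later in the article, and the dog names are made up for illustration.

```python
class Dog:
    """Blueprint for a dog; each instance stores its own name."""

    def __init__(self, name):
        self.name = name


# Two instances built from the same blueprint.
spot = Dog("spot")
rex = Dog("rex")

print(type(spot) is type(rex))  # True: both are Dog objects
print(spot is rex)              # False: they are separate objects
print(spot.name, rex.name)      # spot rex
```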

Methods and attributes
A class can have any number of methods and attributes. A method is a function that allows the object to perform certain behaviors and an attribute is a piece of data that is stored in the object. Different instances of the same class will have access to the exact same methods and attributes. However, the exact data that is stored within the attributes may be different from instance to instance. The data in the attributes can be set when the class is initialized or by specific methods that modify that data.
For example, say you had a class called Dog with two attributes called name and is_thirsty. The name would likely be set when the instance was created, but is_thirsty might change over time as other methods are called. For instance, is_thirsty might keep track of whether the drink_water method has been called. Some instances of your Dog class may have is_thirsty set to True because drink_water has not been called, whereas others might have is_thirsty set to False.
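The Dog example above can be sketched as follows. This is only an illustration of two instances sharing the same methods while holding different data; the dog names are invented.

```python
class Dog:
    def __init__(self, name):
        self.name = name        # set once, when the instance is created
        self.is_thirsty = True  # changes over time as methods are called

    def drink_water(self):
        self.is_thirsty = False


spot = Dog("spot")
rex = Dog("rex")

# Only spot drinks, so only spot's attribute changes.
spot.drink_water()

print(spot.is_thirsty)  # False
print(rex.is_thirsty)   # True
```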

Object-oriented programming and inheritance
Another concept that is commonly associated with object oriented programming is that of inheritance. This is a pattern that allows you to reuse code that is shared across different classes by allowing multiple classes to inherit from the same base class.
As a conceptual example, if you were coding up a class called Cat and a class called Dog then you might want to let both of those classes inherit from a base class called Animal. This would allow you to store shared methods or attributes that are common to your Cat and Dog classes in your Animal base class. For example, your Animal base class might have a method called drink_water that is applicable to both your Cat and Dog classes.
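A minimal sketch of the Animal, Cat, and Dog example might look like this. The speak method and the animal names are hypothetical additions, included only to show that each subclass can still define behavior of its own.

```python
class Animal:
    """Base class holding the code shared by all animals."""

    def __init__(self, name):
        self.name = name
        self.is_thirsty = True

    def drink_water(self):
        self.is_thirsty = False


class Dog(Animal):
    def speak(self):  # hypothetical method, specific to Dog
        return "woof"


class Cat(Animal):
    def speak(self):  # hypothetical method, specific to Cat
        return "meow"


dog = Dog("spot")
cat = Cat("tom")

# drink_water is defined once on Animal but is available to both subclasses.
dog.drink_water()
print(dog.is_thirsty, cat.is_thirsty)  # False True
print(dog.speak(), cat.speak())        # woof meow
```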
When is object oriented programming useful?
Here are some scenarios where object oriented programming is particularly useful.
- Variations on similar code. Object oriented programming can be useful if you find yourself repeating code because you have several functions that are variations on the same logic. If you find yourself frequently repeating small bits of code in different places, you might be better off using object oriented programming so that you can make use of inheritance and store any shared code in a base class.
- Many functions that take the same static argument. Object oriented programming can also be useful if you find yourself passing the same arguments around to multiple functions. Rather than passing in these arguments each time you call a function, you can just bundle all of these functions up in a class and save the static argument as an attribute of a class. If you save the argument as an attribute of a class, then the methods associated with that class will all have implicit access to that data.
- Many related functions that should logically be grouped. Do you have many related functions that feel like they should be logically grouped together? This is another hint that you might be better off using object oriented programming. For example, if you are working on a data science project and you have multiple functions that are related to feature generation, it might be useful to have a feature generator class.
- Keeping track of states. Another scenario where it may be useful to use object oriented programming is if you need to keep track of states. Returning to the Cat and Dog example, say you want to be able to keep track of whether a dog is awake or asleep. You may be able to do this by updating an external variable, but the cleanest way to do this is to create a Dog class with an is_awake attribute that can be changed to reflect whether the Dog is currently awake.
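To make the second scenario in the list above concrete, here is a small sketch that compares passing the same argument to every function with storing it once as an attribute. The function names, the UnitFormatter class, and the unit argument are all hypothetical.

```python
# Without a class: the same unit argument is passed on every call.
def format_value(value, unit):
    return f"{value} {unit}"


def format_range(low, high, unit):
    return f"{low}-{high} {unit}"


# With a class: unit is stored once and every method reads it from self.
class UnitFormatter:
    def __init__(self, unit):
        self.unit = unit

    def format_value(self, value):
        return f"{value} {self.unit}"

    def format_range(self, low, high):
        return f"{low}-{high} {self.unit}"


kg = UnitFormatter("kg")
print(kg.format_value(5))     # 5 kg
print(kg.format_range(2, 4))  # 2-4 kg
```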
Why use object oriented programming?
Here are some of the benefits of using object oriented programming.
- Reuse code through inheritance. One of the most compelling reasons to use object oriented programming is that it allows you to share common code through inheritance. This reduces the amount of code that you need to write and maintain. More importantly, it also protects against the errors and inconsistencies that might otherwise occur if you need to update a shared piece of code and you only remember to update it in two of the three places it appears.
- Less passing of variables. Code starts to get messy when you find yourself passing many, many variables to your functions. Methods associated with a given class have access to the internal data related to a class. That means that rather than passing the same argument to four different functions, you can set that argument as an attribute of a class and set those functions as methods of that class. Now the methods will implicitly have access to that data without the data needing to be explicitly passed to them each time.
- Bundle code into logical units. Using object oriented programming makes it easy to bundle multiple functions that perform related tasks into one logical unit. This makes it much easier for others who are reading your code for the first time to navigate and understand your code.
- Multiple instances of the same object with different internal states. Using object oriented programming makes it much easier to keep track of states that change over time. This is especially true if you need to keep track of multiple different kinds of states or different versions of the same state that are associated with separate entities.
Object oriented programming in Python
Are you ready to get started with object oriented programming in Python? In this section we will go over some of the high-level concepts associated with object oriented programming in Python. In the next section we will go into more detail and provide examples with code.
How to define classes in Python
The first thing you need to know about object oriented programming in Python is how to define a basic class in Python. For this section of the article, we will continue along with the Dog example that we followed earlier in the article.
Just like you use the keyword def to tell Python that you are about to define a function, you can use the keyword class to tell Python that you want to define a class. This keyword should be followed by the name of the class you want to create, which in our case is Dog. Optionally, you can also include the name of the base class that you want this class to inherit from in parentheses after the name of your class. In this case, we will inherit from another class called Animal that we will assume has already been defined elsewhere.
The class keyword, class name, and base class will serve as the header of your class definition. Under this line, you can define any methods that you want your class to have. Generally, you will at least need to define an __init__ method, which is the initialization method that is called when a new instance of your class is created. We will discuss this more in a minute.
class Dog(Animal):
    def __init__(self):
        pass
Passing data using self
In addition to class, there is another important name you need to know before defining a class in Python, which is self. Strictly speaking, self is not a reserved keyword but a strong naming convention for the first argument of a method, and it refers to the instance that the method is called on. It is used to access internal data (or attributes) that are associated with that instance. For example, if you want to save a piece of data called is_hungry as an internal datapoint that is accessible from within your class, you would type the following inside one of its methods.
self.is_hungry = True
Any method associated with your class that needs to access the data stored in your class needs to have self as the first argument of the function definition. Continuing along with the Dog example, you might define the following method as part of your Dog class.
def drink_water(self):
    self.is_thirsty = False
Note that when you are actually calling a method associated with a class, you do not need to pass anything for the self argument. This argument gets passed implicitly. For example, since the drink_water method does not have any other arguments, you can call the drink_water method on an instance of your Dog class called dog using the following command.
dog.drink_water()
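To see what "passed implicitly" means, the sketch below calls the same method in two ways. It uses a simplified Dog class without a base class so that it runs on its own; the second call style is shown only for illustration and is rarely written in practice.

```python
class Dog:
    def __init__(self):
        self.is_thirsty = True

    def drink_water(self):
        self.is_thirsty = False


dog = Dog()
other_dog = Dog()

# Usual form: Python fills in self with dog automatically.
dog.drink_water()

# Equivalent form: look the method up on the class and pass the instance yourself.
Dog.drink_water(other_dog)

print(dog.is_thirsty, other_dog.is_thirsty)  # False False
```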
The __init__ method
As we mentioned before, the __init__ method of a Python class is the method that defines how an instance of your class is created. Any arguments that are required to be passed to this method will be required to create a new instance of your class. If you want to save the values of arguments that are passed to the __init__ method so that they will be available to other methods in the class, you should make sure to save the data to the self object.
Below is an example that contains a basic __init__ method for our Dog class. This method requires that an argument called name is passed in when a new instance of the Dog class is created.
class Dog(Animal):
    def __init__(self, name):
        self.is_thirsty = True
        self.name = name

    def drink_water(self):
        self.is_thirsty = False
Creating an instance of a Python class
Now we have created a basic Python class and we want to create an instance of this class. In order to do this, you simply need to type the name of the class, followed by any arguments that are required by the __init__ method of the class. For example, here is how we would create a new instance of our Dog class.
dog = Dog(name='spot')
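The sketch below shows what happens when a required __init__ argument is supplied and when it is left out. The empty Animal class is a stand-in that is assumed here only so the example runs by itself.

```python
class Animal:
    """Stand-in base class, assumed only so this example is self-contained."""


class Dog(Animal):
    def __init__(self, name):
        self.is_thirsty = True
        self.name = name


dog = Dog(name="spot")
print(dog.name)  # spot

# Leaving out a required __init__ argument raises a TypeError.
missing_name_error = None
try:
    Dog()
except TypeError as error:
    missing_name_error = error
    print(f"Could not create Dog: {error}")
```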
Other special methods
There are a few special types of methods that can be associated with a Python class. We will go over these special methods in the following section.
Static methods
The first special method you should know about is a static method. A static method is simply a method that does not require the self object to be passed. That means that it does not use or update any of the data stored in the attributes and it does not call any other methods associated with the class. If you want to make a method a static method, you can use the @staticmethod decorator and not pass self as an argument to your method.
For example, if we wanted to add a bark method to our Dog class that simply printed out the word woof, our code might look something like this.
@staticmethod
def bark():
    print('woof')
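Because a static method does not depend on any instance data, it can be called either on the class itself or on an instance. In this sketch bark returns the string instead of printing it, which is a small change made here only so the result is easy to inspect.

```python
class Dog:
    @staticmethod
    def bark():
        # No self argument: the method cannot read or change instance data.
        return "woof"


print(Dog.bark())    # called on the class, no instance needed
print(Dog().bark())  # called on an instance, same result
```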
Class methods
The second special type of method you should know about is a class method. A class method receives the class itself, rather than an instance, as its first argument. Class methods are most commonly used as alternative constructors: they parse some data, format it in the way that the __init__ method of your class requires, and then create and return a new instance of the class.
In order to define a class method, you need to use the @classmethod decorator and then use the name cls to represent the class. Like self, cls is a naming convention rather than a reserved keyword. It is similar to self in that you need to include it as the first argument when you define the method, but you do not need to include it when you actually call the method.
For example, if we wanted to have a method that allows you to read in a dictionary and create a dog object based on the contents of that dictionary, you could create a class method that looks something like this.
@classmethod
def from_dict(cls, dct):
    name = dct.get('name')
    return cls(name=name)
After you define your class method, you can call the method like this.
dct = {'name': 'spot'}
dog = Dog.from_dict(dct)
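One reason to write cls(...) rather than Dog(...) inside a class method is that cls refers to whichever class the method was called on. The sketch below illustrates this with a hypothetical Puppy subclass; the Dog class is simplified so the example is self-contained.

```python
class Dog:
    def __init__(self, name):
        self.name = name

    @classmethod
    def from_dict(cls, dct):
        # cls is whichever class from_dict was called on.
        return cls(name=dct.get("name"))


class Puppy(Dog):
    """Hypothetical subclass, used only to show what cls refers to."""


dog = Dog.from_dict({"name": "spot"})
puppy = Puppy.from_dict({"name": "rex"})

print(type(dog).__name__)    # Dog
print(type(puppy).__name__)  # Puppy
```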
Using OOP to prepare data for modeling
Now we will use object oriented programming to complete the next step of our case study on building production-ready data science models. If you were just looking for a high level introduction to object oriented programming in Python, you can drop off now. If you want to learn more about building out production-ready models, we recommend that you check out our case study overview for more details.
For this step of the case study, we will build out some basic code to read in our data and prepare the data for model training. As you remember from our post on checking data quality, the data we are using is already pretty clean and ready to use. That means that we will just need to encode any categorical variables and split the data into a test set and a training set.
For the sake of this example, we will assume that the bank-additional-full data we have been using in our case study is stored within the main directory for our project. Specifically, we will assume the data is stored under our project directory in this relative path: data/input/bank-additional-full.csv
Create a basic script
Before we actually implement any object oriented programming, we will create some simple functions to perform all the tasks we need to complete. We will start out by creating functions that read in the data, one hot encode any categorical variables, split the data into a training set and a test set, and write the data out.
We will place these functions in the package we created in the previous steps of this case study so that we can import the functions and use them in other scripts.
import os
from pathlib import Path

import numpy as np
import pandas as pd
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split

from bank_deposit_classifier.sample import upsample_minority_class

DATA_DIR = os.path.join(
    Path(__file__).parents[2],
    'data/intermediate'
)


def get_data(data_path, features=None):
    # The input file is semicolon delimited.
    data = pd.read_csv(data_path, sep=';')
    if features:
        data = data[features]
    print(f'Data read from {data_path}')
    return data


def encode_data(data, one_hot_encoder=None, categorical_features=None):
    if categorical_features:
        categorical = data[categorical_features]
    else:
        # Default to every non-numeric column.
        categorical = data.select_dtypes(exclude=np.number)
        categorical_features = categorical.columns
    if not one_hot_encoder:
        # sparse_output and get_feature_names_out need scikit-learn 1.2 or
        # later; older releases used sparse and get_feature_names instead.
        one_hot_encoder = preprocessing.OneHotEncoder(
            sparse_output=False,
            drop='first'
        )
    categorical = one_hot_encoder.fit_transform(categorical)
    categorical = pd.DataFrame(
        categorical,
        columns=one_hot_encoder.get_feature_names_out(categorical_features)
    )
    continuous = data.select_dtypes(include=np.number)
    data = pd.concat([categorical, continuous], axis=1)
    return data, one_hot_encoder


def split_data(data, outcome):
    # Upsample the minority class, then split the upsampled data.
    sampled_data = upsample_minority_class(data, outcome, 0.5)
    train, test = train_test_split(
        sampled_data,
        random_state=123,
        train_size=0.8
    )
    return train, test


def save_data(data, file_name='data.csv'):
    path = os.path.join(DATA_DIR, file_name)
    data.to_csv(path, index=False)
    print(f'Data written to {path}')
After we create our functions, we will create a script that imports and applies all of the functions. We will create a new directory within our main project directory under the data/intermediate path to store the datasets that are created by this script.
import os
from pathlib import Path

from bank_deposit_classifier.prep_data import (
    encode_data,
    get_data,
    save_data,
    split_data,
)

INPUT_DATA_DIR = os.path.join(
    Path(__file__).parents[1],
    'data/input'
)

# TODO: preserve base outcome name after one hot encoding
outcome = 'y_yes'
data_path = os.path.join(INPUT_DATA_DIR, 'bank-additional-full.csv')
features = None
categorical_features = None

data = get_data(data_path, features=features)
# Pass categorical_features by keyword: the second positional argument
# of encode_data is the one hot encoder.
data, encoder = encode_data(data, categorical_features=categorical_features)
train, test = split_data(data, outcome)
save_data(train, 'train.csv')
save_data(test, 'test.csv')
Create a basic class
After we create our basic functions, we will create a class that can be used to encapsulate all of these functions. We will call this class DataPrep and store relevant pieces of data such as the name of the outcome variable and the features we want to filter to as attributes of the class.
import os
from pathlib import Path

import numpy as np
import pandas as pd
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split

from bank_deposit_classifier.sample import upsample_minority_class

INTERMEDIATE_DATA_DIR = os.path.join(
    Path(__file__).parents[2],
    'data/intermediate'
)


class DataPrep:

    def __init__(self, outcome, features=None, categorical_features=None,
                 one_hot_encoder=None):
        self._outcome = outcome
        self._features = features
        self._categorical_features = categorical_features
        self._one_hot_encoder = one_hot_encoder

    def get_data(self, data_path):
        # The input file is semicolon delimited.
        data = pd.read_csv(data_path, sep=';')
        if self._features:
            data = data[self._features]
        print(f'Data read from {data_path}')
        return data

    def encode_data(self, data):
        if self._categorical_features:
            categorical_features = self._categorical_features
            categorical = data[categorical_features]
        else:
            # Default to every non-numeric column.
            categorical = data.select_dtypes(exclude=np.number)
            categorical_features = categorical.columns
        if self._one_hot_encoder:
            one_hot_encoder = self._one_hot_encoder
        else:
            # sparse_output and get_feature_names_out need scikit-learn 1.2
            # or later; older releases used sparse and get_feature_names.
            one_hot_encoder = preprocessing.OneHotEncoder(
                sparse_output=False,
                drop='first'
            )
        categorical = one_hot_encoder.fit_transform(categorical)
        categorical_columns = one_hot_encoder.get_feature_names_out(
            categorical_features
        )
        categorical = pd.DataFrame(categorical, columns=categorical_columns)
        continuous = data.select_dtypes(include=np.number)
        data = pd.concat([categorical, continuous], axis=1)
        return data, one_hot_encoder

    def split_data(self, data):
        # Upsample the minority class, then split the upsampled data.
        sampled_data = upsample_minority_class(data, self._outcome, 0.5)
        train, test = train_test_split(
            sampled_data,
            random_state=123,
            train_size=0.8
        )
        return train, test

    def save_data(self, data, file_name='data.csv'):
        path = os.path.join(INTERMEDIATE_DATA_DIR, file_name)
        data.to_csv(path, index=False)
        print(f'Data written to {path}')
After we update our package, we can update the script that imports the package to make use of our new DataPrep class. We will create an instance of our DataPrep class and pass it all of the data that is needed to initialize the instance.
import os
from pathlib import Path

from bank_deposit_classifier.prep_data import DataPrep

INPUT_DATA_DIR = os.path.join(
    Path(__file__).parents[1],
    'data/input'
)

# TODO: preserve base outcome name after one hot encoding
outcome = 'y_yes'
data_path = os.path.join(INPUT_DATA_DIR, 'bank-additional-full.csv')
features = None
categorical_features = None

dp = DataPrep(
    outcome=outcome,
    features=features,
    categorical_features=categorical_features
)
data = dp.get_data(data_path)
data, encoder = dp.encode_data(data)
train, test = dp.split_data(data)
dp.save_data(train, 'train.csv')
dp.save_data(test, 'test.csv')
Add a class method
Now we have created a class that contains all of our data preparation functions. Our next step will be to add a class method. You can add any kind of class method you want, but for now we will add a method that creates a DataPrep instance from a YAML config file. Within the class method, we will read in the contents of the file as a dictionary, parse out the arguments required for initialization, and pass the arguments to a new DataPrep instance.
import os
from pathlib import Path

import numpy as np
import pandas as pd
import sklearn.preprocessing as preprocessing
import yaml
from sklearn.model_selection import train_test_split

from bank_deposit_classifier.sample import upsample_minority_class

BASE_DIR = Path(__file__).parents[2]
DATA_DIR = os.path.join(BASE_DIR, 'data')
CONFIG_DIR = os.path.join(BASE_DIR, 'config')


class DataPrep:

    def __init__(self, outcome, features=None, categorical_features=None,
                 one_hot_encoder=None):
        self._outcome = outcome
        self._features = features
        self._categorical_features = categorical_features
        self._one_hot_encoder = one_hot_encoder

    def get_data(self, path):
        # Paths are now given relative to the data directory.
        path = os.path.join(DATA_DIR, path)
        data = pd.read_csv(path, sep=';')
        if self._features:
            data = data[self._features]
        print(f'Data read from {path}')
        return data

    def encode_data(self, data):
        if self._categorical_features:
            categorical_features = self._categorical_features
            categorical = data[categorical_features]
        else:
            # Default to every non-numeric column.
            categorical = data.select_dtypes(exclude=np.number)
            categorical_features = categorical.columns
        if self._one_hot_encoder:
            one_hot_encoder = self._one_hot_encoder
        else:
            # sparse_output and get_feature_names_out need scikit-learn 1.2
            # or later; older releases used sparse and get_feature_names.
            one_hot_encoder = preprocessing.OneHotEncoder(
                sparse_output=False,
                drop='first'
            )
        categorical = one_hot_encoder.fit_transform(categorical)
        categorical_columns = one_hot_encoder.get_feature_names_out(
            categorical_features
        )
        categorical = pd.DataFrame(categorical, columns=categorical_columns)
        continuous = data.select_dtypes(include=np.number)
        data = pd.concat([categorical, continuous], axis=1)
        return data, one_hot_encoder

    def split_data(self, data):
        # Upsample the minority class, then split the upsampled data.
        sampled_data = upsample_minority_class(data, self._outcome, 0.5)
        train, test = train_test_split(
            sampled_data,
            random_state=123,
            train_size=0.8
        )
        return train, test

    def save_data(self, data, path):
        path = os.path.join(DATA_DIR, path)
        data.to_csv(path, index=False)
        print(f'Data written to {path}')

    @classmethod
    def from_yaml(cls, path):
        # Read the config file into a dictionary and pass its keys
        # to __init__ as keyword arguments.
        path = os.path.join(CONFIG_DIR, path)
        with open(path, 'r') as file:
            params = yaml.load(file, Loader=yaml.FullLoader)
        return cls(**params)
After we make our changes to our DataPrep class, we will also make changes to the script that uses that class. Now the script will read in any configuration parameters from an external file rather than having the configuration parameters hardcoded in the code. This is generally a good pattern to follow.
from bank_deposit_classifier.prep_data import DataPrep

input_path = 'input/bank-additional-full.csv'
train_path = 'intermediate/train.csv'
test_path = 'intermediate/test.csv'
config_path = 'data_prep.yaml'

dp = DataPrep.from_yaml(config_path)
data = dp.get_data(input_path)
data, encoder = dp.encode_data(data)
train, test = dp.split_data(data)
dp.save_data(train, train_path)
dp.save_data(test, test_path)
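The contents of data_prep.yaml are not shown in this article, so the following is only a sketch of how the cls(**params) step in from_yaml behaves. It uses a hypothetical stand-in class with the same __init__ arguments as DataPrep and a plain dictionary in place of the parsed YAML file; the outcome and feature names are assumed example values.

```python
class DataPrepDemo:
    """Hypothetical stand-in with the same __init__ arguments as DataPrep."""

    def __init__(self, outcome, features=None, categorical_features=None,
                 one_hot_encoder=None):
        self._outcome = outcome
        self._features = features
        self._categorical_features = categorical_features
        self._one_hot_encoder = one_hot_encoder

    @classmethod
    def from_params(cls, params):
        # Each dictionary key becomes a keyword argument of __init__.
        return cls(**params)


# Assumed example of what a parsed config might look like, for a file such as:
#   outcome: y
#   categorical_features:
#     - job
#     - marital
params = {"outcome": "y", "categorical_features": ["job", "marital"]}
dp = DataPrepDemo.from_params(params)

print(dp._outcome)               # y
print(dp._categorical_features)  # ['job', 'marital']
print(dp._features)              # None (not in the config, so the default is used)
```

One consequence of this design is that every key in the config file must match an argument name of __init__. A misspelled or unexpected key raises a TypeError when the instance is created.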
Static methods
Finally, we will add an example of a static method to our class. For this example, we will add a simple method called standardize_names that reads in a string, replaces whitespace and any other non-word characters with underscores, and lowercases the entire string. This will be useful for standardizing the names of the features in our dataset.
import os
import re
from pathlib import Path

import numpy as np
import pandas as pd
import sklearn.preprocessing as preprocessing
import yaml
from sklearn.model_selection import train_test_split

from bank_deposit_classifier.sample import upsample_minority_class

BASE_DIR = Path(__file__).parents[2]
DATA_DIR = os.path.join(BASE_DIR, 'data')
CONFIG_DIR = os.path.join(BASE_DIR, 'config')


class DataPrep:

    def __init__(self, outcome, features=None, categorical_features=None,
                 one_hot_encoder=None):
        self._outcome = outcome
        self._features = features
        self._categorical_features = categorical_features
        self._one_hot_encoder = one_hot_encoder

    def get_data(self, path):
        path = os.path.join(DATA_DIR, path)
        data = pd.read_csv(path, sep=';')
        if self._features:
            data = data[self._features]
        print(f'Data read from {path}')
        return data

    def encode_data(self, data):
        if self._categorical_features:
            categorical_features = self._categorical_features
            categorical = data[categorical_features]
        else:
            # Default to every non-numeric column.
            categorical = data.select_dtypes(exclude=np.number)
            categorical_features = categorical.columns
        if self._one_hot_encoder:
            one_hot_encoder = self._one_hot_encoder
        else:
            # sparse_output and get_feature_names_out need scikit-learn 1.2
            # or later; older releases used sparse and get_feature_names.
            one_hot_encoder = preprocessing.OneHotEncoder(
                sparse_output=False,
                drop='first'
            )
        categorical = one_hot_encoder.fit_transform(categorical)
        categorical_columns = one_hot_encoder.get_feature_names_out(
            categorical_features
        )
        categorical = pd.DataFrame(categorical, columns=categorical_columns)
        continuous = data.select_dtypes(include=np.number)
        data = pd.concat([categorical, continuous], axis=1)
        data = self.fix_column_names(data)
        return data, one_hot_encoder

    def split_data(self, data):
        # Upsample the minority class, then split the upsampled data.
        sampled_data = upsample_minority_class(data, self._outcome, 0.5)
        train, test = train_test_split(
            sampled_data,
            random_state=123,
            train_size=0.8
        )
        return train, test

    def save_data(self, data, path):
        path = os.path.join(DATA_DIR, path)
        data.to_csv(path, index=False)
        print(f'Data written to {path}')

    def fix_column_names(self, data):
        data.columns = [self.standardize_names(x) for x in data.columns]
        # After one hot encoding, the outcome column is named
        # <outcome>_<level>; rename it back to the base outcome name.
        pattern = self._outcome + '_'
        matches = [x for x in data.columns if re.match(pattern, x)]
        if len(matches) != 1:
            raise Exception('Cannot uniquely identify outcome column!')
        data = data.rename(columns={matches[0]: self._outcome})
        return data

    @staticmethod
    def standardize_names(name):
        # Raw string so that \W is read as a regex class, not a string escape.
        return re.sub(r'\W', '_', name).lower()

    @classmethod
    def from_yaml(cls, path):
        path = os.path.join(CONFIG_DIR, path)
        with open(path, 'r') as file:
            params = yaml.load(file, Loader=yaml.FullLoader)
        return cls(**params)
Since standardize_names is only called internally within the DataPrep class, no changes to our script are required.
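To see what the name standardization and the outcome matching do, here is a small standalone sketch that uses only the re module. The column names and the outcome name y are assumed example values rather than output from the real dataset.

```python
import re


def standardize_names(name):
    # Replace every character that is not a letter, digit, or underscore.
    return re.sub(r"\W", "_", name).lower()


print(standardize_names("Job Type"))      # job_type
print(standardize_names("emp.var.rate"))  # emp_var_rate

# Finding the one hot encoded outcome column by its prefix.
outcome = "y"  # assumed base name of the outcome column
columns = ["age", "job_student", "y_yes"]
matches = [x for x in columns if re.match(outcome + "_", x)]
print(matches)  # ['y_yes']
```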
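Note: Be careful with the spelling of categorical in the encode_data method. An earlier draft of this code misspelled it as categorial, which leaves categorical unassigned whenever categorical_features is provided and causes the method to fail.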