Part 1 - Pandas and Pythons

Note: For source code, please visit the GitHub repository linked at the bottom of this post.

In this blog post, we will be going through some basic functions in Pandas that let us explore the QMJHL data from Pick224.  

Step 1: Loading and Understanding the Dataset

Let's begin by loading the QMJHLData.csv dataset into a Pandas DataFrame. This will allow us to work with the data efficiently. The first thing I want to know is the shape of the dataset, which will give us the number of rows and columns:
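Here is a minimal sketch of that loading step. So that it runs without the real file, it reads a tiny in-memory stand-in instead of QMJHLData.csv; the three rows, the season values, and the "NAME" column are invented for illustration, and only "SEASON" and "DRAFT TEAM" are column names taken from this post.

```python
import io

import pandas as pd

# Stand-in for QMJHLData.csv (assumption: made-up rows and a made-up
# "NAME" column, used only so the sketch is self-contained).
sample_csv = io.StringIO(
    "NAME,SEASON,DRAFT TEAM\n"
    "Player A,2018-19,-\n"
    "Player B,2018-19,MTL\n"
    "Player C,2019-20,-\n"
)

# With the real file this would be: df = pd.read_csv("QMJHLData.csv")
df = pd.read_csv(sample_csv)

# shape is a (number_of_rows, number_of_columns) tuple
print(df.shape)
```

On the real dataset the two numbers will of course be much larger; the first is the row count and the second is the column count.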









Step 2: Checking for Duplicates

Data cleanliness is crucial in any data analysis project. To ensure that we don't have any duplicate rows in our dataset, I'll use the drop_duplicates() method. This method removes duplicate rows and returns a new DataFrame with the unique rows:
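A small sketch of the duplicate check is below. It uses a hand-built DataFrame rather than the real data, and I've planted one duplicate row on purpose so the before and after shapes differ; the "NAME" column and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up sample with one deliberately repeated row (Player B).
df = pd.DataFrame(
    {
        "NAME": ["Player A", "Player B", "Player B", "Player C"],
        "SEASON": ["2018-19", "2018-19", "2018-19", "2019-20"],
        "DRAFT TEAM": ["-", "MTL", "MTL", "-"],
    }
)

# drop_duplicates() returns a new DataFrame; df itself is left unchanged.
deduped = df.drop_duplicates()

print("original shape:", df.shape)
print("without duplicates:", deduped.shape)

# If the two shapes are equal, no row was an exact copy of another.
has_duplicates = df.shape != deduped.shape
print("duplicates found:", has_duplicates)
```

Note that by default drop_duplicates() only treats rows as duplicates when every column matches.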







After running this code, we'll see the shape of the DataFrame without any duplicates. If the output shape matches the original shape, we can be confident that our dataset has no duplicate rows. (We know this data was good because Dave at pick224.com is excellent.)


Step 3: Exploring Player Count by Year

Next, I want to understand how the player count varies across different seasons. To achieve this, I'll use the value_counts() method on the "SEASON" column. This method counts the occurrences of each unique value in the column:
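As a rough sketch of the counting step, here is value_counts() applied to a small made-up DataFrame. The season labels and the "NAME" column are assumptions; the real file may format seasons differently.

```python
import pandas as pd

# Made-up sample: one row per player-season.
df = pd.DataFrame(
    {
        "NAME": ["Player A", "Player B", "Player C", "Player D", "Player E"],
        "SEASON": ["2018-19", "2018-19", "2019-20", "2019-20", "2019-20"],
    }
)

# value_counts() returns a Series indexed by season, largest count first.
season_counts = df["SEASON"].value_counts()
print(season_counts)

# sort_index() reorders the result by season label instead of by count.
print(season_counts.sort_index())
```

Sorting by index is optional, but it makes the season-over-season trend easier to read than the default count ordering.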













The output will show the count of players for each unique season. This information will be helpful in understanding the distribution of players across different years.



Step 4: Filtering Drafted Players

To focus on players who were drafted, I'll filter out the rows where the "DRAFT TEAM" is not specified (indicated by a dash "-"). This will allow us to work exclusively with players who were drafted by an NHL team:
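The filter can be sketched with a boolean mask, followed by the same per-season count as in Step 3. The sample rows, team abbreviations, and the "NAME" column are invented; the sketch also assumes the dash is stored as the exact string "-" with no surrounding whitespace.

```python
import pandas as pd

# Made-up sample: "-" in "DRAFT TEAM" marks an undrafted player.
df = pd.DataFrame(
    {
        "NAME": ["Player A", "Player B", "Player C", "Player D"],
        "SEASON": ["2018-19", "2018-19", "2019-20", "2019-20"],
        "DRAFT TEAM": ["-", "MTL", "TOR", "BOS"],
    }
)

# Keep only the rows whose draft team is something other than a dash.
drafted = df[df["DRAFT TEAM"] != "-"]

# Count the drafted players in each season.
drafted_per_season = drafted["SEASON"].value_counts()
print(drafted_per_season)
```

If the real column turns out to contain padded values such as " - ", calling .str.strip() on the column before the comparison would be one way to handle it.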










The output will provide us with the count of drafted players for each unique season, helping us understand the distribution over the years.


Conclusion

Thank you for joining me on this data analysis journey. If you're interested in exploring the code further, you can find the entire project in my GitHub repository: https://github.com/nathanahearn/hockeystats.

I hope you enjoyed this blog post and learned something new about hockey player data analysis in Python with Pandas.

Best regards,

Nate
