Unsure Where to Start Learning Data Science? Start Here

It is now well known that demand for data science and related jobs has grown significantly over the last decade. Companies predicted significant growth in data science roles, and those predictions have been met or exceeded (1). But along with rising demand come greater expectations for the skills required to support the needs of businesses (2).

Making matters more challenging, data science, machine learning, and artificial intelligence already have very broad meanings that cover a wide range of skills and knowledge. For someone trying to become a data scientist, machine learning engineer, or AI practitioner, this variety of skill sets and technologies can feel daunting, particularly when trying to decide where to begin. And while a great many resources are available to help, each tackles the problem with a different set of assumptions about where the reader is coming from and the background she brings before ever reading the first paragraph of the “Start here” guide. But there is good news. Because the fields of data science, machine learning, and AI are so broad in applicable skills, it is very likely that you already have a skill set that may, at some point, help you become your own flavor of data science professional.

Amid the crowded “Start here” space, I offer my own recipe of recommendations for where to start. To that end, I have tried to strip out any assumptions about where you may be coming from, with the understanding that I still had to make a few.

Assumptions

  1. I am assuming that you feel comfortable with a computer. What do I mean? Because data science and related technologies depend largely on computers, it is important that you have some comfort with them and don’t mind patiently learning how to use them to interact with data.
  2. I am also assuming that you want to learn. Despite what you may have heard, data scientists and the like are lifelong learners who constantly seek out new techniques, skills, and technologies and patiently work to apply what they have learned in creative ways to solve problems.

Okay, I lied. Start here:

The best place to begin is by playing with data. Sounds simple, right? Not so fast. Even the simple goal of learning how to mess with data entails learning a few associated skills and technologies. To even begin working with data, you need to learn how to access it from a computer, which means learning the software tools that make that access possible. And while there are an increasing number of no-code data science tools, a solid understanding of how to use code for data science tasks helps keep your skills applicable across as many companies as possible.

Who’s your data?

Every good data scientist is good with data sets, so your journey should begin by building an understanding of how data sets work, including:

  1. Data Sets. This skill involves understanding how data sets are represented in different ways. Associated skills and buzz words to learn more about include:
    1. Comma separated values (CSV) files (also referred to as flat files)
    2. Database Tables with rows and columns
    3. Matrices (the numerical equivalent of tables)
    4. JSON for more advanced representations of data
  2. Database Technology. This skill involves understanding how database software is used to store data sets. The most common type of database technology is the relational database, often called a SQL database. Associated skills and buzz words to learn more about include:
    1. Database management
    2. Data engineering
    3. Extract, Transform, Load (ETL)
    4. SQL, the most common language used to query and mess with data in databases
  3. Data wrangling. This skill involves using languages like SQL and Python or R to change data sets, and the data values in those data sets, for different purposes. Associated skills and buzz words to learn more about include:
    1. SQL (as noted above, but listed again to emphasize its importance), a query language used specifically with relational databases
    2. Python and/or R (the most popular scripting languages for basic and advanced data wrangling)
    3. Aggregating data
    4. Recoding data
    5. Reshaping data
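To make the first item in the list above a little more concrete, here is a minimal sketch of one small data set shown as a flat file, as table-like rows, as a matrix, and as JSON. The sample values and the column names (region, units, price) are invented purely for illustration, and it uses only Python's standard library so it runs without installing anything.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = """region,units,price
north,10,2.5
south,4,3.0
north,6,2.0
"""

# 1. Flat file (CSV) -> table-like rows: each row maps a column name to a value.
#    Note that every value read from a CSV file starts out as text.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# 2. Matrix: keep only the numeric columns, as a list of lists of floats.
matrix = [[float(r["units"]), float(r["price"])] for r in rows]

# 3. JSON: a nested representation that groups the rows under each region.
nested = {}
for r in rows:
    nested.setdefault(r["region"], []).append(
        {"units": int(r["units"]), "price": float(r["price"])}
    )
json_text = json.dumps(nested, indent=2)

print(rows[0])
print(matrix)
print(json_text)
```

The same three rows appear in every form; what changes is how easy each form makes a given task, such as arithmetic on the matrix or nesting in the JSON.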

At its core, learning how to use a scripting language like Python to access data from both flat files (CSV) and databases, wrangle that data using Python directly and/or by passing SQL commands to the database, and wrangle it further with Python’s powerful Pandas library will get you a long way on your path to becoming a data scientist. In fact, a great many “data science” solutions may never involve a statistical model at all but rather some creative data wrangling to get the job done. Thus, learning how to interact with data is fundamental to becoming a good data scientist.
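As a rough sketch of that workflow, the example below reads a small CSV, loads it into an in-memory SQLite database, passes a SQL command to aggregate it, and then recodes and reshapes the result in plain Python. The table name (sales), the column names, the sample values, and the "high"/"low" threshold are all assumptions made up for this illustration. I use only the standard library here to keep the example self-contained; in practice pandas offers similar steps through functions such as read_csv and read_sql.

```python
import csv
import io
import sqlite3

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = """region,quarter,units
north,Q1,10
north,Q2,6
south,Q1,4
south,Q2,9
"""


def load_sales(conn, csv_text):
    """Read CSV text and insert its rows into a (hypothetical) sales table."""
    conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, units INTEGER)")
    reader = csv.DictReader(io.StringIO(csv_text))
    # CSV values arrive as text, so convert units to an integer explicitly.
    rows = [(r["region"], r["quarter"], int(r["units"])) for r in reader]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def total_units_by_region(conn):
    """Aggregate: let the database sum the units for each region."""
    query = "SELECT region, SUM(units) FROM sales GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


def recode_units(units, threshold=8):
    """Recode: turn a number into a label (the threshold is an arbitrary choice)."""
    return "high" if units >= threshold else "low"


def reshape_wide(conn):
    """Reshape: one entry per region, with one value per quarter."""
    wide = {}
    for region, quarter, units in conn.execute(
        "SELECT region, quarter, units FROM sales"
    ):
        wide.setdefault(region, {})[quarter] = units
    return wide


conn = sqlite3.connect(":memory:")
load_sales(conn, CSV_TEXT)

totals = total_units_by_region(conn)
wide = reshape_wide(conn)
labels = {
    (region, quarter): recode_units(units)
    for region, quarters in wide.items()
    for quarter, units in quarters.items()
}

print(totals)
print(wide)
print(labels)
```

Notice that no statistical model appears anywhere: the aggregation happens in SQL, and the recoding and reshaping happen in Python.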

Get started by downloading my ebook, which builds these foundational skills in the context of a specific business use case and even dives into some of the additional concepts noted below.

What next?

Once the foundations in accessing and wrangling data are laid, other skills to begin learning and applying immediately to the data sets you are playing with include the following:

  • Descriptive Statistics
    • Descriptive statistics, particularly measures of central tendency such as the arithmetic average (mean) and measures of variability such as the variance and standard deviation, add more dexterity to your data wrangling skills.
  • Inferential Statistics & Probability
    • Probability Distributions
    • Significance Testing
    • Regression
  • Linear Algebra (aka Matrix Algebra) & Multivariable Calculus
    • Derivatives & Gradients
    • Scalars, vectors, matrices, & tensors
    • Cost Functions
    • Probabilistic Functions
  • Software Programming Concepts & Computation
    • Software packages (aka libraries)
    • Code versioning with Git
    • Deployments
      • APIs
      • Batch
    • Development Lifecycle (dev, test, prod)
    • Server Architecture
    • Performance Tuning
    • Containers
    • Kubernetes
  • Deep Learning
    • Although this is technically a more specific form of statistical modeling, deep learning architectures are valuable to learn because they can apply to a great number of business problems.
  • Data Visualization
    • Types of Visualizations
    • BI Tools (Power BI, Tableau)
  • Cloud Computing Environments
    • Passing Data to Cloud AI APIs
    • Google
      • Kubeflow & AI Hub
    • Azure
      • Databricks
    • Amazon Web Services
      • SageMaker

As you embark on your data science journey, remember that many problems can be solved with some creative data wrangling, which is why I emphasize getting comfortable manipulating data. Once you feel comfortable manipulating data, move on to the more advanced topics, and always remember to apply what you are learning. Application, experimentation, and creativity go hand in hand.

Links to References

  1. https://www.dataversity.net/data-science-trends-in-2020/#
  2. https://www.bmmagazine.co.uk/business/data-science-will-grow-in-scope-by-2020/