What startups need to know about data science!

If you are building, or thinking of building, a startup product and are concerned that data science is not built into your solution, this article is for you. I am referring to products or services that do not have data science, artificial intelligence, or machine learning as their core function, which covers most of you.

Being in the field of data science and having worked with local startups, I often get asked how to enable data science [or artificial intelligence (AI) or machine learning, all being roughly synonymous for most people] for startup products, or for companies more generally. Unfortunately, the buzz has bitten the beast, so to speak, and in all too many cases people nearly stop production because they get caught up in the confusing mess of technology and statistics required to enable any truly useful application of data science, let alone its computational partner, AI. I don’t say this to discourage but rather to help focus efforts where they matter most. In this post, we will examine the role that data science should play as you build your startup solution. To that end, we will discuss why your product doesn’t need data science yet, what to look for in your initial market testing to spot future data science opportunities, and how to begin laying the groundwork for future data science integration.

Before going much further, I want to pause briefly and define what I mean by “data science.” Although the term is often obscured by associated technologies, when I refer to data science I am referring to the process of capturing data, transforming it so that it can be analyzed, using statistical models to find patterns in that data, and using those models to answer questions (e.g., to make decisions). With this process we can answer questions that are challenging for humans when the data are complex, such as “Who is more at risk for a heart attack?” Or we can help machines answer questions that are easy for humans but hard for machines, such as “Is this a person or an animal?” But I digress; back to our discussion on data science and startups.

Why your startup doesn’t need data science…yet…

First, let’s consider how most startups work. Most startup companies get started because the inventors/creators have identified a human problem that they can solve with their own unique, and often combined, experiences. What is very important to keep in mind here is that your solution, the one you developed from experience rather than data science (okay, maybe a little data science, or at least research, for the more rigorous of us), is solving a human problem without data science (*mind blown*). It seems obvious, but it is fundamental. Second, when you keep this core focus in mind, you see that data science, just like everyone else trying to sell your new startup something it doesn’t need, is a distraction that can sap your startup motivation. Moreover, this illusion is actively perpetuated by the giants who own solutions in the fields you are looking to penetrate and who tout the coolness of their data science capabilities, ultimately leaving you feeling as though you simply can’t compete without data science.

All hyperbole aside, the core message is important to repeat: your product was created sans data science and so should be brought to market sans data science. But that doesn’t mean that you can’t prepare for the future…

What to look for in your initial market release…

Once your product is ready for an alpha release, it becomes important to consider the future opportunities that data science may bring to your product. But how do you prepare for data science when you are still struggling to figure out what it means? Remember that data science can help us, us being people (owners, users, customers, etc.) or machines (apps, robots, phones, etc.), to answer questions. What this means is that you need to be sensitive to the questions that both you and your core customer base have as they experience your product.

Case in point: a startup develops an application that allows people to keep track of their college friends in one centralized location. Fast forward 14 years and Facebook is now a tech giant making strong and notable contributions to basic data science, but it by no means started there. What Zuckerberg did recognize was that his users had questions, and he sought ways in which the data he collected could help his users answer those questions (“Has anyone posted a photo of me?”, “If I have to see advertisements, what ads are most relevant to me?”, “Can’t Facebook just automatically tag my friends?”, etc.).


The takeaway for this section is to listen to your users as you roll out your product. Focus groups, surveys, emails, or any other opportunity to receive feedback is an opportunity to add context to the continued evolution of your product or service. Examine the questions and challenges your users have and consider whether your solution can collect the information needed to answer them. If you identify some information that your product naturally collects from customers that may answer their question, then bingo…you have a data science use case. Thus, data science should be use-case driven, such that each data science solution is attached to clear business value.

Okay, I have got my use cases, what now?

Although getting into specifics surrounding how to establish a data science pipeline like the one I describe at the beginning is beyond the scope of this article, I will leave you with a few ideas to consider along with some resources for digging deeper. The key to answering any question using data science starts with data (like it is literally the beginning of the phrase). This means that you need to identify opportunities and some simple technologies to capture data.

Possible capture mechanisms include:

  • Relational Databases – SQLite, MySQL, PostgreSQL
  • Non-Relational Databases – MongoDB
  • File Systems – e.g., the basic Windows file system
  • Here is a useful description of some of the top open source DB solutions (https://blog.capterra.com/free-database-software/)
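
As a rough illustration of what automatic capture can look like, here is a minimal Python sketch using SQLite, one of the relational options listed above. The table layout and the function names (open_event_store, record_event, count_events_by_type) are hypothetical and chosen only for this example; a real product would define its own schema around its own use cases.

```python
import sqlite3
from datetime import datetime, timezone


def open_event_store(path=":memory:"):
    """Open (or create) a SQLite database with a simple events table.

    The schema is an assumption made for this sketch, not a recommendation.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id    TEXT NOT NULL,
            event_type TEXT NOT NULL,
            detail     TEXT,
            created_at TEXT NOT NULL
        )
        """
    )
    return conn


def record_event(conn, user_id, event_type, detail=None):
    """Store one user action with a UTC timestamp."""
    conn.execute(
        "INSERT INTO events (user_id, event_type, detail, created_at) "
        "VALUES (?, ?, ?, ?)",
        (user_id, event_type, detail, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def count_events_by_type(conn):
    """Return a dict mapping each event type to how often it occurred."""
    rows = conn.execute(
        "SELECT event_type, COUNT(*) FROM events GROUP BY event_type"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    store = open_event_store()  # in-memory database, for demonstration only
    record_event(store, "user-1", "search", "photos of me")
    record_event(store, "user-2", "feedback", "ads are not relevant")
    print(count_events_by_type(store))
```

If record_event is called from the same code path that handles a user action, every action is logged in the same way, without anyone having to remember to enter it by hand.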

Relational databases can be great if you know exactly what you want to capture, but non-relational databases provide more flexibility for collecting information that has less structure. Finally, file systems (like the one on your PC where you save Word docs and family photos) can also be used, but because they will capture anything and are not easy to extract information from, they may not be the best option.

No matter which solution you choose, try to find one that allows you to automatically collect the information from your product or service. This ensures greater consistency in the data and reduces the risk of building biased insights in future analytics. In other words, promising me that you will remember to enter all those survey responses from customers and save them in a file somewhere probably isn’t a good data capture strategy.

Once you have a good, or even decent, mechanism for capturing and saving data, the remaining steps can get a bit complex and may require a more traditional data science consultant to build the insights you are interested in leveraging. It is important to note that I am grossly oversimplifying the data science process here, but by the time you get to this point, hopefully you have generated enough revenue and identified enough high-value use cases to justify hiring some additional help.

For those of you who are interested in more technical details around setting up a more robust data science pipeline, I highly recommend the following blog series that teaches how to leverage cloud resources to execute an end-to-end data science pipeline:


Recognize the hurdles you must overcome before executing on data science in your products:

  • Hurdle 1: Do not over- or underestimate the business value of data science; overestimating is the bigger risk in your early stages
  • Hurdle 2: Be careful not to jump in without a defined plan and process
  • Hurdle 3: Keep in mind that collecting data means keeping information on people, so security and privacy will be important issues to address
  • Hurdle 4: When a high-value use case is identified, clearly define success metrics
  • Hurdle 5: Building data science requires some level of experience with data engineering, statistics, and scripting. Thus, it is essential to find a trusted partner to help enable your budding data science practice.

Thanks for reading and please feel free to reach out to let us know what you liked, didn’t like, or would like to see more of. We are particularly interested in any future content you would like us to examine so don’t be shy.