The world of a Data Scientist

Over the last two to three years we’ve seen a proliferation of new data sources in the marketplace, enabling people to generate more data in a couple of years than had been generated since the advent of the internet. This data explosion, if you will, has introduced new challenges and opportunities for companies that want to remain competitive and relevant in the information age. Terms like Big Data are used to describe this phenomenon, which results from combining data generated externally (social networks, sensor readings, network traffic, mobile devices, etc.) with data generated internally by company systems.

However, this data in its raw form provides no value unless it is analyzed properly and the business takes advantage of it. To do that successfully, we need the right combination of technology, people, business processes, and culture. At the centre of it sits a new field (some would argue that it is not new): Data Science. It is by adopting a scientific process and a fact-based decision-making mindset that companies will start to see tangible results in terms of increased overall business value. This scientific process is what we nowadays call Data Science, and it is what helps turn big data into advantage, value and impact.

Now you might ask what Data Science actually is. The fact is that there is no standard definition, and people from different backgrounds tend to hold different opinions about it. Take for instance Jean-Paul Isson, who understands Data Science to be about understanding the business challenge and creating actionable insights which are then communicated back to the business. Hilary Mason and Chris Wiggins go a bit deeper into the technical areas involved, such as statistics, machine learning, mathematics, and domain expertise. DJ Patil, on the other hand, who is credited with coining the phrase “Data Scientist” back in 2008 at LinkedIn, hints that Data Science is about creating immediate and massive impact on the business through data applications that combine data and science.

Data Science has led companies to undertake business transformation initiatives in order to adopt a data-driven culture within the organisation. At the centre of this is the role of the Data Scientist, which a 2012 article in the Harvard Business Review called the sexiest job of the 21st century. But what exactly is a Data Scientist, what do Data Scientists do, where do they fit within the organization, and how do they differ from, say, a Business Analyst?

Let’s start with the last point.

[Figure: Business Analyst vs Data Scientist]

As you can see from this illustration, Data Scientists, when compared to Business Analysts and Data Engineers, sit at the far right of the analytic spectrum, where predictive and prescriptive analytics are concerned. Data Engineers or Data Architects usually work on problems around building the infrastructure (hardware and software), representing raw data in a computable format, and moving data from system to system. Business Analysts tend to focus on reporting, summarization and interpretation; in other words, they focus on historical analysis of data. In contrast, a Data Scientist is more concerned with generating insights by applying a scientific process to data analysis in order to understand the present and predict the future. Let’s dive a bit deeper into the differences and common features of a Business Analyst vs a Data Scientist.

[Figure: differences and common features of a Business Analyst vs a Data Scientist]

We can see that they share many common attributes, such as SQL skills, use of data mining tools, and the ability to communicate with the business. However, when it comes to more advanced analytics involving statistics, maths, engineering, and big data, we lean more towards the Data Scientist role than the Business Analyst. It is important to understand, though, that both these folks share the same purpose, which is to turn data into insight into action. The methodology and tooling they use are different but the purpose remains the same. To expand on the methodology: we tend to think that a Business Analyst is more concerned with answering known questions using clean historical data, which allows them to analyse past activity. A Data Scientist’s methodology is more experimental in nature. It is more about working with messy data and searching for patterns and new insights. A Data Scientist explores the data and finds answers to questions that the business never thought to ask in the first place.
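
To make the contrast concrete, here is a minimal Python sketch. The monthly sales figures and the function names `summarize` and `forecast_next` are invented for illustration, and a straight-line least-squares fit merely stands in for whatever predictive technique a Data Scientist would actually choose.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative only, not real data).
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def summarize(history):
    """Business Analyst view: describe what has already happened."""
    return {"total": sum(history), "average": mean(history), "best": max(history)}


def forecast_next(history):
    """Data Scientist view: fit a least-squares line and extrapolate one step."""
    n = len(history)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(history)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n  # predicted value for the next period


report = summarize(monthly_sales)
prediction = forecast_next(monthly_sales)
print(report)
print(prediction)
```

The first function answers a known question about the past; the second makes a claim about a period that has not happened yet, which is where validation and experimentation become necessary.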

[Figure: vertical vs horizontal data scientists]

Now the world of the Data Scientist is not black and white. There are differences within the Data Scientist role itself. One particular difference is based on knowledge and experience, and here we tend to think of two groups: vertical and horizontal data scientists. The vertical data scientist is a specialist focused on a specific area: one might be an expert in Hadoop and R, another in Machine Learning and NoSQL, etc. The horizontal data scientist, in contrast, has cross-discipline knowledge and might be good at machine learning and statistics, programming, visualizations, domain expertise, and storytelling. However, these candidates are hard to find. As a result, many companies have resorted to creating data science teams consisting of individuals with specialist skills in different areas (business, domain expertise, machine learning, databases, etc.). This also enables closer collaboration with the business, and as such data science teams are usually embedded in other teams within the organization.

[Figure: exploratory vs operational data scientists]

On the other hand, when comparing Data Scientists on the basis of their day-to-day activities, we can differentiate between exploratory and operational data scientists. An Exploratory Data Scientist is typically found doing investigative analysis or experimentation using interactive statistical environments like R and SPSS and programming languages like Python. An Operational Data Scientist, by contrast, is usually building systems that support scalable machine learning libraries, are production ready, and can be consumed by the business immediately. The two groups use different and sometimes overlapping tools and architectures.

Generally speaking, we see Data Scientists use the following tools, with SQL, R, Python, Excel, and Hadoop leading the pack. Although an increasing amount of unstructured data is being ingested and used by companies, SQL still remains strong in the Data Scientist’s world. The need to utilize existing resources and skills is what’s driving SQL’s dominance in this list, a point reinforced by recent work from large companies on bringing SQL on top of Hadoop. Here’s a survey conducted by O’Reilly during the 2013 Strata Conference, showing the tools most commonly used by conference attendees working in Data and Non-Data roles.

[Figure: most commonly used tools, O’Reilly survey at the 2013 Strata Conference]

In the following example you can see the tools used by a real-world data scientist and how they are categorized under experimentation and production. He uses these tools to develop algorithms, to find, clean and transform data, and ultimately to build production systems and extract value from data.

[Figure: experimentation vs production tools]

So where do Data Scientists fit within an organization? We see organisations positioning data science teams in two different ways:

On some occasions a data science team is kept separate from other teams and communication is limited to meetings and planning sessions.

[Figure: data science team kept separate from other teams]

On other occasions we’re seeing Data Science teams become an integral part of the development team. That effectively increases the data science team’s influence on product innovation and on the strategic positioning of the organization in the market.

[Figure: data science team integrated into the development team]

A Data Scientist will typically follow a common process to generate insight. That process may differ from project to project, but the following phases are quite common in the industry: Acquire, Transform, Model, Learn, Develop Data Product, and Deliver Insight. Naturally you would split this process into two groups: the data preparation and wrangling phase and the insight generation phase.

[Figure: data science process]
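
The phases above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the records in `RAW_ROWS`, the function names, and the trivial "model" (an average) are all hypothetical. Here `model` loosely stands in for the Model and Learn phases, and `deliver` for Develop Data Product and Deliver Insight.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from an export (illustrative only).
RAW_ROWS = [
    {"customer": " alice ", "spend": "120.50"},
    {"customer": "BOB", "spend": "n/a"},
    {"customer": "carol", "spend": "80"},
    {"customer": "", "spend": "45.00"},
]


def acquire():
    """Acquire: in practice this would read from a file, API, or database."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: normalise names, parse numbers, drop unusable records."""
    clean = []
    for row in rows:
        name = row["customer"].strip().title()
        try:
            spend = float(row["spend"])
        except ValueError:
            continue  # amount could not be parsed, skip the record
        if name:  # skip records with no customer name
            clean.append({"customer": name, "spend": spend})
    return clean


def model(rows):
    """Model/Learn: a deliberately trivial 'model', the average spend."""
    return mean(row["spend"] for row in rows)


def deliver(rows, baseline):
    """Deliver insight: list customers who spend above the baseline."""
    return [row["customer"] for row in rows if row["spend"] > baseline]


# Data preparation and wrangling phase
prepared = transform(acquire())
# Insight generation phase
baseline = model(prepared)
insight = deliver(prepared, baseline)
print(prepared, baseline, insight)
```

Even in this toy version, most of the code sits in the preparation half of the pipeline, with very little devoted to the insight itself.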

It is broadly accepted that data exploration, wrangling and modelling are the most time-consuming tasks in this process. Usually, we find a Data Scientist spending almost 80% of the time on typical data wrangling tasks. This is exactly where we’re seeing many companies innovating: bringing that 80% figure down and shifting focus from data wrangling to actual insight and business value generation.

In summary, Data Scientists are increasingly becoming influencers who affect key decision making and infuse a fact-based decision-making mindset within organizations. The world of a Data Scientist is not black and white: we’ve seen that the role can be viewed from many different perspectives and that people usually approach the role and the field from different angles. What’s important to note is that Data Science is all about converting data into insight into action. This can be achieved through embracing a data-driven culture, investing in people and skills, and adopting new technologies and tools, preferably in that order.
