Big Data's big issue: Where are all the data scientists coming from?

This personnel gap isn't just a job-title change


Analysis Plug “data scientist” into Google and it is clear that the job title has finally come of age and that, suddenly, there is a huge skills shortage.

An oft-quoted source on this shortage is a McKinsey Global Institute study, which predicts a talent gap of 140,000 to 190,000 people in the US alone by 2018. I am always sceptical of IT projections more than 18 months ahead (let alone six years), but I am convinced there is currently a huge skills shortage that is not going away in the next 17 months and 29 days.

So, what is a data scientist? My favourite description comes from Twitter: “Yeah, so I'm actually a data scientist. I just do this barista thing in between gigs.” More cynically: “A data scientist is just an analyst who lives in California.”

Possibly more accurate is that a data scientist (DS) is “a better software engineer than any statistician and a better statistician than any software engineer”. In other words, an important part of the job is to be able to design a novel analytical algorithm for a specific set of data and then implement that algorithm in the appropriate computer language.

Data scientists excel at analysing data, particularly large amounts of data that does not fit easily into tabular structures: so-called "Big Data".

For example, you should be able to point a data scientist at a web log and say: “Find the different patterns of behaviour in our users.” Or think about oil rigs for a moment. Breaking a drill bit during DIY work is irritating; in the middle of the North Sea it is annoying and very, very expensive. But if you collect enough sensor data (such as temperature, vibrations and RPM) you eventually have data for both normal running and breakages. You then point a data scientist at the data and say: “Build a system that predicts breakages before they happen.”
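The drill-bit idea can be sketched in miniature. The Python toy below uses entirely synthetic sensor data and a simple nearest-centroid classifier as a stand-in for whatever model a real data scientist would build; the premise that pre-breakage runs are hotter and vibrate more is an assumption for illustration only.

```python
import random

random.seed(0)  # deterministic synthetic data

def reading(pre_breakage):
    """One synthetic sensor sample: (temperature C, vibration mm/s, RPM).
    Pre-breakage runs are assumed hotter and noisier."""
    temp = random.gauss(95 if pre_breakage else 70, 5)
    vib = random.gauss(9 if pre_breakage else 3, 1)
    rpm = random.gauss(115 if pre_breakage else 120, 8)
    return (temp, vib, rpm)

# Historical data: 200 normal runs and 200 runs that preceded a breakage.
train = ([(reading(False), 0) for _ in range(200)]
         + [(reading(True), 1) for _ in range(200)])

def centroid(label):
    """Mean of each sensor channel for one class."""
    rows = [x for x, y in train if y == label]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

centroids = {0: centroid(0), 1: centroid(1)}

def predict(sample):
    """Classify a reading by its nearest class centroid."""
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

print(predict((96.0, 8.5, 114.0)))  # -> 1: breakage predicted
```

The point is not the classifier, which is deliberately crude, but the workflow: label historical sensor data, learn what "about to break" looks like, then score live readings against it.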

Data scientists are part artist and part engineer. They need a toolbox of techniques, skills, processes and abilities from which to construct novel solutions. And they need the ability to create a user interface that turns their abstract findings into something that the users of the system can understand, so data scientists also need the skills to create elegant visualisations that turn raw data into information. And they need to be able to communicate well with people. There is little use in creating a superb analytical process if you can’t communicate how and why it works to the board members.

And then there is the curiosity. Duncan Ross, director of data sciences at Teradata, characterised data scientists well: “The first and most important trait is curiosity. Insane curiosity. In many walks of life evolution selects against the kind of person who decides to find out what happens 'if I push that button'. Data Science selects for it.”

So, what are the general characteristics of a DS?

They include: insatiable curiosity (see above), interdisciplinary interests, excellent communication skills and excellent analytical capabilities. Data scientists also need a good working knowledge of machine learning techniques, data mining, statistics, maths, algorithm development, code development, data visualisation and multi-dimensional database design and implementation.

Specific skills include the technologies for handling Big Data: NoSQL databases; Hadoop and related technologies; and MapReduce and its implementations on different software platforms. Data scientists also need an intimate knowledge of languages such as SQL, MDX and R, and of functional and object-oriented languages such as Erlang and Java.
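MapReduce itself is a simple pattern, whatever the scale of the machinery built around it. Here is a toy word count in plain Python with both phases in one process; what Hadoop and friends add is the distribution of those phases across a cluster:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit a (key, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def reducer(pairs):
    # Shuffle + reduce phase: group pairs by key, then sum the values.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data big issue", "data scientists analyse data"]
counts = reducer(chain.from_iterable(mapper(l) for l in lines))
print(counts["data"])  # -> 3
```

In a real cluster each mapper runs on a shard of the input and each reducer receives only the keys hashed to it, but the contract - map emits key/value pairs, reduce aggregates per key - is exactly this.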

Data scientists will be required wherever large sets of data need to be analysed. This is true in the scientific world, of course, but that is where the title is somewhat misleading: they are also needed in commercial organisations, in bodies like the NHS, in government departments, in defence and so on.

So where are all the data scientists going to come from? We’ve been "doing" data science at the School of Computing at the University of Dundee, where I am chair of analytics, working with sets of Big Data as diverse as the output from mass spectrometers, image data, web logs, data collected by games companies and so on.

This year, to run in parallel with our existing part-time Masters in BI, we are introducing a part-time Masters in Data Science. Most of the course is remote study because it is specifically designed for people already in employment in the database/analytical world who want to move into data science.

Fashions come and fashions go, but data scientists (whatever they may be called in the future) will endure. They will endure for the simple reasons that data is complex, the patterns within it are valuable, and spotting the patterns is difficult and requires an unusual mix of skills. ®

Mark Whitehorn holds the chair of analytics at the University of Dundee. His role involves working on data output from mass spectrometers, two-dimensional graphical traces of three-dimensional peaks that must be detected and their volumes calculated. The trick isn’t to do the sums; it’s to do them rapidly because another 8Gbyte output file is always coming.
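The peak-detection task described above reduces, in miniature, to something like the following sketch: a synthetic one-dimensional trace, a naive local-maxima scan and trapezoidal areas. Production mass-spec code is far more sophisticated and, as noted, built for speed; every number here is invented for illustration.

```python
import math

# Synthetic trace: two Gaussian peaks on a flat baseline.
xs = [i * 0.1 for i in range(200)]

def gauss(x, mu, sigma, height):
    return height * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

ys = [gauss(x, 5.0, 0.3, 10.0) + gauss(x, 14.0, 0.5, 6.0) for x in xs]

def find_peaks(trace, threshold):
    """Indices that are local maxima above a noise threshold."""
    return [i for i in range(1, len(trace) - 1)
            if trace[i] > threshold and trace[i - 1] < trace[i] >= trace[i + 1]]

def peak_area(xs, ys, centre, half_width):
    """Trapezoidal area under the trace within +/- half_width samples."""
    lo = max(0, centre - half_width)
    hi = min(len(ys) - 1, centre + half_width)
    return sum((ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i])
               for i in range(lo, hi))

peaks = find_peaks(ys, threshold=1.0)
print(len(peaks))  # -> 2
```

The hard part in practice is not this arithmetic but doing it fast enough, over two dimensions rather than one, before the next multi-gigabyte file lands.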
