Tools used by data scientists

Tools that data scientists may use fall into the following categories: cloud infrastructure, data and application integration, master data management, data preparation and processing, analytics, visualization, programming, high performance processing, business intelligence reporting, project management, and social network analysis, along with other domain-specific tools. A data scientist may use one or more of these tools as needed to solve data science problems.

  • Cloud infrastructure: Examples include Apache Hadoop, Spark, Cloudera, Amazon Web Services, Unix shell/awk/gawk, 1010data, Hortonworks, Pivotal, and MapR. Most traditional IT vendors have migrated their services and platforms to support the cloud.
  • Data/application integration: Including Ab Initio, Informatica, IBM InfoSphere DataStage, Oracle Data Integrator, SAP Data Integrator, Apatar, CloverETL, Information Builders, Jitterbit, Adeptia Integration Suite, DMExpress Syncsort, Pentaho Data Integration, and Talend [Review 2016].
  • Master data management: Typical software and platforms include IBM InfoSphere Master Data Management Server, Informatica MDM, Microsoft Master Data Services, Oracle Master Data Management Suite, SAP NetWeaver Master Data Management tool, Teradata Warehousing, TIBCO MDM, Talend MDM, and Black Watch Data.
  • Data preparation and processing: Predictive Analytics Today [Today 2016] lists 29 data preparation tools and platforms, such as Platfora, Paxata, Teradata Loom, IBM SPSS, Informatica Rev, Omniscope, Alpine Chorus, Knime, Wrangler Enterprise, and Wrangler.
  • Analytics: In addition to well-recognized commercial tools including SAS Enterprise Miner, IBM SPSS Modeler and SPSS Statistics, MATLAB, and RapidMiner [RapidMiner 2016], many new tools have been created, such as DataRobot [DataRobot 2016], BigML [BigML 2016], and MLBase [Lab 2016], as well as APIs including the Google Cloud Prediction API [Google 2016b].
  • Visualization: Many free and commercial software packages for visualization are listed in KDnuggets [KDnuggets 2015], such as Interactive Data Language, IRIS Explorer, Miner3D, NETMAP, Panopticon, ScienceGL, Quadrigram, and VisuMap.
  • Programming: In addition to the main languages R, SAS, SQL, Python, and Java, many others are used for analytics, including Scala, JavaScript, .NET, Node.js, Objective-C, PHP, Ruby, and Go [Davis 2016].
  • High performance processing: Wikipedia [Wikipedia 2016a] lists about 40 computer cluster software packages and compares them in terms of their technical performance, such as Stacki, Kubernetes, Moab Cluster Suite, and Platform Cluster Manager.
  • Business intelligence reporting: There are many reporting tools available [Capterra 2016b; Wikipedia 2016c], typical of which are Excel, IBM Cognos, MicroStrategy, SAS Business Intelligence, and SAP Crystal Reports.
  • Project management: Capterra [Capterra 2016a] lists more than 500 software packages and tools for project management, including Microsoft Project, Atlassian, Podio, Wrike, Basecamp, and Teamwork.
  • Social network analysis: Desale [Desale 2015] lists 30 tools for social network analysis (SNA) and visualization, such as Centrifuge, Commetrix, Cuttlefish, Cytoscape, EgoNet, InFlow, JUNG, Keynetiq, NetMiner, Network Workbench, NodeXL, and SocNetV (Social Networks Visualizer).
  • Other tools: A growing number of tools have been developed, or are under development, for domain-specific and problem-specific data science, such as Alteryx and Tableau for tablets; SuggestGrid and Mortar Recommendation Engine for recommender systems [Github 2016b]; OptumHealth, Verisk Analytics, MedeAnalytics, McKesson, and Truven Health Analytics [Technavio 2016] for healthcare analytics; and BLAST, EMBOSS, Staden, THREADER, PHD, and RasMol for bioinformatics.


Note: Excerpted from “Longbing Cao. Data Science: A Comprehensive Overview”.



1. Solutions Review. 2016. Data Integration and Application Integration Solutions Directory. (2016). http://

2. Predictive Analytics Today. 2016. 29 Data Preparation Tools and Platforms. (2016). http://www.

3. RapidMiner. 2016. RapidMiner. (2016).

4. DataRobot. 2016. DataRobot. (2016).

5. BigML. 2016. BigML. (2016).

6. AMP Lab. 2016. MLBase. (2016).

7. Google. 2016b. Google Cloud Prediction API. (2016).

8. KDnuggets. 2015. Visualization Software. (2015).

9. Jessica Davis. 2016. 10 Programming Languages And Tools Data Scientists Used. (2016).

10. Wikipedia. 2016a. Comparison of cluster software. (2016).

11. Capterra. 2016b. Top Reporting Software Products. (2016).

12. Wikipedia. 2016c. List of reporting software. (2016).

13. Capterra. 2016a. Top Project Management Tools. (2016).

14. Devendra Desale. 2015. Top 30 Social Network Analysis and Visualization Tools. (2015). http://www.

15. Github. 2016b. List of Recommender Systems. (2016).

16. Technavio. 2016. Top 10 Healthcare Data Analytics Companies. (2016).