This chapter introduces the reader to data science and describes the major stages of working with data (collect, explore, preprocess, visualize, predict, and infer knowledge). It sets common expectations about what constitutes the data science domain. It also introduces the Anaconda distribution, which is used throughout the book.
Chapter 2. Data Acquisition
Number of pages: 40
This chapter will show the reader how to retrieve data from, and store data to, various data sources: text files (including formats such as CSV, XML, and JSON), binary files (including Apache Avro), Web-accessible data, relational databases, NoSQL databases, Apache Arrow (an efficient columnar in-memory data format), multi-modal databases, and network databases. This chapter will also introduce BeautifulSoup for working with XML and HTML.
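As a rough illustration of the text formats named above, the following minimal sketch parses the same records from CSV, JSON, and XML using only the Python 3 standard library. The sample data, the field names (`name`, `score`), and the helper functions are hypothetical and invented for this example; they are not taken from the book.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records, invented purely for illustration.
CSV_TEXT = "name,score\nAda,90\nLinus,85\n"
JSON_TEXT = '[{"name": "Ada", "score": 90}, {"name": "Linus", "score": 85}]'
XML_TEXT = (
    "<people>"
    '<person name="Ada" score="90"/>'
    '<person name="Linus" score="85"/>'
    "</people>"
)


def load_csv(text):
    """Parse CSV text into a list of dicts, converting score to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"name": row["name"], "score": int(row["score"])} for row in reader]


def load_json(text):
    """Parse JSON text; the numbers are already typed."""
    return json.loads(text)


def load_xml(text):
    """Parse XML text, reading each record from element attributes."""
    root = ET.fromstring(text)
    return [
        {"name": person.get("name"), "score": int(person.get("score"))}
        for person in root.iter("person")
    ]


if __name__ == "__main__":
    for loader, text in ((load_csv, CSV_TEXT), (load_json, JSON_TEXT), (load_xml, XML_TEXT)):
        print(loader.__name__, loader(text))
```

All three loaders return the same list of dictionaries, which shows that the choice of format mainly affects parsing and type conversion rather than the resulting in-memory structure.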
Chapter 3. Basic Data Processing
Number of pages: 40
This chapter covers standard Python libraries for scientific computing and data processing. NumPy provides the core data structures required during data analysis. Here, we will provide examples that illustrate the importance of sophisticated frameworks and reuse-based software engineering in the realm of data science.
Chapter 4. Documenting Work
Number of pages: 20
This chapter introduces the most popular computing environment for data analysis. It makes it possible for data scientists to share results in an easily reproducible manner.
Chapter 5. Transformation and Packaging of Data
Number of pages: 30
This chapter illuminates a critical data science framework that is built upon NumPy. It provides excellent data structures for handling data frames and series.
Chapter 6. Visualization
Number of pages: 40
This chapter introduces various ways to visualize data, since summary statistics and tabular representations are of limited value in exploring data. The following frameworks will be the topics of this chapter: matplotlib, glueviz, Bokeh, and orange3. Visualization is important both for exploratory analysis and for generating effective reports.
Chapter 7. Prediction and Inference
Number of pages: 50
This chapter will cover techniques and technologies for properly scaling data science efforts. It will teach readers how to create systems that can formulate answers on unseen data or find hidden patterns in data. It will cover supervised, unsupervised, deep, and reinforcement learning methods. Moreover, it will introduce Apache Spark with MLlib (in both batch and streaming modes) as well as TensorFlow. The following frameworks will also be topics of this chapter: XGBoost, scikit-learn, and Keras with PyTorch.
Chapter 8. Network Analysis
Number of pages: 40
This chapter explores ways to analyze complex networks and graphs. It will introduce Apache Spark GraphX, Apache Giraph, and NetworkX, as well as spectral graph analysis, which is an interesting approximate, non-linear, and non-parametric machine learning method.
Chapter 9. Data Science Process Engineering
Number of pages: 20
This chapter will explain how teams can share and customize their data science practices and methods via OMG Essence.
Chapter 10. Multi-agent Systems, Game Theory and Machine Learning
Number of pages: 30
This chapter explores advanced data-oriented applications, where data are produced and consumed by self-governed intelligent agents. The chapter introduces the reader to the concept of multi-agent systems, game-theoretic methods and models, and the associated learning algorithms.
Chapter 11. Probabilistic Graphical Models
Number of pages: 30
This chapter explains probabilistic graphical models, the most sophisticated form of graph structure covered in the book, which can model many advanced data science problems. Nodes in the graph denote random variables, while the links represent relations between those variables. This chapter equips the reader with a method that may be used when simpler solutions aren’t satisfactory.
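To make the node-and-link description concrete, here is a minimal sketch of a two-node model in plain Python 3. The variables (`Rain`, `WetGrass`) and all probability values are hypothetical and chosen only for illustration; inference is done by brute-force enumeration rather than with any dedicated library.

```python
# A two-node graphical model: Rain -> WetGrass.
# Each node is a random variable; the link carries a conditional
# probability table. All numbers below are hypothetical.

P_RAIN = {True: 0.2, False: 0.8}

# P(WetGrass = wet | Rain = rain), indexed as [rain][wet].
P_WET_GIVEN_RAIN = {
    True: {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}


def joint(rain, wet):
    """Joint probability P(Rain = rain, WetGrass = wet) from the factorization."""
    return P_RAIN[rain] * P_WET_GIVEN_RAIN[rain][wet]


def posterior_rain(wet):
    """P(Rain = True | WetGrass = wet), computed by enumerating Rain."""
    evidence = sum(joint(rain, wet) for rain in (True, False))
    return joint(True, wet) / evidence


if __name__ == "__main__":
    print(posterior_rain(True))
```

With these assumed numbers, observing wet grass raises the probability of rain from the prior of 0.2 to roughly 0.69. Larger models follow the same idea, but enumeration becomes expensive, which is why specialized inference algorithms exist.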
Chapter 12. Security in Data Science
Number of pages: 20
This chapter presents techniques to anonymize data and to handle situations in which learning methods must cope with adversarial modifications (a.k.a. adversarial machine learning). This chapter also covers ways to protect data both in transit and at rest.
Appendix A - Crash Course in Python 3
Number of pages: 20
This appendix will briefly teach readers Python 3 and explain why Python 3 is an excellent choice for doing data science.
Ervin Varga is a Senior Member of IEEE and Professional Member of ACM. He is an IEEE Software Engineering Certified Instructor. Ervin is an owner of the software consulting company Expro I.T. Consulting, Serbia. He has an MSc in computer science, and a PhD in electrical engineering (his thesis was an application of software engineering and computer science in the domain of electrical power systems). Ervin is also a technical advisor of the open-source project Mainflux.
Gain insight into essential data science skills in a holistic manner using data engineering and associated scalable computational methods. This book covers the most popular Python 3 frameworks for both local and distributed (on-premises and cloud-based) processing. Along the way, you will be introduced to many popular open-source frameworks, such as SciPy, scikit-learn, Numba, and Apache Spark. The book is structured around examples, so you will grasp core concepts via case studies and Python 3 code.
As data science projects get continuously larger and more complex, software engineering knowledge and experience are crucial to produce evolvable solutions. You'll see how to create maintainable software for data science and how to document data engineering practices.
This book is a good starting point for people who want to gain practical skills to perform data science. All the code will be available in the form of IPython notebooks and Python 3 programs, which allow you to reproduce all analyses from the book and customize them for your own purpose. You'll also benefit from advanced topics like Machine Learning, Recommender Systems, and Security in Data Science.
Practical Data Science with Python will empower you to analyze data, formulate proper questions, and produce actionable insights, three core stages in most data science endeavors.