Data Science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and applies those insights across a broad range of application domains.
Data science is about data gathering, analysis and decision-making. It is also about finding patterns in data in order to make predictions. Several excellent, freely available data science curricula are online.
https://developers.google.com/learn/topics/datascience
https://towardsdatascience.com/is-data-science-really-a-science-9c2249ee2ce4
https://github.com/datasciencemasters/go
https://www.unifyingdatascience.org/html/index.html
A data science-driven organization is an entity that maximizes the value from the data available while using machine learning and analytics to create a sustainable competitive advantage.
Data engineers build and maintain the systems that allow data scientists to access and interpret data. The role generally involves creating data models, building data pipelines and overseeing ETL (extract, transform, load).
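The ETL steps mentioned above can be illustrated with a minimal sketch. It assumes a tiny made-up CSV of orders and uses only the Python standard library (`csv` and an in-memory SQLite database); the table name, column names and cleaning rule are hypothetical, and a real pipeline would typically read from files, APIs or a data warehouse instead.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,alice,5.01
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records (assumed cleaning rule)
        cleaned.append(
            (int(row["order_id"]), row["customer"], float(row["amount"]))
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# Downstream users (analysts, data scientists) can now query the loaded table.
totals = conn.execute(
    "SELECT customer, ROUND(SUM(amount), 2) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping extract, transform and load as separate functions makes each stage easy to test and replace on its own, which is one common way to structure small pipelines.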
What is the difference between Data Analytics and Data Science? What is the difference between Data Science and Data Engineering?
https://cloud.google.com/training/data-engineering-and-analytics
The GCP smart analytics platform can help strip out layers of complexity and analyze data to solve problems in a broad range of applications, such as anomaly detection, data monetization, general analytics, log analytics, pattern recognition, predictive forecasting, real-time clickstream analytics, time-series analytics and working with data lakes.
These topics require a diverse range of knowledge and skills from many disciplines. The need to distinguish data engineers from analysts and scientists diminishes when faced with such a multidisciplinary scope of work.
AI Platform is a development platform to build AI apps that run on Google Cloud and on-premises. Take your ML projects to production quickly and cost-effectively.
AI Platform training with built-in algorithms.
AI Hub offers a collection of components for developers and data scientists building artificial intelligence (AI) systems.
https://www.youtube.com/watch?v=XXvFHqLv9p8
Document AI lets you unlock insights from documents with machine learning. Google Cloud’s Vision OCR (optical character recognition) and form parser technology uses industry-leading deep-learning neural network algorithms to perform text, character, and image recognition in over 200 languages with exceptional accuracy. Using the same deep machine learning technology that powers Google Search and Assistant, Google Cloud’s Document AI products enable you to derive valuable insights from your unstructured documents.
Dialogflow is a natural language understanding platform that makes it easy to design and integrate a conversational user interface into your mobile app, web application, device, bot, interactive voice response system, and so on. Using Dialogflow, you can provide new and engaging ways for users to interact with your product.
Machine Learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
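As a small illustration of "learning from data", the sketch below fits a straight line to a handful of points by ordinary least squares and uses it to predict an unseen value. The data (hours studied versus exam score) is made up for the example, and the helper name `fit_line` is hypothetical; real projects would normally use a library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: hours studied -> exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)

# Use the learned parameters to predict the score for 6 hours of study.
predicted = slope * 6 + intercept
print(f"slope={slope:.2f} intercept={intercept:.2f} predicted={predicted:.2f}")
```

The model here has only two parameters, but the pattern is the same as in larger systems: parameters are estimated from examples, then applied to inputs the model has not seen.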
Machine Learning Engineering and Data Science share common areas of interest, along with some differences.
AutoML Tables enables your entire team to automatically build and deploy state-of-the-art machine learning models on structured data at massively increased speed and scale.
Time-series analysis is essential for day-to-day operation of many companies. Most popular use cases include analyzing foot traffic and conversion for retailers, detecting data anomalies, identifying correlations in real time over sensor data, or generating high-quality recommendations. With Cloud Inference API, you can gather insights in real time from your typed time-series datasets.
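One of the use cases above, detecting anomalies in a time series, can be sketched in a few lines. This is not the Cloud Inference API; it is a generic trailing-window z-score check written with the Python standard library, and the window size, threshold and sensor readings are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` values by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up sensor readings with one obvious spike at index 6.
readings = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 35.0, 20.0, 20.2, 19.8]
print(find_anomalies(readings))  # prints [6]
```

Note that the spike inflates the standard deviation of the windows that follow it, so points right after an anomaly are judged more leniently; production systems often use more robust statistics for that reason.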
The scikit-learn Python library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive ecosystem of tools and libraries to build and deploy ML-powered applications.
Compute Engine provides graphics processing units (GPUs) that you can add to your virtual machine instances. You can use these GPUs to accelerate specific workloads on your instances such as machine learning and data processing.
https://www.youtube.com/watch?v=jUZhe1aTnFk
Tensor Processing Units (TPUs) are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs are designed from the ground up with the benefit of Google’s deep experience and leadership in machine learning.
https://www.youtube.com/watch?v=2kSo7Az4ZOs
Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.
Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models.
Colab supports Jupyter notebooks, which allow you to combine executable code and rich text.
Jupyter Notebooks combine code, data and visualizations for reproducible analytics.
Cookiecutter Data Science template is a logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
You may become a more efficient, practical and productive data scientist by learning to leverage the power of the command line.
https://www.datascienceatthecommandline.com/
Some CLI tools can be useful in data science.
GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.
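To give a feel for the kind of operation datamash performs, here is a Python sketch of a grouped mean over tab-separated text, roughly what `datamash -s -g 1 mean 2` computes on the same input. The input data and the function name `group_mean` are made up for illustration; this is an analogue, not a reimplementation of datamash.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical tab-separated input: a group label and a numeric value.
TSV = "a\t1\nb\t4\na\t3\nb\t6\n"


def group_mean(text):
    """Group rows by the first column and average the second column."""
    groups = defaultdict(list)
    for key, value in csv.reader(io.StringIO(text), delimiter="\t"):
        groups[key].append(float(value))
    # Sort by key so the output order is deterministic.
    return {key: mean(vals) for key, vals in sorted(groups.items())}


result = group_mean(TSV)
print(result)  # prints {'a': 2.0, 'b': 5.0}
```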
GNU parallel is a shell tool for executing jobs in parallel using one or more computers.
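The idea behind GNU parallel, running many independent jobs at once and collecting their output, can be approximated on a single machine in Python. The sketch below is only an analogue under stated assumptions: each "job" is a small child Python process that squares a number, the helper name `run_job` is hypothetical, and nothing here distributes work across several computers as GNU parallel can.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(n):
    """Run one independent job as a separate process and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", f"print({n} ** 2)"],
        capture_output=True,
        text=True,
        check=True,  # raise if the child process fails
    )
    return int(result.stdout.strip())


inputs = [1, 2, 3, 4]

# Threads are enough here because each one just waits on a child process.
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order, even if jobs finish out of order.
    squares = list(pool.map(run_job, inputs))

print(squares)  # prints [1, 4, 9, 16]
```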
Awk is a record-processing tool written by Aho, Weinberger, and Kernighan in the 1970s. The name AWK is formed from the initials of their surnames. Data scientists have rediscovered awk recently.
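Awk's model is to read input one record at a time, split each record into fields, and run an action on the records that match a pattern. The Python sketch below mirrors that model on a made-up access log (user, status code, bytes); it corresponds roughly to the awk program `$2 == 200 { total += $3 } END { print total }`. The log contents and function name are assumptions for illustration.

```python
# Hypothetical log: user, HTTP status, bytes sent.
LOG = """alice 200 512
bob 404 0
carol 200 2048
dave 500 0
"""


def sum_bytes_for_ok(text):
    """Sum the third field of every record whose second field is 200."""
    total = 0
    for record in text.splitlines():      # awk: one record per line
        fields = record.split()           # awk: whitespace-separated $1, $2, ...
        if len(fields) < 3:
            continue                      # ignore malformed or empty records
        if fields[1] == "200":            # awk pattern: $2 == 200
            total += int(fields[2])       # awk action: total += $3
    return total


total_bytes = sum_bytes_for_ok(LOG)
print(total_bytes)  # prints 2560
```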
Data lineage uncovers the life cycle of data—it aims to show the complete data flow, from start to finish. Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption. This includes all transformations the data underwent along the way—how the data was transformed, what changed, and why.
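A minimal way to picture lineage recording is a wrapper that logs every transformation applied to a dataset, including what changed and why. The sketch below is a toy under stated assumptions: the class `TrackedData`, its fields, the source name and the sample rows are all hypothetical, and dedicated lineage tools capture far more (schemas, jobs, owners, timestamps).

```python
from dataclasses import dataclass, field


@dataclass
class TrackedData:
    rows: list
    source: str                       # where the data originally came from
    lineage: list = field(default_factory=list)

    def apply(self, name, func, reason):
        """Apply a transformation and return new data with an extended lineage."""
        new_rows = func(self.rows)
        step = {
            "step": name,             # how the data was transformed
            "reason": reason,         # why it was transformed
            "rows_in": len(self.rows),
            "rows_out": len(new_rows),  # what changed, in a crude form
        }
        return TrackedData(new_rows, self.source, self.lineage + [step])


# Made-up sensor export containing a missing value and a duplicate.
raw = TrackedData([3, None, 7, 7, 10], source="sensor_export.csv")

clean = raw.apply(
    "drop_nulls",
    lambda rows: [r for r in rows if r is not None],
    "missing readings break averaging",
)
deduped = clean.apply(
    "dedupe",
    lambda rows: sorted(set(rows)),
    "duplicate readings were logged twice",
)

for step in deduped.lineage:
    print(deduped.source, step)
```

Because `apply` returns a new object instead of mutating the old one, every intermediate dataset keeps its own history, which makes it possible to trace a final value back to the source step by step.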
https://www.keboola.com/blog/data-lineage-tools
https://www.guru99.com/data-science-tutorial.html
https://towardsdatascience.com/mathematics-for-data-science-e53939ee8306
https://www.kdnuggets.com/2022/02/mlm-hidden-building-block-machine-learning.html
Calculus 1C: Coordinate Systems & Infinite Series
https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/
http://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/index.htm
Intro to Descriptive Statistics
Intro to Inferential Statistics
https://www.kdnuggets.com/2022/02/complete-collection-data-science-cheat-sheets-part-1.html
https://www.kdnuggets.com/2022/03/best-data-science-books-beginners.html
https://www.kdnuggets.com/2022/03/build-machine-learning-web-app-5-minutes.html
https://medium.com/talabat-tech/data-apps-from-local-to-live-in-10-minutes-a886d5453c7
- https://cloud.google.com/vertex-ai/docs
- https://cloud.google.com/training/machinelearning-ai#data-scientist-learning-path
- https://codelabs.developers.google.com/?cat=machinelearning
- https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/notebooks/official/pipelines
- https://cloud.google.com/blog/topics/developers-practitioners/lets-get-it-started-triggering-ml-pipeline-runs
- https://cloud.google.com/blog/products/ai-machine-learning/building-the-data-science-driven-organization
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://www.w3schools.com/datascience/
- https://www.tutorialspoint.com/python_data_science/index.htm
- https://www.classcentral.com/course/data-science-crash-course-4392
- https://towardsdatascience.com/10-resources-to-learn-data-science-on-google-cloud-c19fb3033df5