What is Data Science?
It's an evolving field and this could be a topic in itself, so I'll defer to UC Berkeley here:
https://datascience.berkeley.edu/about/what-is-data-science/
"The term 'data scientist' was coined as recently as 2008 when companies realized the need for data professionals who are skilled in organizing and analyzing massive amounts of data."
"Effective data scientists are able to identify relevant questions, collect data from a multitude of different data sources, organize the information, translate results into solutions, and communicate their findings in a way that positively affects business decisions. These skills are required in almost all industries, causing skilled data scientists to be increasingly valuable to companies."
"The image represents the five stages of the data science life cycle: Capture, (data acquisition, data entry, signal reception, data extraction); Maintain (data warehousing, data cleansing, data staging, data processing, data architecture); Process (data mining, clustering/classification, data modeling, data summarization); Analyze (exploratory/confirmatory, predictive analysis, regression, text mining, qualitative analysis); Communicate (data reporting, data visualization, business intelligence, decision making)."
There's a lot more at the link.
I'm coming from the data warehousing side and have started to grow into data engineering, working with Hadoop and a lot of Python. Rather than trying to cover everything, I'll post what I've personally found useful to get things started and will add any other links people would like to include (e.g. for the R side of things, BI tools, etc.).
Resources
Andrew Ng's free machine learning course at Coursera
https://www.coursera.org/learn/machine-learning
SIGGRAPH Deep Learning Crash Course
https://youtu.be/r0Ogt-q956I
MIT's intro to deep learning 2019 session: http://introtodeeplearning.com/
2018 archive: http://introtodeeplearning.com/2018/index.html
Calculus basics by 3Blue1Brown (re: prerequisites for MIT course above)
https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr
Probability and statistics
https://seeing-theory.brown.edu/
Learning SQL
DataCamp: https://www.datacamp.com/courses/intro-to-sql-for-data-science
Udemy: https://www.udemy.com/70-461-session-2-querying-microsoft-sql-server-2012/
Kaggle
https://www.kaggle.com/
Companies post competitions with prize money here and people compete for the best solutions. There are lots of resources, including exercises and examples to learn from, and a large community. XGBoost has been part of many winning solutions. Perhaps surprisingly, it's an ensemble approach (gradient-boosted decision trees) rather than deep learning, but it is well suited to the structured, tabular data that many challenges are built around.
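To make the "ensemble rather than deep learning" point a bit more concrete, here's a rough sketch of the boosting idea: lots of very weak models (one-split "stumps"), each one fitted to the errors left over by the ones before it. This is not XGBoost's actual implementation, just a toy version using only the standard library. The function names and the data are made up for illustration.

```python
# Toy gradient boosting for regression with one-split "stumps".
# Illustrative only: hypothetical names, made-up data, not XGBoost's algorithm.

def fit_stump(xs, residuals):
    """Pick the threshold that best splits the residuals into two means."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining error."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: y = x squared on ten points.
xs = list(range(10))
ys = [x * x for x in xs]

for rounds in (0, 1, 10, 50):
    model = fit_boosted(xs, ys, n_rounds=rounds)
    print(rounds, "rounds -> training MSE", round(mse(model, xs, ys), 3))
```

The training error drops as more stumps are added, which is the whole trick: no single stump is any good, but the sum of them is. Real libraries add a lot on top of this (deeper trees, regularization, handling of many features), so treat it as intuition only.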
Python-specific machine learning resources
The Complete Machine Learning Course with Python
https://www.udemy.com/machine-learning-course-with-python/
Good focus on machine learning, but assumes some familiarity with Python. Covers some of the same content as Andrew Ng's free course at Coursera. I like this one better because it's newer, has better-quality video, and focuses more on implementation with examples.
Dataquest
http://dataquest.io
Free hands-on introduction using Python. Has an optional subscription with more content and access to support.
Codecademy data science path:
https://www.codecademy.com/learn/paths/data-science
Subscription required for this content. The site also has lots of free resources for general development.
Data Analysis with Pandas and Python
https://www.udemy.com/data-analysis-with-pandas/
General Python:
The Modern Python 3 Bootcamp
https://www.udemy.com/the-modern-python3-bootcamp/
Python Cookbook
https://www.amazon.com/Python-Cookbook-Third-David-Beazley/dp/1449340377
Advanced material (modern patterns, aimed at programmers)
R
R for Data Science
https://r4ds.had.co.nz/
The Tidyverse packages:
https://www.tidyverse.org/packages/
Spark
(Pending)
YouTube
Google has been putting out a ton of content for TensorFlow lately:
https://www.youtube.com/channel/UC0rqucBdTuFTjJiefW5t-IQ/videos
Siraj Raval: https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A/videos
Has produced many compact videos on interesting topics around this space, for example some creative applications of generative adversarial networks (GANs).
3Blue1Brown's deep learning series, focusing on the math behind neural networks
https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
Enthought: https://www.youtube.com/user/EnthoughtMedia/videos
Presentations from the SciPy conferences.
Tools
Anaconda
https://www.anaconda.com/
A popular Python distribution which preinstalls many common libraries and tools like Jupyter Notebook.
Power BI
https://powerbi.microsoft.com/
A free desktop program which lets you import and visualize data from various sources.
Visual Studio Code
https://code.visualstudio.com/
A great free IDE with plugins for many common languages.
Enterprise BI Platforms
Tableau, Business Objects, Power BI (enterprise), Spotfire, SAP Analytics Cloud, QlikView, Alteryx...
RapidMiner
https://rapidminer.com
Platform focused on data science. Has a free educational license.
Cool Threads
"I trained an AI on tens of thousands of ResetEra post titles and discovered how the world ends"
https://www.resetera.com/threads/82679/