Data Science Era |OT| Desktop BI, Deep Learning and Everything in Between

Tebunker · Mar 7, 2019

Yeah it was a report designer, we do some analysis in our group now, but the nature of our structure and using Data Warehousing is different, and you can still do raw SQL pulls against it, we just happen to have several BI tools that make it so you don't have to. We can't teach our 5000 users SQL and most of them don't have the time. So we are reliant on doing good ETL and making sure we work hand in hand with the business to ensure we encompass all of their data needs. It is still a tried and true DW model.

A lot of the time our data marts are just massive relational star-schema DBs with tons of data and a lot of ways to dive in to it. Other times it is already very sliced up due to business rules.

I get what you are saying for sure though and I agree, you want to avoid a lot of assumptions. Especially going forward in analytics you don't want a lot of those presumptions. And when I say I have an ETL dev create a view, it is literally just, get me these four tables from DB A, these four from DB B and these 6 from DB C and give me these joins. It is what the guy does all day and he can turn around a MV real quick and then I can go in and pull in that data to my tool and do the actual analysis with no preconceived assumptions.

Also, just in general, I feel like the more and more I get exposed to new concepts and the way other companies operate in Data Science and Analytics, the more I think my current company isn't completely doing it right or completely understanding new concepts. I want to keep building and growing in this, but man it feels like they are missing the boat.

maxxpower · Mar 7, 2019

I'm questioning whether I should continue my solo learning path on data science. Over the past two years I've learned and practice enough data analysis and machine learning to get a job but I have no intention of getting a Master's or PhD nor do I have the time or money. It sucks because I love this field but I want to make a career out of it. I love how easy it is to learn on your own.

Pau · Mar 13, 2019

This past month has been a whirlwind.

I got rejected at one program, then accepted into three, and now I'm waiting on two more.

I honestly thought I would get accepted into one at most, so this is incredibly surprising. This is going to be such a hard decision to make, but the good kind of hard. I have until mid-April to decide!

Anyone have experience with or considering Masters in Data Science at Harvard, New York University, or University of Washington?

maxxpower said:
I'm questioning whether I should continue my solo learning path on data science. Over the past two years I've learned and practice enough data analysis and machine learning to get a job but I have no intention of getting a Master's or PhD nor do I have the time or money. It sucks because I love this field but I want to make a career out of it. I love how easy it is to learn on your own.

What's stopping you from applying to (and eventually taking on) data science jobs?

Totakeke · Mar 13, 2019

Definitely pick NYU if you want to do cutting edge machine learning like deep learning stuff. They have both the faculty (Yann LeCun) and the connections. Not sure about Harvard but program looked quite technical last time I looked at it. No idea about University of Washington.

maxxpower · Mar 13, 2019

Pau said:
What's stopping you from applying to (and eventually taking on) data science jobs?

I have the practical knowledge and experience but I feel like my lack of a formal education would essentially disqualify me from any data science position.

Totakeke · Mar 13, 2019

maxxpower said:
I have the practical knowledge and experience but I feel like my lack of a formal education would essentially disqualify me from any data science position.

The amount of people with "formal" data science education is pretty small, since most of the data science programs are pretty new. Most data scientists came from and are still coming from other fields. So if your experience or educational background had a good amount of exposure to statistics and programming, then you're in decent shape. Of course experience in this field are highly valued, but as long you're not picky and you really want to enter the field, it's definitely doable. You just have to think about how to showcase your interest in your resume to stand apart from all the other people also applying for the same job. That could involve talking about some Kaggle competition you attempted or other kinds of data science exploration you did by yourself.

On the flip side, data science in a lot of smaller and medium-sized companies still requires a lot of analytics skills (which isn't really taught by any course and education) and experimentation skills are often far more useful compared to machine learning skills. That fact may be something the companies don't even realize themselves as they attempt to build their data science teams. Can't argue about the wealth of opportunities that exist though, definitely worth it if you're interested unless your current field is already pretty nice and comfy.

Blu10 · Mar 15, 2019

maxxpower said:
I have the practical knowledge and experience but I feel like my lack of a formal education would essentially disqualify me from any data science position.

I suspect most teams are like my team, in that they have a wide variety of roles and levels. I hired a guy with no experience in enterprise analytic tools (he knew sql) into a junior role a couple years ago, and he has absolutely blossomed. He'll get his 3rd promotion this year, and will probably be my boss in under 5 years.

This is a path my team continues to follow today as junior positions open up. While it might not always be as successful as it was with that one guy, it is also a path I took when I joined the team. Don't count yourself out, just look for the right analytic position, on the right team, and you'll get your foot in the door.

Tebunker · Apr 9, 2019

Does anyone use Power Pivot, Solver or Power Query in Excel? I am going through a screening call and these are some of the skills they want, and I am just sitting here wondering why not just use something like Power BI, and yes I get that money and costs etc, but this is a somewhat large Credit Union, they could afford a modern BI tool.

disillusion386 · Apr 9, 2019

Tebunker said:
Does anyone use Power Pivot, Solver or Power Query in Excel? I am going through a screening call and these are some of the skills they want, and I am just sitting here wondering why not just use something like Power BI, and yes I get that money and costs etc, but this is a somewhat large Credit Union, they could afford a modern BI tool.

It's not so much the cost of the BI tool, but rather the cost of having to train your workforce on how to use that tool. It's annoying where I work too because people would rather use more limited software than adopt something that has a lot more capabilities.

Supreme Leader · Apr 13, 2019

What are some recommended resources for learning Power BI outside of the ones in the OT

Tebunker · Apr 13, 2019

Supreme Leader said:
What are some recommended resources for learning Power BI outside of the ones in the OT

Free resourcea are a little tougher to come by, there are several Lynda/Linkedin courses that are worth pursuing. I've just kind of felt like Power Bi's community hasn't quite as grown like some other tools and it makes getting a lot of community support harder.

That should be changing with more Adoption.

I believe if you have some power pivot/power query stuff it can be applied to power bi too.

Supreme Leader · Apr 13, 2019

Tebunker said:
Free resourcea are a little tougher to come by, there are several Lynda/Linkedin courses that are worth pursuing. I've just kind of felt like Power Bi's community hasn't quite as grown like some other tools and it makes getting a lot of community support harder.

That should be changing with more Adoption.

I believe if you have some power pivot/power query stuff it can be applied to power bi too.

Thanks for your input. I'll do a little more research

ieandrew · Apr 13, 2019

Graduating with M.S. Data Science next month and so glad this topic was created!

Pau · Apr 13, 2019

ieandrew said:
Graduating with M.S. Data Science next month and so glad this topic was created!

Congratulations! Any tips for people starting next fall? What were your highlights?

impingu1984 · Apr 15, 2019

Subbing to this thread... Didn't know it existed...

Working currently as a data scientist, been working in analytics for the past 7 years or so, knowledge of SQL and R and currently use Alteryx (the best piece of software I have ever used frankly) and PowerBI, have used tableau in the past as well.

May start learning python at some point.

BTW I have no degree at all simply have an aptitude for it and have managed demonstrate this to get the job I have now. So despite many places asking for one, personally I say you can get by without one.... Although it is the harder route no doubt.

impingu1984 · Apr 15, 2019

Supreme Leader said:
What are some recommended resources for learning Power BI outside of the ones in the OT

EnterpriseDNA has some good PowerBI resources, some of which is free...

Also learn DAX. It can be applied to powerpivot etc as well...

To do really cool stuff in PowerBI dax knowledge is essential.

Deleted member 6582 · Apr 15, 2019

Aside from some computational efficiencies is the any reason to not just use neural networks for everything? Like, are there real cases of more traditional methods yielding better accuracy anymore? I feel compute power is pretty much at the point I shouldn't worry about anything else.

Irnbru · Apr 15, 2019

Somnid said:
Aside from some computational efficiencies is the any reason to not just use neural networks for everything? Like, are there real cases of more traditional methods yielding better accuracy anymore? I feel compute power is pretty much at the point I shouldn't worry about anything else.

It might not be the right model for everything, depending on the item a ensemble method might yield better results while understanding the math better. I find neural networks to be very black box. Very powerful though.

Also, cost of compute power for large companies is still a very large thing

impingu1984 · Apr 15, 2019

Somnid said:
Aside from some computational efficiencies is the any reason to not just use neural networks for everything? Like, are there real cases of more traditional methods yielding better accuracy anymore? I feel compute power is pretty much at the point I shouldn't worry about anything else.

Firstly computational efficiency is a extremely important factor in a decision as to what kind of implementation solution to a problem you're going to use, I don't feel you can just set that aside.

I recently setup a naive Bayes classifier that takes a couple of mins to train from 2 million records and predict a binary outcome for 250k other records with 85% accuracy Vs a no info 55% accuracy, and it took a day to setup.

Could we get better results? possibly, but it offers good enough results in a short space of time and can be fully trained so quickly it is practically easy to run on a whim.

That being said a lack of data is a good reason to use a more traditional machine learning algorithm or technique, again my simple naive Bayes classifier works well with limited subsets of training data.

Also my naive Bayes classifier offers great insight into what attributes effect the outcome, even ones that aren't observed often, it's hard to observe this in a neural net

But it's extremely limiting to just use neural networks for everything... And sometimes the simple solutions work well enough. Don't limit yourself just because of the new sexy hotness...

erd · Apr 15, 2019

Somnid said:
Aside from some computational efficiencies is the any reason to not just use neural networks for everything? Like, are there real cases of more traditional methods yielding better accuracy anymore? I feel compute power is pretty much at the point I shouldn't worry about anything else.

I'm hardly an expert on the topic, but from my knowledge there's a bunch of stuff that might make neural networks sub-optimal. For example, neural networks don't tend to perform as well without a large amount of data, which can be a very real problem in a lot of cases. People are constantly coming up with ways to go around that, of course, but if you have some niche use case that hasn't been explored yet and don't have a lot of data, you might be better off just using another, off-the-shelf classifier. It will definitely be a lot less of a hassle.

Interpretability is another big reason. This doesn't matter if you're only interested in the raw accuracy, but there are a lot of cases where you also want to know exactly how your classifier arrived at a conclusion. Something like an AI to replace a judge, for instance, should be able to clearly explain why it arrived at the sentence it did. While there are ways to add interpretability to neural networks, they might not be enough in a lot of cases. Taking a hit to accuracy is justified in cases like that.

Another reason is that neural networks don't work equally well for all types of data. Convolutional neural networks work incredibly well on images since they are able to exploit information about pixel positions, and similarly, RNNs and LSTMs work well on text and sequences since they take into account the positions and distances between characters/points. On the other hand, something like tabular data isn't as easy to work with for them. From what I've seen written online, more traditional approaches like XGBoost and various ensembles still achieve great results on Kaggle competitions (and win quite often, apparently), which mostly have data like that.

You might also have to fall back on more traditional approaches when dealing with unlabelled data. If someone just throws some data at you, with no labels, and tells you to figure out what to do with it (which isn't an unrealistic scenario), just throwing that into a neural network will not tell you anything about the data.

They can also be a lot of work in cases. For example, for game AI it's much easier to just use non-neural-network based methods. AlphaZero is technically the best chess AI, but it took an incredible amount of effort to create and an eternity to train (most companies simply don't have the resources to do something like that in a reasonable time-frame). Meanwhile, a simple search-based approach is still enough to beat every human. So you likely won't be seeing stuff like that in every games any time soon, even if some companies are working on it.

From a research point of view, limiting ourselves to only neural networks also isn't the best idea. Despite their impressive results, they are still fundamentally flawed and likely won't lead to something like artificial general intelligence. Focusing on things that aren't quite as good now but might be better in the future should still be done. That's exactly what happened with neural networks as well: they went from a discarded piece of technology no-one wanted to use to the biggest hotness in AI because some people remained working on them.

Neural networks are still super cool though. The above examples probably aren't even 100% true. I'm sure you could find examples of NNs working well on limited data, or being perfectly interpretable on a given domain, or working super well on completely unstructured data, or stuff like that.

Totakeke · Apr 15, 2019

Eridani gave a pretty good set of answers. The realm of problems where deep learning should be the best solution is also simply a subset of all the problems you could potentially solve with machine learning. Within the requirement of large amounts of data for deep learning, it is also implicit that the problem space needs to be something that has the right answers that don't change much with time. A photo with a cat in it decades ago is still a photo with a cat today, and there's not much externalities that your dataset doesn't capture that might affect that conclusion. Also, the number of companies that have the data collection and technical resources to solve problems within that subset is also pretty small.

Not so much when you're predicting stock market trends, providing recommendations, or fighting fraud. There might be a lot of things that your dataset doesn't capture or it's just too much work to capture all the possible factors that might affect the results. Things always change and using less of your data may in fact provide better results. Overfitting is always a problem, there's seldom a right model for it and it's relatively easy to just try all sorts of different models before you rule any of them out.

Also data science projects tend to be an iterative process, your first attempt will often be far from the ideal solution. Having longer iteration times by using needlessly expensive models will just slow down your iterations. Also when you don't have interpretability and there are some obvious flaws with your model results, it becomes harder to diagnose why the model is failing the way that it is, or maybe you don't even know that your model has data leakage issues because people tend to not spend a lot of time looking through the model results.

Clay · Apr 20, 2019

I have a Master's degree in econ and I'm trying to get into data science. I've had Data Analyst positions in the past but I did very basic stuff, basically plotting time series of employment, educational attainment, and other demographic data. I also know some basic programming, most in Stata.

I took some high-level stats courses but I'm pretty rusty since I don't use it on a daily basis. I recently bought a few math review books (stats, linear algebra, calc) and I've been teaching myself Python, which is going well.

I currently work a couple part-time jobs that aren't data-related at all. I loved stats and working with data in school but after graduating the jobs I found basically amounted to creating simple graphics in Excel, which was extremely boring. I'd love to get back into working with data but in a more stimulating role. My worry is that my resume will look horrible to potential employers since I never took college classes about programming, machine learning, data scraping, and other concepts that seem to be key to Data Scientist positions.

I've been looking through the resources in the OP but I wonder whether there are any certificates or licenses that would be useful to have. I've seen there are different certificates you can earn to prove you know how to use Excel or whatever but I'm always skeptical about how impressive they are. Am I wrong about this? Would it be useful to pursue certificates that show I know Python, SQL, or any other relevant concepts/ skills?

Totakeke · Apr 21, 2019

Clay said:
I have a Master's degree in econ and I'm trying to get into data science. I've had Data Analyst positions in the past but I did very basic stuff, basically plotting time series of employment, educational attainment, and other demographic data. I also know some basic programming, most in Stata.

I took some high-level stats courses but I'm pretty rusty since I don't use it on a daily basis. I recently bought a few math review books (stats, linear algebra, calc) and I've been teaching myself Python, which is going well.

I currently work a couple part-time jobs that aren't data-related at all. I loved stats and working with data in school but after graduating the jobs I found basically amounted to creating simple graphics in Excel, which was extremely boring. I'd love to get back into working with data but in a more stimulating role. My worry is that my resume will look horrible to potential employers since I never took college classes about programming, machine learning, data scraping, and other concepts that seem to be key to Data Scientist positions.

I've been looking through the resources in the OP but I wonder whether there are any certificates or licenses that would be useful to have. I've seen there are different certificates you can earn to prove you know how to use Excel or whatever but I'm always skeptical about how impressive they are. Am I wrong about this? Would it be useful to pursue certificates that show I know Python, SQL, or any other relevant concepts/ skills?

Excel isn't something I would go for, it is either something really to pick up or it's only used because other people at the company is too entrenched in it to consider something else.

My usual advice for these kind of questions is to work backwards from what is the ideal job that you want to obtain. With your background in econ, it is possible that the work that you want to do involves more stata and excel. Go look at job postings and see what skills that they want you to have, and then go from there.

Clay · Apr 22, 2019

Totakeke said:
Excel isn't something I would go for, it is either something really to pick up or it's only used because other people at the company is too entrenched in it to consider something else.

My usual advice for these kind of questions is to work backwards from what is the ideal job that you want to obtain. With your background in econ, it is possible that the work that you want to do involves more stata and excel. Go look at job postings and see what skills that they want you to have, and then go from there.

Thanks!

Good advice, I'll look into some postings. Are there any certificates or licenses that are just generally good to have though?

Totakeke · Apr 22, 2019

Personally for someone with a light resume I would prefer hobby projects over a certificate. There's no equivalent to standardized IT certificates so the value of a certificate is pretty much tied to whether the people who are hiring you have been through the same programs. So if you really need to, just pick the popular ones, otherwise I don't think it's that valuable generally.

Edit: I wouldn't get certificates in Python/R/SQL necessarily, but it's definitely valuable to get a formal education in statistics, a/b testing, machine learning, and deep learning since that's harder to learn by yourself and know that you're doing it right. Again, which one is more important will depend on the job that you want to do.

Spliced-Up · Apr 23, 2019

I'm currently working as a data analyst / implementation specialist and I'm very interested in making data science the next step in my career. Was wondering what would be the best areas to focus on to make that transition. I have a background and degree (but from a for profit, basically worthless) in software development and I currently mainly work on writing queries, stored procedures, and some SSIS packages to migrate data from one system/format to another. As of now I have solid skill and experience with SQL and a variety of object oriented coding languages.

Should I start with learning Python and R? Or would I want to focus on the statistics and machine learning side of things first? I see it being recommended that I focus on doing my own projects instead of working towards certifications, so I would ideally like to start working towards that as quickly as possible.

Thanks for any input.

LakeShore · Apr 29, 2019

Been following this thread for a while, but figure I'd make a comment in here.

So on last weekends football results, I tried to predict the scores with the Poisson Distribution. I spent the Friday morning taking the variables for the premier league table. Total games played, total goals scored home and away.

Then created the attack strength and defence strength and the average home and away goals scored for opponents.

I had:

Liverpool 4 - Huddersfield 0 : 14.37% likelihood
Spurs 1 - West Ham 0: 16.53%
Crystal Palace 0 - Everton 1: 17.75%
Fulham 1 - Cardiff 0: 13.5%
Southampton 1 - Bournemouth 1: 9.82%
Watford 1 - Wolves 1: 13.40%
Brighton 0 - Newcastle 0: 19.18%

Well, that went terribly:

Liverpool 5-0
Spurs Lost 0-1
Palace and Everton 0-0
Fulham 1-0 (Woooooo)
Southampton 3 - Bournemouth 3 (so was still a draw, but not the expected goals frequency)
Watford lost 1-2 to Wolves
Brighton 1 - Newcastle 1 (so was another draw, but again, not expected goals frequency)

So if betting on results, would have got 4 correct (liverpool W, Fulham W, Southampton D, Brighton D), but actual correct scores, still some way off. But it was fun though.

Haselbacher · Apr 30, 2019

LakeShore said:
Been following this thread for a while, but figure I'd make a comment in here.

So on last weekends football results, I tried to predict the scores with the Poisson Distribution. I spent the Friday morning taking the variables for the premier league table. Total games played, total goals scored home and away.

Then created the attack strength and defence strength and the average home and away goals scored for opponents.

I had:

Liverpool 4 - Huddersfield 0 : 14.37% likelihood
Spurs 1 - West Ham 0: 16.53%
Crystal Palace 0 - Everton 1: 17.75%
Fulham 1 - Cardiff 0: 13.5%
Southampton 1 - Bournemouth 1: 9.82%
Watford 1 - Wolves 1: 13.40%
Brighton 0 - Newcastle 0: 19.18%

Well, that went terribly:

Liverpool 5-0
Spurs Lost 0-1
Palace and Everton 0-0
Fulham 1-0 (Woooooo)
Southampton 3 - Bournemouth 3 (so was still a draw, but not the expected goals frequency)
Watford lost 1-2 to Wolves
Brighton 1 - Newcastle 1 (so was another draw, but again, not expected goals frequency)

So if betting on results, would have got 4 correct (liverpool W, Fulham W, Southampton D, Brighton D), but actual correct scores, still some way off. But it was fun though.

I love this!
Can you explain some more, what you did and how?

I wanted to do something similar. But I think sports with more games, bigger sample size may be better. Like NBA or MLB.
Just from the data point of view.

But I think your predictions are not that bad!

HarryHengst · Apr 30, 2019

Haselbacher said:
I love this!
Can you explain some more, what you did and how?

I wanted to do something similar. But I think sports with more games, bigger sample size may be better. Like NBA or MLB.
Just from the data point of view.

But I think your predictions are not that bad!

This explains the process pretty well: https://help.smarkets.com/hc/en-gb/...ate-Poisson-distribution-for-football-betting

Also, for everyone interested in data science, you will have to pick up statistics and probability. To do that you need, among others, calculus. If you've never done it, or got nightmares from your college classes, the solution is Professor Leonard. He filmed his classes he teaches at some community college and he is the absolute best at explaining this stuff in a way that makes you go ''huh, so calculus isnt hard after all?!?!". He has full playlists for Calculus I-III, Statistics, and Algebra (in case you need to work on your pre-calculus stuff), and is currently working on a series on differential equations.

Tebunker · Apr 30, 2019

Anyone use Qlik in their jobs? I am moving to a company using Qlik view and Qlik Sense and I will be helping build user adoption and training for self service analytics while also developing reporting and analytics for the leadership.

I would have preferred a job using Tableau or Power Bi but this one has me excited because they want to help me learn python and more sql so I can fill in on the agile team

Totakeke · May 2, 2019

Just saw this today, never heard of them before, but they have a great article on issues with bias on modeling.

https://parametric.press/issue-01/the-myth-of-the-impartial-machine/

LakeShore · May 3, 2019

Haselbacher said:
I love this!
Can you explain some more, what you did and how?

I wanted to do something similar. But I think sports with more games, bigger sample size may be better. Like NBA or MLB.
Just from the data point of view.

But I think your predictions are not that bad!

Hey, Sorry for the late reply. as HarryHengst has commented above, I used this link as a guide to work out the figures. I did it all on excel - https://help.smarkets.com/hc/en-gb/...ate-Poisson-distribution-for-football-betting

NBA could be done perhaps? just there'd be a lot higher averages and attack / defence scores to apply. And as for the variable, I'd not know what the ceiling is in NBA, where as in football, one team scoring 6 goals a game would usually be the most. There must be some forms of adjusting this process to other sports though. I'd figure it'd work for NHL as they're similar to football with regards to scores per game right? Maybe on the odd occasion a team might score +6 goals?

fanboi · May 3, 2019

We are currently implementing Metabase in one of our projects I am running for generating reports and data for the company (and for the clients we work for as well).

The tool is open source and is incredible easy to use, and powerful.

Supreme Leader · May 16, 2019

Supreme Leader said:
What are some recommended resources for learning Power BI outside of the ones in the OT

Thanks for all the previous help with my Power BI questions.

While searching through Reddit I found this Microsoft certification - Analyzing and Visualizing Data with Power BI
https://www.reddit.com/r/PowerBI/comments/boode8/power_bi_certification_70778_microsoft_or_edx/

I will probably start studying for this exam tonight using the edx lesson plan and the goal to take the exam in 2-3 weeks. I'm passing this information along for anyone that would like to join me in studying or interest in learning Power BI.

Nacho Papi · Jun 11, 2019

Hi,

I was hoping someone could guide me as I take my first baby steps in the ML/AI world.

I'm also a bit embarrassed to say this but frankly I'm not even sure if the problem statement/scenario I had in mind is even suitable/applicable for an ML solution approach, so please forgive my ignorance.

I have spent a few months trying to understand the absolute basic ideas/concepts under the ML banner and wanted to try playing around with some hypothetical scenarios.

What I'm currently unsure about is, again, if this problem could even benefit from ML solutions but that's why I'm asking the questions, looking foolish be damned!

So, without further ado (please play along):

Imagine a restaurant owner, who possesses a vast, and varied, amount of data on his/her clients, wanting to gauge which of their characteristics/attributes account most towards the profitability of said clients.

These client metrics may include, but are not limited to:

Lifetime spend at the establishment
The country they are from
Their street address
Their postal code
Their bank
Their bank card type (silver/gold/hyperium)
Number of visits to the restaurant
Duration of visit
Price of items ordered
Cost of items ordered
Type of items ordered
Brand of their clothes (within reason, cheap/mid/high end)
The make and model of their cars
Etc, for another 10 dimensions/variables

So some fundamental questions I have are as follows...

Would ML models even be suitable here, where I would for example like to show:

How important each metric can be towards the profitability of a client? E.g. make and model of their car influences 20% of overall profitability whereas their brand of clothes only 2%.
How much more/less do clients from affluent spend during their visits?
Would it be worth it to make the floor space smaller by X tables to cater for Y more parking spaces if clients with cars (fancy or not) are disproportionately more profitable than those without?

And so forth…I'm sure there are other, better, comparisons to make but I'm just spit-balling here.

Could all these variables be considered 'equal', or does one need to perform some form of dimensionality reduction (PCA?) before feeding them into whatever prospective model you may have?
Speaking of models…if you want to compare 1 out of N variables against all N metrics (in this case, life time spend versus all other metrics mentioned) where does one start? What would a basic NN look like where that is what you're aiming to do (that is, compare how much influence/impact each of your N variables have on your one key/focus metric)?

I'm so sorry for the stupid questions, as I'm reading through it I realize my extremely tenuous grasp of it all but god damn I want to start somewhere and I have the big data to use to learn it with...

Please let me know if my ramblings don't make sense ERA, I'll try my best to clarify my intentions and/or blockers.

Any help is appreciated.

King Picollo · Jun 11, 2019

Well isn't this the thread i never knew i needed, thanks for the resource suggestions i will hope this can turn me from a data monkey to something more useful.

Vic Damone Jr. · Jun 11, 2019

Any Data Scientists or Analysts here? What does your day to day work look like? I'm close to finish a BS in Computer Science and I'm trying to pin down on a career.

ieandrew · Jun 11, 2019

Nacho Papi said:
Imagine a restaurant owner, who possesses a vast, and varied, amount of data on his/her clients, wanting to gauge which of their characteristics/attributes account most towards the profitability of said clients.

These client metrics may include, but are not limited to:

Lifetime spend at the establishment

The country they are from

Their street address

Their postal code

Their bank

Their bank card type (silver/gold/hyperium)

Number of visits to the restaurant

Duration of visit

Price of items ordered

Cost of items ordered

Type of items ordered

Brand of their clothes (within reason, cheap/mid/high end)

The make and model of their cars

Etc, for another 10 dimensions/variables

So some fundamental questions I have are as follows...

Would ML models even be suitable here, where I would for example like to show:

How important each metric can be towards the profitability of a client? E.g. make and model of their car influences 20% of overall profitability whereas their brand of clothes only 2%.

How much more/less do clients from affluent spend during their visits?

Would it be worth it to make the floor space smaller by X tables to cater for Y more parking spaces if clients with cars (fancy or not) are disproportionately more profitable than those without?

And so forth…I'm sure there are other, better, comparisons to make but I'm just spit-balling here.

Could all these variables be considered 'equal', or does one need to perform some form of dimensionality reduction (PCA?) before feeding them into whatever prospective model you may have?

Speaking of models…if you want to compare 1 out of N variables against all N metrics (in this case, life time spend versus all other metrics mentioned) where does one start? What would a basic NN look like where that is what you're aiming to do (that is, compare how much influence/impact each of your N variables have on your one key/focus metric)?

I'm so sorry for the stupid questions, as I'm reading through it I realize my extremely tenuous grasp of it all but god damn I want to start somewhere and I have the big data to use to learn it with...

Please let me know if my ramblings don't make sense ERA, I'll try my best to clarify my intentions and/or blockers.

Any help is appreciated.

Interesting scenario, and good questions. You mention a basic NN specifically, but that might not be the best direction to go for the insight you hope to gain. A NN has numerous weights and biases that it learns by way of optimization, and those w's and b's are representative of the NN learning what 'features' make for the best predictors of your response variable (profit). But it's not the easiest thing to do to associate those w's and b's with your distinct input metrics. A truly basic NN could be set up to basically replicate Logistic Regression, and in that case it would be possible to look at the weight matrix and see which metrics have larger or smaller values.

Gradient boosting (e.g. LightGBM) would be an option where your model would learn decision trees which best predict profit, and then you can view 'Feature Importance' and it will tell you which features were most important to the model (in LightGBM's case, it will give you whole numbers indicating the number of times each feature was used in a 'split' in the tree).

Going even simpler, you could use a LASSO regression model. You'll need to one-hot encode your categorical variables. There are several assumptions about your data (such as the presence of multicollinearity) that you'd want to check, but skip to the end and what that model will do is drive unimportant features toward '0' and you will be left with features that have weights which indicate they are a) relevant to the prediction and b) how might weight/how significant they are to the prediction.

For some of the specific questions you have, like the bolded: look into SHAP values.

Dimensionality reduction is not necessary here. But if you went that route you'd need to recompose your original features before you could ever gain answers to your questions, as PCA disguises them.

FantaSoda · Jun 11, 2019

impingu1984 said:
EnterpriseDNA has some good PowerBI resources, some of which is free...

Also learn DAX. It can be applied to powerpivot etc as well...

To do really cool stuff in PowerBI dax knowledge is essential.

DAX drives me crazy because (to my knowledge) there is no equivalent to IF THEN ELSE without doing nested IF statements.

Deleted member 8257 · Jun 11, 2019

jred250 said:
DAX drives me crazy because (to my knowledge) there is no equivalent to IF THEN ELSE without doing nested IF statements.

You can do SWITCH(). Saved me a ton of nested ifs. Good thing is SWITCH() allows other DAX functions in it as well, even Ifs if you want to fine grain.

FantaSoda · Jun 11, 2019

RustyNails said:
You can do SWITCH(). Saved me a ton of nested ifs. Good thing is SWITCH() allows other DAX functions in it as well, even Ifs if you want to fine grain.

A lot of times I'm doing logic on two fields (and sometimes more). Maybe I'm not using SWITCH to it's fullest potential, but I thought it was dependent on the value of a single field.

maxxpower · Jun 11, 2019

Like I've mentioned before. I've been teaching myself data science for the past two years and I've gotten really good and would like to make a career out of it. I truly enjoy doing this. I feel however, that most important data science problems have already been solved and I don't know what a novice like myself can do to contribute in the field. Additionally, will data scientists even exist in a decade from now?

Pau · Jun 11, 2019

The Last Wizard said:
Any Data Scientists or Analysts here? What does your day to day work look like? I'm close to finish a BS in Computer Science and I'm trying to pin down on a career.

I'm a data analyst for a social science research organization. My day to day tasks are the following:

Processing and cleaning raw data from various studies and administrative records
Restructuring data so that it's ready for analysis
Running descriptive statistics, presenting them to the researchers
Running impact models to determine whether or not an intervention had any effect, presenting them to the researchers
Creating tables and graphs for reports
Communicating with the people that provide us the data, going over discrepancies or weird trends that we find
Consulting with the people collecting data to make sure that they are getting usable data (e.g., helping create surveys, asking for the right data so that we can match subjects across our records)
Documenting everything so that the analysis is replicable
Helping other people with their programming

I spend most of my time on the first two bullets, but it really depends on where we are on a project.

I work almost exclusively in SAS. The programming and statistics is all pretty simple. In part because the data doesn't require anything too out there. But also because researchers and analysts in this field aren't typically trained at all in programming or computer science and are only minimally trained in statistics (probably a little more than an introductory course).

Nacho Papi · Jun 11, 2019

ieandrew said:
Interesting scenario, and good questions. You mention a basic NN specifically, but that might not be the best direction to go for the insight you hope to gain. A NN has numerous weights and biases that it learns by way of optimization, and those w's and b's are representative of the NN learning what 'features' make for the best predictors of your response variable (profit). But it's not the easiest thing to do to associate those w's and b's with your distinct input metrics. A truly basic NN could be set up to basically replicate Logistic Regression, and in that case it would be possible to look at the weight matrix and see which metrics have larger or smaller values.

Gradient boosting (e.g. LightGBM) would be an option where your model would learn decision trees which best predict profit, and then you can view 'Feature Importance' and it will tell you which features were most important to the model (in LightGBM's case, it will give you whole numbers indicating the number of times each feature was used in a 'split' in the tree).

Going even simpler, you could use a LASSO regression model. You'll need to one-hot encode your categorical variables. There are several assumptions about your data (such as the presence of multicollinearity) that you'd want to check, but skip to the end and what that model will do is drive unimportant features toward '0' and you will be left with features that have weights which indicate they are a) relevant to the prediction and b) how might weight/how significant they are to the prediction.

For some of the specific questions you have, like the bolded: look into SHAP values.

Dimensionality reduction is not necessary here. But if you went that route you'd need to recompose your original features before you could ever gain answers to your questions, as PCA disguises them.

Thank you so much for your input ieandrew, I appreciate it - I will follow your guidance and look closer at the methods you mentioned.

impingu1984 · Jun 11, 2019

The Last Wizard said:
Any Data Scientists or Analysts here? What does your day to day work look like? I'm close to finish a BS in Computer Science and I'm trying to pin down on a career.

My main tasks are as follows:

Data wrangling and ETL.. in my current job we use alteryx for this which is easy mode... seriously everyone should use alteryx you can even run python, R, CMD in workflows..

Data analysis - again looking for trends, patterns answering questions, finding questions to ask, use all kinds of tools for this, alteryx, sql scripting, R excel, power bi, tableau.

Predictive analytics - I personally use R for this.. can we predict things better... Usually results in a alteryx workflow with R scripts in the workflow.

Presenting data - have used tableau in the past for dashboards, now use powerbi..

Creating reporting solutions - basically ties into all the above but overarching theme is that everything I do has to be delivered in a way end users can self service and it can be automated. Alteryx makes the automation super easy and also makes ETL from every possible source of data you can think of easy as well. Powerbi makes it easy to present this data in a interactive way for the end user.

Keeping this in mind we must deliver self service and automation really changes how you develop analysis... We do ad-hoc analysis but still generally follow a similar process so it's easy to repeat.

It's an extremely broad job role. In some companies it maybe split into analysts (structured data), data scientist (unstructured data and predictive analytics) and data engineer / data warehouse developer. But I'm just a jack of all trades in this regard.

Gazele · Jun 11, 2019

My daily tasks change but I'd say on average its:
60% programming (prototyping, testing, optimization) - My company uses Python
30% ETL, data wrangling, EDA - SQL and Jupyter notebooks using Python
10% Meetings

Dr Frasier Crane · Jun 12, 2019

The Last Wizard said:
Any Data Scientists or Analysts here? What does your day to day work look like? I'm close to finish a BS in Computer Science and I'm trying to pin down on a career.

The morning so far:
Log into AWS as my previous days login would have expired so I can get it out the way
Log onto our work VPN
Check the Markov Modelling I'd been running when I went to bed had finished, it has not. Decided to trash it because I've got better things to do with my life.
Refactor some code (we're both an R and Python shop but I use R), splitting the munging from the modelling.
Realise that the reason the above Markov model was taking so long was because I'd left an option as "," when it should be ">".
Facepalm and fix it, it now takes like 5 seconds.
This also fixes an error I was attributing to RccpArmadillo but is actually just my own moron-ness.
Start pissing around with some network type visualisations, but they're not massively helpful.

The afternoon will probably be moving some data around so I can stop using EC2 and get data onto my local laptop, more buggering around with network visualisations. Then maybe redo a bunch of the data pipeline from dplyr/data.table code into pure (BigQuery) SQL. Depends how I'm going. However my git repo is getting a bit messy and having talked the talk about comments and documentation at a workshop yesterday I should probably walk the walk and do some housekeeping.

At the same time I'll be scumming through twitter to pick up hints, tips, articles, and blog posts to stay on top of the game.

Greggs for lunch, what a time to be alive.

backpropaganda · Jun 12, 2019

Anyone ICML this week or CVPR next week?

signal · Jul 11, 2019

https://blog.dominodatalab.com/data-science-at-the-new-york-times/

jamesandy · Jul 25, 2019

Seeing interest in NLP rising and now more and more job listings mentioning it. Anyone here with any directions how to go about learning it? Thanks.

Data Science Era |OT| Desktop BI, Deep Learning and Everything in Between

Attempted to circumvent ban with alt account

Self-Appointed Godmother of Bruce Wayne's Children

Attempted to circumvent ban with alt account

Self-requested permanent ban

Self-Appointed Godmother of Bruce Wayne's Children

Self-Requested Temporary Ban

Self-requested permanent ban

Deleted member 8257

Attempted to circumvent ban with alt account

Self-Appointed Godmother of Bruce Wayne's Children