A more technical post about how I ended up efficiently JOINING 2 datasets with REGEX using a custom UDF in SPARK. Context: For the past couple of months I have been struggling with this small problem. I have a list of REGEX patterns and I want to know which WIKIPEDIA articles contain them. What I… Continue Reading “Spark JOIN using REGEX”
Let’s not start with data science this time. Let’s start with psychology. I am far from having any competence in this domain, but I remember being presented with Maslow’s hierarchy of needs in high school. The best way I can describe it is as the different stages humans must go through to find happiness. To get better… Continue Reading “The data science pyramid”
Data Science is getting very popular and many people are trying to jump on the bandwagon, and this is GREAT. But many assume that data science, machine learning (plug in any other buzzword here) is just a matter of plugging data into some scikit-learn libraries. Here is what the actual job is.
To give you some context, the following happens after the data has been collected. Don’t get me wrong, I don’t think data collection should be considered a simple step, but I would like to focus on data pre-processing and normalization. Continue Reading "This is what I really do as a Data Scientist"
A more efficient loss function for Siamese
I would like to share with you a small tool I discovered this year which is very useful: the violin plot! Continue Reading "Distribution’s 5th Symphony"