Stochastic Growth: Becoming an Effective Data Scientist with Grit

In the course of three years, Doma has evolved into a top 10 provider of real estate settlement services, an outcome in no small part powered by patented and highly effective data science algorithms. This trajectory seems even more remarkable when I reflect on the fact that Allen Ko and I were recruited to Doma fresh out of a data science boot camp in the company’s infancy. We were the first two full-time data scientists at a company with fewer than 10 employees. 

Back in 2017, Allen and I had no formal software engineering training, no experience on data science teams, no knowledge of the title and escrow industry, and no idea how to put models into production. One thing working in our favor was our combined eight years of engineering experience – a foundation that eased the process of learning complex subjects. The company had an interim Head of Data Science to help guide the company’s high-level strategy, but it was left to Allen and me to execute the company’s machine learning vision.

Despite our lack of experience, we were able to lay down the initial data and model-building proofs of concept that eventually grew into the humming data science ecosystem that now instantly generates title commitments on over 80 percent of orders from lenders across the country. The story of how I went from feeling wholly unqualified for the expectations set before me to contributing to this achievement is worth telling.

A thirst for learning as the medicine for uncertainty

During my first few months on the job, it was hard to shake the concerns I had about my day-to-day approach to data science. “Is this code efficient? Am I doing enough commenting and documentation? Do real data scientists even use Jupyter Notebooks?”

I wasn’t alone. Many of my fellow bootcamp graduates shared these feelings while making the transition to applied data science – even those who were hired into environments with extensive mentorship and code infrastructure.

Compounding the challenges around optimal code development was the prospect of gaining real estate and insurance subject matter expertise. Building machine learning models that underwrite risk was going to require a deep understanding of the title insurance industry. We needed to become fluent in the legal jargon that filled the title commitments our company had acquired for model training.

[Image: an example of the industry-specific language that pervaded our source material.]

To overcome these challenges, Allen and I fell back on our experience as engineers. We had transitioned to data science from the semiconductor and energy industries, respectively. Both fields involve complex subject matter expertise that is not easily discoverable. The following tactics made the biggest impact when applied to learning title and escrow:

  1. Leverage industry veterans: Whether it was through networking or hiring pipelines, we relentlessly connected with title experts to ask questions and challenge the status quo. We didn’t treat any single source as gospel, as it was commonplace to find contrasting answers and viewpoints. This process was instrumental in shaping our solution. It motivated a change in our approach from a holistic risk model to a suite of three models tailored to specific components of risk.
  2. Be scrappy: We drew on imperfect information in the public domain and from internal resources to broaden our perspective. This process was essential to asking better questions and becoming aware of potential pitfalls with our product.

Throughout this process of discovery and learning, staying humble helped us iterate quickly towards better solutions.

Having faith in a slow and steady growth process

What remained was a challenge that is common to data science roles in any industry: striking the right balance between learning and executing. One of my favorite aspects of data science is the sheer volume of tutorials, blogs, videos, papers and coursework to consume. There are so many rabbit holes to go down, whether it’s understanding how transformers work, learning the nuances of various optimization algorithms, or reading about jaw-dropping applications of machine learning at tech giants. Such an abundance of public knowledge can be overwhelming, or even intimidating, in some contexts – hello, 20-hour YouTube playlists of college lectures.

I find it hard to fully comprehend a technique without implementing it, and implementing new techniques just to test them is a grind. In that way, the opportunity cost of picking up new skills can feel significant. “Is this tool worth mastering? Will I ever use it again? How can I get an engineer to do it for me?”

One mid-2017 project required heavy doses of regular expressions, programmatic PDF parsing, and learning the ins and outs of managing permissions on S3 buckets (as great as many AWS products are, it is mind-boggling how incomprehensible the documentation was when I was learning the command line interface).
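The regex side of that work looked roughly like the sketch below: pulling structured fields out of text extracted from commitment PDFs. The field names and patterns here are illustrative assumptions, not Doma's actual code, and the extraction step itself (e.g. via a library like pdfminer) is elided.

```python
import re

# Hypothetical snippet of text extracted from a title commitment PDF.
# A real pipeline would first run a PDF-to-text step on the document.
raw_text = """
SCHEDULE A
Parcel ID: 123-456-789
Recorded in Book 842, Page 17 of Official Records.
"""

# Pull out the parcel identifier (the format is an assumption for illustration).
parcel = re.search(r"Parcel ID:\s*([\d-]+)", raw_text)

# Find book/page recording references, a staple of title legalese.
recordings = re.findall(r"Book\s+(\d+),\s*Page\s+(\d+)", raw_text)

print(parcel.group(1))   # 123-456-789
print(recordings)        # [('842', '17')]
```

Fragile as patterns like these are, they were enough to turn unstructured legal text into features a model could consume.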

These subjects aren’t at the top of any data science syllabus, and devoting significant blocks of time to them made me feel like I was missing out on learning what I deemed to be “more important” concepts. This was a mindset I thankfully grew out of as I came to terms with the pace of skill development necessary to grow within the data science function at Doma. “It’s not going to happen overnight, I don’t have to do it in the ‘right’ order, and it’s not worth losing sleep over whether I’m doing too much or too little of it.”

Leveraging our strengths

No matter the rate of learning, Allen and I never would have achieved our goals without doubling down on our area of expertise – training supervised machine learning models. We dedicated ourselves to exploring as much property data as we could get our hands on, and then harnessing that data to train and evaluate risk models. This work yielded valuable short-term benefits, including confidence in the company’s plan to deliver instantly underwritten title commitments, and assessments of which data providers gave our models significant lift. 

At the same time, we were forgoing a number of “best practices.” Whether it was rudimentary GitHub usage, never writing test code, not using virtual environments, or being overly reliant on CSV files stored on an EC2 instance, we were creating a litany of problems for ourselves down the road. As problematic as that sounds, taking those shortcuts was essential to the timely growth and development of our models.

Once we were able to build confidence in our loss ratio projections using our machine learning approach, we began chipping away at our backlog of tech debt. This process was accelerated by the hiring of Andy Mahdavi as our Chief Data Scientist in the fall of 2017. Coming from a leadership role at Capital One, where he directed mature data science teams, Andy vastly improved our prioritization of infrastructure improvements.

The functionality we tackled first included:

  • versioned feature engineering tables
  • configuration files that orchestrate model training scripts
  • GitHub branches that preserve deployed model iterations
  • automatically publishing visualizations of model performance
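The second item above, config-driven training, can be sketched as follows. All of the names and parameters here are illustrative assumptions; Doma's actual schema and pipeline code are not public.

```python
import json

# Illustrative training configuration; a real one would be a versioned
# file checked into the repository alongside the training scripts.
config = json.loads("""
{
  "model_name": "title_risk_v2",
  "feature_table": "features/property_v3.parquet",
  "target": "claim_within_5y",
  "params": {"n_estimators": 200, "max_depth": 6}
}
""")

def train(config):
    # In a real pipeline this would load the versioned feature table,
    # fit a model on the configured target, and publish performance
    # visualizations; here we just echo the orchestration step.
    print(f"Training {config['model_name']} on {config['feature_table']}")
    return {"model": config["model_name"], "params": config["params"]}

artifact = train(config)
```

The appeal of this pattern is that a model iteration becomes a diff to a config file rather than a copy-pasted script, which pairs naturally with the versioned feature tables and preserved GitHub branches listed above.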

Allen and I had survived our initiation into data science and were off to the races: in February 2018 we delivered the first title commitment ever created with the assistance of machine learning models.

Moving forward with confidence

That first year gave me so much clarity about what becoming a better data scientist feels like. As we head into 2020 and beyond, our data team has ambitious goals – including my role in developing an expansive suite of natural language processing (NLP) models. I’ve never implemented a production-grade NLP model. Instead of feeling stressed or daunted by this, I know that the right mindset, level of effort, and talented coworkers will enable us to overcome the challenges that lie ahead… so I might as well enjoy the process.