How an online real estate company optimized its Hadoop clusters

San Francisco-based online residential real estate company Trulia lives and dies by data. To compete successfully in today's housing market, it must deliver the most up-to-date real estate information available to its customers. But until recently, doing so was a daily struggle.

Acquired by online real estate database company Zillow in 2014 for $3.5 billion, Trulia is one of the largest online residential real estate marketplaces around, with more than 55 million unique site visitors each month.

With so much data to store and process, the company adopted Hadoop in 2008, and it has since become the heart of Trulia's data infrastructure. The company has expanded its use of Hadoop to an entire data engineering department consisting of several teams using multiple clusters. This allows Trulia to deliver personalized recommendations to customers based on sophisticated data science models that analyze more than a terabyte of data daily. That data is drawn from new listings, public records and user behavior, all of which is then cross-referenced with search criteria to alert customers quickly when new properties become available.

To make it all work, throughout each night, the company must complete dozens of workflows and hundreds of complex jobs on time. With many teams writing Hadoop jobs or using Hive or Spark concurrently, Trulia has to ensure reliability in its multi-tenant, multi-workload environment. Delayed or unpredictable jobs throw a wrench in the works and can seriously affect the bottom line. Until recently, that meant Trulia had to intentionally underutilize its Hadoop clusters to ensure jobs completed on time.

"We process, on a daily basis, over a terabyte of new information: public records, listings, user activity," says Zane Williamson, senior director of DevOps at Trulia. "We process this data across multiple Hadoop clusters and use the information to send out email and push notifications to our users. That's the lead driver to get users back to the site and interacting. It's very important that it gets done in a daily fashion. Reliability and uptime for the workflows is essential."

"It's been a pretty painful process, I think," Williamson adds, noting that he joined Trulia relatively recently. "It's been a pretty big challenge to reliably run this data cycle, maintain uptime and troubleshoot issues."


