Lucene/Solr Revolution 2016: Full Schedule

10:00am EDT

Leveraging the power of Solr with Spark

Solr is a distributed NoSQL database with impressive search capabilities. Spark is the new megastar in the distributed computing universe. In this code-intense session we show you how to combine both to solve real-time search and processing problems. We will show you how to set up a Solr/Spark combination from scratch and develop first jobs that run distributed on shared Solr data. We will also show you how to use this combination for your next-generation BI platform.

Speakers

Johannes Weigend

CTO, QAware GmbH

Johannes has worked as a software architect with Java since 1999 and was honoured as "Java Rockstar" at JavaOne 2015. He is a lecturer at the University of Applied Sciences in Rosenheim, Germany and technical director at QAware, a decorated software engineering company located in Munich...

Thursday October 13, 2016 10:00am - 10:40am EDT
Commonwealth Sheraton Boston

Data Science

Skill Level Intermediate

10:00am EDT

Why is my Solr slow?! (An HTrace Case Study)

Attempting to diagnose distributed system slowness can be one of the most challenging and headache-inducing activities for an operations team. Armed with a suite of low-level metrics and host monitoring, the team often has just one option available: a lengthy process of elimination of each component in the hardware and software stack.

In this session we will learn how to instrument Solr to send distributed tracing data to HTrace. We will then look at some sample traces and learn how to identify slowness in different parts of the entire stack, looking for trends and outliers in the Solr operation. We will complete the session by discussing how to add tracing to your existing client applications for true end-to-end visibility into the performance of your cluster.

Speakers

Mike Drob

Software Engineer, Cloudera

Mike has been immersed in Big Data for over 5 years, previously with the US Government and now with Cloudera. His current role is to provide operational support for Apache Solr, a world-class search engine built on top of Apache Lucene. He is also a hobbyist contributor to several...

Thursday October 13, 2016 10:00am - 10:40am EDT
Gardner

Ecosystem

Skill Level Intermediate

10:00am EDT

Hidden Gems of Apache Solr

Every day billions of documents are searched, sorted, faceted and highlighted by millions of users who have no idea that behind the scenes, Apache Solr is hard at work, making life simple for developers like you. But what else can Solr do for you?

In this session, we'll dive into some of the less well-known, less understood features of Apache Solr that even seasoned Solr developers may not be aware of -- features that can be useful in ways you might not have considered even if you do know about them, so you can take your Solr-powered applications to the next level.

Speakers

Chris Hostetter

Software Engineer, Lucidworks

Chris 'Hoss' Hostetter is a Member of the Apache Software Foundation, and a committer on the Lucene/Solr Project. Prior to joining Lucidworks in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about searching...

Thursday October 13, 2016 10:00am - 10:40am EDT
Back Bay B Sheraton Boston

Exploring Solr

Committer Talk
Skill Level Intermediate

10:00am EDT

Solr Highlighting at Full Speed

Searching over a large corpus of legal documents brings about a number of unique challenges in search. In legal search, recall matters. Users often enter broad queries and leverage digests to help them determine the relevancy of a result before committing to reading a long document. This has made highlighting quality and speed with minimal memory use a key requirement for Bloomberg Law. In this talk, attendees will learn about Bloomberg Law's efforts to improve highlighting performance dramatically via the introduction of a new highlighter for Solr that uses your index to its best advantage.

Speakers

Timothy Rodriguez

Team Lead, Bloomberg

Timothy Rodriguez leads the Verticals Search Platform team at Bloomberg, which provides the underlying search platform for several Bloomberg products in the areas of law, government, and new energy finance. He works on a number of areas related to search such as query grammars, distributed...

David Smiley

Search Developer & Consultant, D W Smiley LLC

David Smiley is a well-recognized Apache Lucene/Solr expert. He wrote the first book on Solr (currently in 3rd edition), he's a Lucene/Solr committer and PMC member who improves Lucene and Solr, he speaks at conferences about it, he does training, and he offers part-time independent...

Thursday October 13, 2016 10:00am - 10:40am EDT
Independence Sheraton Boston

Use Case

Committer Talk
Skill Level Intermediate

10:50am EDT

Anyone Can Build a Recommendation Engine with Solr

You don't need a PhD to get started with recommenders! You just need Solr! In this talk, you'll get several examples of building different recommendation strategies on top of Solr. You'll see how to deliver recommendations using user behavior, and how to combine that with content-specific signals. We'll cover:

- Folks who purchased this product also purchased
- Personalized recommendations based on past browsing history
- How Solr makes tuning relevance and scaling straightforward

We'll also touch on many of the classic problems with recommenders, including the cold start problem and the Oprah Book Club problem. Come if you've got some Solr experience and would like to learn to build a recommender!

Speakers

Doug Turnbull

Chief Technical Officer, OpenSource Connections

Search relevance consultant. Author of Relevant Search. Doug crafts search/recommendation solutions that “get” users. To do this, Doug uses Solr, sprinkling a little natural language processing and machine learning on top for good measure. Through writing and speaking Doug wants...

Thursday October 13, 2016 10:50am - 11:30am EDT
Back Bay B Sheraton Boston

Exploring Solr

Skill Level Intermediate

10:50am EDT

Loading 350M documents into a large Solr cluster in 8 hours or less

This session is a case study that shows you how a large set of XML documents can be loaded into a multi-collection Solr cluster in a fast, efficient and controlled way.

The presenter will show how Solr is used within his organization and then explain how his team started out with loading content into their SolrCloud using the standard post.jar tool, which has some concealed limitations.

You will see how this led to their current solution, which consists of multiple cloud-aware "content posting" worker processes, controlled by a clever master-less queuing system in ZooKeeper. The presenter will also cover how to load content into a busy Solr cluster without affecting the response times of running queries too much.

Speakers

Dion Olsthoorn

Senior Software Engineer, Wolters Kluwer

Dion Olsthoorn works as a Software Engineer for Wolters Kluwer, a publisher for professional content. He’s currently working on Ovid®, an online information delivery platform for medical research, where he and his team are responsible for building, enhancing and maintaining a large...

Thursday October 13, 2016 10:50am - 11:30am EDT
Back Bay A Sheraton Boston

Use Case

Skill Level Intermediate

11:40am EDT

Building a Solr Continuous Delivery Pipeline With Jenkins

In this session, I will demonstrate how to build a secure continuous delivery pipeline for Solr using Jenkins, various Jenkins plugins, and the installation scripts that are packaged with Solr. I'll cover how to (optionally) build Solr and deploy it using Solr's own scripts and why one might want to do this. I'll cover the fundamentals of continuous delivery and what a CD pipeline looks like using Jenkins plugins. Finally, I'll discuss the files comprising the Solr configuration that should be version controlled separately from Solr itself and how to configure various environments using core properties.

Speakers

James Strassburg

Senior Software Architect, Direct Supply

Jim Strassburg is an experienced software engineer, architect, and researcher. He has been building distributed software systems for over 15 years. In late 2012 he replaced the search engine for his company's e-commerce application with Apache Solr and got bit by the search bug. Lately...

Thursday October 13, 2016 11:40am - 12:20pm EDT
Gardner

Ecosystem

Skill Level Intermediate

11:40am EDT

SolrCloud: High Availability and Fault Tolerance

Committer Mark Miller will discuss the current SolrCloud architecture for handling disaster and recovery. This talk will cover how SolrCloud was designed to protect your data in the face of failure, some of the growing pains the system has gone through, and what is left to do in the near future when it comes to fault tolerance and recovery. Learn about the low-level details that help keep your data safe as well as what choices and decisions you should make as a SolrCloud user who cares about data integrity.

Speakers

Mark Miller

Software Engineer, Cloudera

Mark Miller is a Lucene / Solr committer and Apache member. After starting with Lucene in 2006, Mark has spent most of his time getting paid to work on the open source software projects that he loves. Mark has given many talks on Lucene/Solr at various conferences and meet-ups around...

Thursday October 13, 2016 11:40am - 12:20pm EDT
Independence Sheraton Boston

Exploring Solr

Committer Talk
Skill Level Intermediate

11:40am EDT

PlayStation and Lucene: Indexing 1 Million documents per second on 18 servers

What if I tell you that PlayStation 4 is not just a gaming console? What if I tell you that the PlayStation Network is a system that handles more than 70 million active users? What if I tell you that in order to create an awesome gaming experience, we support personalized search at scale? Finally, what if I tell you that the system that provides this personalized experience currently indexes up to 1 million documents per second using Lucene and only uses 18 mid-sized Amazon instances?

Intrigued? Join the talk to learn how it is possible!

Speakers

Alexander Filipchik

Principal Software Engineer, Sony Interactive Entertainment

Alex spent the last 4 years of his life building the next generation of the PlayStation Network. He is honored to be a part of the small team of engineers who managed to build a platform that scaled from 0 to 1 million users in just 1 day. This platform has been adding 1.5 million...

Thursday October 13, 2016 11:40am - 12:20pm EDT
Back Bay A Sheraton Boston

Use Case

Skill Level Intermediate

1:30pm EDT

Searching the Enterprise Data Lake with Solr - Watch us do it!

People talk a lot about building enterprise 'data lakes' or data hubs to knock down data silos and democratize data access to different types of users. These are abstract topics. Maybe it's time to stop talking and see, practically, how this can be done!

We are currently processing and searching data from a 'data lake' for a large life sciences customer and in this demo we'll show you, step by step, how this is accomplished. We'll take disparate data sources like document files and data tables; we'll show how these records can be combined, processed, prepared and indexed; and then we'll show search and visualizations on this content to provide business insight into this 'data lake'. All of this will be done with Solr Cloud.

Speakers

Paul Nelson

Chief Architect, Search Technologies

Paul was an early pioneer in the field of text retrieval and has worked on search engines for over 25 years. He was the architect and inventor of RetrievalWare, a ground-breaking natural-language based statistical text search engine which he started in 1989 and grew to $50 million...

Thursday October 13, 2016 1:30pm - 2:10pm EDT
Commonwealth Sheraton Boston

Data Science

Skill Level Intermediate

1:30pm EDT

Microsoft's Use of Solr to Deliver a Multitenant Log Analytics SAAS Service

We will present the architecture of the Search service backing Microsoft Operations Management Suite's Log Analytics Solution. With Microsoft Operations Management Suite, you can now empower operations teams to effortlessly collect, store and analyze log data from virtually any Windows Server and Linux source, regardless of volume, format or location. Separate the signal from the noise with simple, powerful log management tools and access real-time operational intelligence with improved troubleshooting, operational visibility and fast search to explore, investigate and fix incidents quickly.

Join us as we share our experience and learnings in resolving issues across the spectrum, such as scalability, COGS, compliance requirements, customer data isolation, data persistence, and query response streaming. Learn what it takes to run Solr on commodity hardware as well as on a scaled-up architecture.

Speakers

Chirag Gupta

Software Engineer, Microsoft Corporation

Chirag Gupta is a software engineer at Microsoft Corporation. In his current role, he is responsible for building and monitoring a scalable, performant, reliable, multitenant and COGS-efficient platform for the Microsoft Log Analytics SaaS service (Microsoft OMS). Previously, he has 15...

Srivatsan Parthasarathy

Partner Software Engineer, Microsoft Corporation

Srivatsan is a Partner Software Engineer at Microsoft Corporation. In his current role, he is responsible for the architecture of Operations Management Suite, a management SaaS offering that enables customers to manage their Linux or Windows assets on any cloud.

Thursday October 13, 2016 1:30pm - 2:10pm EDT
Gardner

Ecosystem

Skill Level Intermediate

1:30pm EDT

Cross Data Center Replication for the Enterprise

This presentation is meant to explore the use of cross data center replication, now available in Solr 6, to show a real-world example running in production. Iron Mountain has now been running cross data center replication (CDCR) for over a year. We have over 100,000 users and indexes supporting 26 clouds (5 billion documents) with rapid/continuous indexing. We rely on cross data center replication for disaster recovery and backups. This allows us to maintain a 'hot' standby environment for failover. We spent considerable effort performance testing and tuning CDCR as well as determining the hardware / storage required to support the system. There are some gotchas that need to be considered, such as the amount of disk space to allow for backups, network performance, adjusting configurations and monitoring.

CDCR works much like mirroring approaches for databases, yet there are some distinct differences in how this works for Solr. The implementation of CDCR was performed by several committers whom Iron Mountain engaged to develop the capability. Representatives from the team will speak to the technical approach CDCR uses, including the CdcrRequestHandler and the versioning approach.

Lastly, we will cover some possible future enhancements for CDCR, including improving throughput and extending the current code base to support active/active replication between multiple data centers.

Speakers

Adam Williams

Search Lead, Iron Mountain

Adam Williams has 17 years of experience as a software developer. Three years ago Adam began his journey with Solr after spending 10 years working on DOD modeling and simulation as well as Digital Asset Management projects for global pharmaceutical companies. Adam is currently...

Thursday October 13, 2016 1:30pm - 2:10pm EDT
Back Bay B Sheraton Boston

Exploring Solr

Skill Level Intermediate

1:30pm EDT

SearchHub or How to Spend Your Summer Keeping it Real

Dogfooding. Cobbler’s shoes. Whatever you want to call it, there’s nothing like building a real application on your own product to see the good, bad, and ugly of your own code. In this talk, we’ll walk through SearchHub, Lucidworks’ community-powered site for Apache and other open source projects that indexes hundreds of different public data sources to showcase Fusion and Solr capabilities ranging from the simple (search, faceting) to the complex (Word2Vec, Recommenders, Random Forests). The talk will highlight key integration points between Spark and Solr and how they are leveraged to do search, recommendations, and machine learning on email and user feedback. We’ll also cover some interesting crawling use cases as well as how to leverage Fusion’s experiment management framework to run multi-armed bandit tests.

Speakers

Grant Ingersoll

CTO, Lucidworks

Grant is the CTO and co-founder of Lucidworks, co-author of Taming Text, co-founder of Apache Mahout and a long-standing committer on the Apache Lucene and Solr open source projects. Grant’s experience includes engineering a variety of search, question answering, and natural language...

Thursday October 13, 2016 1:30pm - 2:10pm EDT
Independence Sheraton Boston

Use Case

Committer Talk
Skill Level Intermediate

2:20pm EDT

Tuning Solr and its Pipeline for Logs

This is an updated talk about how to use Solr for logs and other time-series data, like metrics and social media. In 2016, Solr, its ecosystem, and the operating systems it runs on have evolved quite a lot, so we can now show new techniques to scale and new knobs to tune.

We'll start by looking at how to scale SolrCloud through a hybrid approach using a combination of time- and size-based indices, and also how to divide the cluster into tiers in order to handle the potentially spiky load in real-time. Then, we'll look at tuning individual nodes. We'll cover everything from commits, buffers, merge policies and doc values to OS settings like disk scheduler, SSD caching, and huge pages.

Finally, we'll take a look at the pipeline of getting the logs to Solr and how to make it fast and reliable: where should buffers live, which protocols to use, where should the heavy processing be done (like parsing unstructured data), and which tools from the ecosystem can help.

Speakers

Radu Gheorghe

Search Consultant & Software Engineer, Sematext Group, Inc.

Radu Gheorghe is a search consultant, software engineer and trainer at Sematext, working mainly with Solr, Elasticsearch and logging-related projects.

Rafał Kuć

Software Engineer, Sematext Group, Inc.

In his professional life, Rafał is a Sematext trainer, consultant and software engineer, co-founder of http://solr.pl, and author of the Solr Cookbook and Elasticsearch Server books. In his personal life Rafał is a father and a husband.

Thursday October 13, 2016 2:20pm - 3:00pm EDT
Commonwealth Sheraton Boston

Ecosystem

Skill Level Intermediate

2:20pm EDT

State of Solr Security 2016

Apache Solr has, over the past 1-2 years, developed lots of security related features. This talk focuses on exploring all features available to Solr users to help them secure their Solr installations, including authentication, authorization, storage level security, ZooKeeper security, security against eavesdropping network packets, document level security, etc. This talk will consist of simple examples of how to use these security features and also explore the current challenges users, esp. enterprise users, face in securing their Solr clusters, as well as future needs of the Solr users and the road ahead.

Speakers

Ishan Chattopadhyaya

Software Engineer, Lucidworks

Ishan Chattopadhyaya is an engineer at Lucidworks and a contributor to the Apache Solr project. Prior to working at Lucidworks, Ishan worked on the Yahoo! Search team, in the Multimedia Search and Shopping Vertical Search teams. Ishan started his career with MapQuest (Aol)'s search, building...

Thursday October 13, 2016 2:20pm - 3:00pm EDT
Back Bay B Sheraton Boston

Exploring Solr

Skill Level Intermediate

2:20pm EDT

Near Real Time Indexing in Search

Imagine the frustration of a user who finds the perfect item while browsing, only to realize later (after clicking it) that it is out of stock, the price has changed, or it cannot be delivered to their location. This happens when the search index doesn't have the real-time availability, price and seller information. Hence it is a core challenge that an e-commerce marketplace search engine has to solve. Regular document search index technologies (like Solr/Lucene) have trouble dealing with attributes that are in constant flux (like availability and price), which are typically seller/listing-specific attributes. In this talk, we present the challenges and our solutions for a customized search index for e-commerce addressing these challenges.

Speakers

Thejus V M

Data Architect, Flipkart

Thejus is a software engineer working on the search systems at Flipkart. His work has spanned multiple aspects of search such as high throughput indexing, managing large scale distributed infrastructure, semantic identification, auto suggestion, scoring models and more.

Umesh Prasad

SDE III, Flipkart

Umesh is an SDE III at Flipkart. He is the resident Solr/Lucene expert at Flipkart and has been instrumental in building critical frameworks and solutions for the search team. Previously he built and evolved a vertical search & content aggregation service for Verse Innovation. Currently...

Thursday October 13, 2016 2:20pm - 3:00pm EDT
Back Bay A Sheraton Boston

Use Case

Skill Level Intermediate

3:10pm EDT

Improving Enterprise find-ability with custom relevance models

On the surface, search on the web and search within the enterprise share some common characteristics. However, there are key differences that make enterprise search a specialized domain. For each of Salesforce's 150,000 customers, we enable search over highly diverse and custom data sets spanning 3 distinct forms -- CRM data in relational systems, unstructured data in content management systems and enterprise social data.

This talk shares some of the insights we’ve gleaned from building a relevance engine for the enterprise from the ground up. Specifically, we lay out the components that enable us to machine learn our ranking function from training to evaluation. We showcase the customizations applied on various boosts and query functions provided by Solr based on the data type being searched. Finally, we touch upon some of the metrics that are used to measure and optimize search relevance by document type.

Speakers

Jayesh Govindarajan

Senior Director, Search Relevance, Salesforce

Jayesh is the Senior Director of Search and Data Science at Salesforce. He joined Salesforce through the acquisition of MinHash, a data science startup he founded to focus on solving problems in entity extraction, topic classification, and trend detection on an enterprise platform...

Thursday October 13, 2016 3:10pm - 3:50pm EDT
Commonwealth Sheraton Boston

Data Science

Skill Level Intermediate

3:10pm EDT

Parallel SQL and Analytics with Solr

Analytics has increasingly become a major focus for Apache Solr, the primary search engine in the Hadoop stack. This talk will cover recent Solr developments in the areas of faceting and analytics, including parallel SQL, streaming expressions, distributed join, and distributed graph queries. Given the increasing number of APIs and techniques that can be brought to bear, we'll also cover which approach should be preferred in different situations, including how to maximize scalability.

Speakers

Yonik Seeley

Solr Dude, Cloudera

Yonik Seeley is the creator of Solr. He works at Cloudera integrating and leveraging "Big Search" technologies into the many components comprising the Cloudera enterprise data hub (EDH). Yonik was previously a co-founder of LucidWorks, and he holds a master's degree from Stanford...

Thursday October 13, 2016 3:10pm - 3:50pm EDT
Independence Sheraton Boston

Data Science

Committer Talk
Skill Level Intermediate

3:10pm EDT

Aggregations: SolrCloud/Elasticsearch, Druid or HBase

You need to build a highly scalable system for executing aggregation queries in real time on big data. But you do not have several weeks to try each and every available technology that supports such queries and you are not sure which one to pick. We have taken the time to build fully functional prototypes and have learned important lessons that can serve as precious time-saving guidelines when deciding on the architecture of your system.

To have an unbiased comparison, we installed each built prototype on a cluster of machines having exactly the same hardware configuration. We estimated the ingestion performance by measuring the time that each prototype needs in order to make the imported records become available for querying. We executed real-user aggregation queries to measure the response time while simulating various ingestion loads. By increasing the number of machines that are used to run the built prototypes, we were able to estimate the ability of each technology to scale. Finally, as a bonus, we will also share our subjective opinion regarding the ease of use, flexibility, customizability and available community support for each evaluated technology.

Speakers

Dragan Milosevic

Chief Search Architect, Zanox AG

Dr. Dragan Milosevic is a certified Solr/Lucene, Hadoop and HBase developer and currently works as Chief Search Architect at Zanox AG. The firm has successfully implemented several Apache open-source projects for building a world-class reporting framework. He is also the author of a book...

Thursday October 13, 2016 3:10pm - 3:50pm EDT
Gardner

Ecosystem

Skill Level Intermediate

10:30am EDT

Solr JDBC

One of the new features of Solr 6 is a JDBC driver that can be hooked up to various SQL clients and database visualization tools. Solr JDBC opens up a whole new set of use cases and lowers the barrier to entry for many users. I'll highlight the Solr JDBC feature, explain some use cases, and demonstrate connecting SQL clients. You will learn how Solr JDBC can unlock more potential from your Solr environment.

Speakers

Kevin Risden

Apache Lucene/Solr Committer; Hadoop and Search Tech Lead, Avalon Consulting, LLC

Kevin Risden, an Apache Lucene/Solr committer, has been consulting on search and Hadoop for over 3 years at Avalon Consulting, LLC. He has helped organizations successfully transform their big data into business results.

Friday October 14, 2016 10:30am - 11:10am EDT
Back Bay B Sheraton Boston

Exploring Solr

Committer Talk
Skill Level Intermediate

10:30am EDT

Building and running a Solr-as-a-Service for IBM Watson

Running a managed Solr service brings fun challenges with it, to both the users and the service itself. Users typically do not have access to all components of the Solr system (e.g. the ZK ensemble, the actual nodes that Solr runs on etc.). On the other hand, the service must ensure high availability at all times, and handle what are often user-driven tasks such as version upgrades, taking nodes offline for maintenance and more. In this talk I will describe how we tackle these challenges to build a managed Solr service on the cloud, which currently hosts a few thousand Solr clusters. I will focus on the infrastructure that we chose to run the Solr clusters on, as well as how we ensure high availability, cluster balancing and version upgrades.

Speakers

Shai Erera

STSM, Social Analytics & Technologies, IBM

Shai Erera is a Researcher at IBM Research, Haifa, Israel. Shai earned his M.Sc in Computer Science from the University of Haifa in 2007. Shai’s work experience includes the development of search-based systems over Lucene and Solr and he is also a Lucene/Solr committer.

Friday October 14, 2016 10:30am - 11:10am EDT
Back Bay A Sheraton Boston

Use Case

Committer Talk
Skill Level Intermediate

11:20am EDT

Working with deeply nested documents in SolrCloud

Until recently, Solr did not support deeply nested documents, but that has changed over the past few releases. While still not a popular use-case, Solr can now be used to handle deeply nested documents to perform search and faceting on them, like nested email threads, comments and replies on social media etc.

This talk will cover pointers on pre-processing data so that it can not only be consumed by Solr but also support complex search and statistical aggregations on top of it. It will also cover query formation for sample use cases of nested data and the multiple options and features that Solr provides for faceting or aggregating the documents. By the end of this talk, Solr users will have a better understanding of the features that Solr provides, how to work with them to find answers to interesting questions from deeply nested documents, the limitations that currently exist, and how to work around them.

Speakers

Anshum Gupta

Sr. Software Engineer, IBM Watson

Anshum Gupta is a Lucene/Solr committer and PMC member with over 10 years of experience with search. He is a part of the search team at IBM Watson, where he works on extending the limits and improving SolrCloud. Prior to this, he was a part of the open source team at Lucidworks and...

Alisa Zhila

Software Engineer, IBM Watson

Alisa Zhila is a software engineer in IBM Watson Core Technology. She graduated from Moscow Institute of Physics and Technology, Russia, and received a PhD in Computer Science from National Polytechnic Institute, Mexico. At IBM she has been working on transition of linguistically...

Friday October 14, 2016 11:20am - 12:00pm EDT
Commonwealth Sheraton Boston

Data Science

Committer Talk
Skill Level Intermediate

11:20am EDT

Time Series Processing with Solr and Spark

A lot of data is best represented as time series: Operational data, financial data, and even in data warehouses the dominant dimension is often time. We present Chronix, a time series database based on Apache Solr and Spark which is able to handle trillions of time series data points and perform interactive queries. Chronix Spark is open source software and battle-proven at a German car manufacturer and an international telco.

We demonstrate several real-life use cases of Chronix. Afterwards we lift the curtain and deep-dive into the Chronix architecture, esp. how we're using Solr to store time series data and how we've hooked up Solr with Spark. We provide some benchmarks showing how Chronix has outperformed other time series databases in both performance and storage efficiency.

Chronix is open source under the Apache License (http://chronix.io).

Speakers

Josef Adersberger

CTO, QAware

Josef Adersberger is co-founder & CTO of QAware, a German custom software development company and CNCF silver member. He studied computer science in Rosenheim and Munich and holds a doctoral degree in software engineering. He is currently responsible for a large-scale cloud migration...

Friday October 14, 2016 11:20am - 12:00pm EDT
Gardner

Ecosystem

Skill Level Intermediate

11:20am EDT

Building a Vibrant Search Ecosystem at Bloomberg

Search is a core technology that allows Bloomberg to deliver financial news and information quickly and reliably to our clients. The Search Infrastructure team has created a high performance, stable and scalable search ecosystem to support a large, complex and diverse set of search applications.

Providing search as a service to the thousands of developers in this demanding environment required us to take a holistic approach. In this talk we'll discuss both the organizational and technical challenges we've encountered and the approach we've taken to solve them. We'll dive into the details of our platform; from the way we engage with our tenants, interact with the Solr community, to the infrastructure and tools we use to manage, monitor and scale our platform.

Speakers

Steven Bower

Team Lead, Search Infrastructure, Bloomberg LP

Steven has worked for 16 years in the search industry, first as part of the R&D/Services teams at FAST Search & Transfer and then as a principal engineer at Attivio, Inc. He has participated in or led the delivery of hundreds of search applications and now leads the Search Infrastructure... Read More →

Ken LaPorte

Senior Software Engineer, Search Infrastructure, Bloomberg LP

Ken is a senior software engineer in the Search Infrastructure department at Bloomberg where he works with client teams to leverage Solr to solve business problems. Ken has been active in the search domain for 7 years and has worked on a wide variety of search problems, including... Read More →

Friday October 14, 2016 11:20am - 12:00pm EDT
Independence Sheraton Boston

Use Case

Skill Level Intermediate

1:10pm EDT

Large Scale Solr at FullStory

Come see how we're using Solr to make search FullStory's central feature. Learn about some of the problems we've run into scaling up a large Solr cluster at FullStory, and how we've solved them. Finally, I'll briefly introduce Solrman, the open source service we've released that monitors a Solr cluster and automatically optimizes how data is distributed across it.

Speakers

Scott Blum

Staff Software Engineer, FullStory, Inc.

Scott Blum is a committer on Apache Solr and Apache Curator. His background includes compiler work on Google Web Toolkit and distributed systems experience at Square and most recently FullStory.

Friday October 14, 2016 1:10pm - 1:40pm EDT
Back Bay A Sheraton Boston

Use Case

Committer Talk
Skill Level Intermediate

1:10pm EDT

Combining Content and Collaboration in Recommenders

Recommender systems are typically built on two different types of training data: historical user engagement, and the textual content of the items themselves (either descriptive text, tags, structured metadata, or the actual raw content of text items on their own). This talk is an introductory overview of how to build a recommender system which uses both types of inputs to build a “mixed-mode” recommender, where you can parameterize (at request time, in some cases!) how much you want to rely on content, and how much on collaborative filtering. We’ll walk through building a horizontally scalable parameterized recommender service from just three components: Solr, Spark, and, of course, training data.
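To make the "mixed-mode" idea concrete, here is a minimal sketch of blending a content-based score with a collaborative-filtering score using a request-time parameter. It is an illustration only, not the speaker's implementation: the `blend` function, the `alpha` parameter, and the sample scores are hypothetical, and both score sets are assumed to be normalized to the same scale already.

```python
from typing import Dict, List


def blend(content: Dict[str, float], collab: Dict[str, float], alpha: float) -> List[str]:
    """Rank items by alpha * content score + (1 - alpha) * collaborative score.

    alpha = 1.0 relies only on content; alpha = 0.0 relies only on
    collaborative filtering. Items missing from one source score 0 there.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    def score(item: str) -> float:
        return alpha * content.get(item, 0.0) + (1.0 - alpha) * collab.get(item, 0.0)

    items = set(content) | set(collab)
    # Sort by descending score; break ties by item id for a stable order.
    return sorted(items, key=lambda item: (-score(item), item))


# Hypothetical, pre-normalized scores for three items.
content_scores = {"a": 0.9, "b": 0.2, "c": 0.5}
collab_scores = {"a": 0.1, "b": 0.8, "c": 0.5}

print(blend(content_scores, collab_scores, alpha=1.0))  # content only
print(blend(content_scores, collab_scores, alpha=0.0))  # collaborative only
```

Because `alpha` is an ordinary argument, a service built this way could accept it per request rather than fixing it at training time.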

Speakers

Jake Mannix

Chief Data Engineer, Lucidworks

Jake Mannix is the Chief Data Engineer at Lucidworks. Before joining Lucidworks, Jake worked on the Semantic Scholar project at the Allen Institute for Artificial Intelligence, and prior to that was tech lead for Twitter’s data science and data engineering teams, building both the... Read More →

Friday October 14, 2016 1:10pm - 1:50pm EDT
Commonwealth Sheraton Boston

Data Science

Skill Level Intermediate

1:10pm EDT

Customizing Ranking Models in Solr to improve relevance for Enterprise Search

Solr provides a suite of built-in capabilities that offer a wide variety of relevance-related parameter tuning. Index-time and/or query-time boosts, along with function queries, can provide a great way to tweak various relevance-related parameters to help improve search result ranking. In the enterprise space, however, given the diversity of customers and documents, there is a much greater need for control over the ranking models and for the ability to run multiple custom ranking models.

At Salesforce, we have a multi-level ranking pipeline. The first ranker (L1) is the basic Lucene scoring based on tf-idf, and the second ranker (L2) implements more complex ranking models, ranging from something as trivial as a linear regression to more complex models such as a boosted decision tree. This L2 ranker inside Solr enables us to extract features for every document from within the Solr index and leverage them during ranking model execution. This talk discusses the motivation behind creating an L2 ranker and the use of a Solr Search Component for running different types of ranking models.
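As a rough illustration of the two-level idea (not Salesforce's implementation, and not a Solr API), the sketch below takes hits in first-pass (L1) order and re-orders the top few with a linear model over per-document features. The feature names, the weights, and the `rerank` helper are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical linear L2 model: feature name -> weight (illustrative values only).
L2_WEIGHTS: Dict[str, float] = {"l1_score": 0.5, "click_rate": 2.0, "freshness": 1.0}


def l2_score(features: Dict[str, float], weights: Dict[str, float] = L2_WEIGHTS) -> float:
    """Weighted sum of the document's features; missing features count as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())


def rerank(l1_hits: List[Tuple[str, Dict[str, float]]], top_n: int = 100) -> List[str]:
    """Re-order the first top_n L1 hits by L2 score; the tail keeps its L1 order."""
    head, tail = l1_hits[:top_n], l1_hits[top_n:]
    head_sorted = sorted(head, key=lambda hit: l2_score(hit[1]), reverse=True)
    return [doc_id for doc_id, _ in head_sorted + tail]


# Hits as (doc id, features), already sorted by the first-pass ranker.
hits = [
    ("doc-a", {"l1_score": 3.0, "click_rate": 0.1, "freshness": 0.2}),
    ("doc-b", {"l1_score": 2.0, "click_rate": 0.9, "freshness": 0.5}),
    ("doc-c", {"l1_score": 1.0, "click_rate": 0.0, "freshness": 0.0}),
]
print(rerank(hits, top_n=2))  # ['doc-b', 'doc-a', 'doc-c']
```

Restricting the second pass to the top `top_n` hits keeps the more expensive model off the long tail, which is the usual motivation for a multi-level pipeline.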

Speakers

Ammar Harris

Lead Member Technical Staff, Search Relevance, Salesforce

Ammar has been a member of the Search Relevance team at Salesforce for over two years. He has been working with the team to build out a new framework for Salesforce Search that would enable the team to train/test, experiment and ship multiple relevance models to production. Prior to joining... Read More →

Joe Zeimen

Senior Member of Technical Staff, Salesforce

Joe currently works on the Search Relevance team at Salesforce. He is helping to build out a new framework for Salesforce Search that enables the team to train, test and ship different relevance models to production. Prior to joining Salesforce over 3 years ago he earned his BS/MS... Read More →

Friday October 14, 2016 1:10pm - 1:50pm EDT
Back Bay B Sheraton Boston

Use Case

Skill Level Intermediate

1:10pm EDT

HHypermap: Heatmap Analytics of a Billion Tweets

The Harvard Center for Geographic Analysis has established the HHypermap (Harvard Hypermap) system, comprised of multiple open-source projects aimed at searching vast amounts of spatial data. This talk centers on a system based on SolrCloud that can do real-time search on a billion tweets with heatmap analytics of sentiment analysis. The open-source system is designed to be suitable for social media data sets or sensor data.

Harvard CGA commissioned Apache Lucene/Solr's heatmap faceting capability in 2015 and this work now continues in 2016. The first new part is computing numeric stats per cell (not just doc counts), which can be used for a variety of applications. The second part is improving Lucene's grid cell indexing scheme to cater to heatmaps, thus allowing heatmap generation to be very fast for large data sets.

This talk discusses the system design/architecture as well as the spatial details on how Lucene/Solr was improved.

Speakers

David Smiley

Search Developer & Consultant, D W Smiley LLC

Friday October 14, 2016 1:10pm - 1:50pm EDT
Independence Sheraton Boston

Use Case

Committer Talk
Skill Level Intermediate

2:00pm EDT

Reflected Intelligence: Lucene/Solr as a self-learning data system

What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?

In this presentation, you’ll learn how to do just that: how to evolve Lucene/Solr implementations into self-learning data systems that accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.

Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.

Speakers

Trey Grainger

SVP of Engineering, Lucidworks

Trey is the SVP of Engineering at Lucidworks, where he leads their engineering efforts around Lucidworks Fusion, Apache Lucene/Solr, and their other open source and commercial offerings. Trey is also the co-author of the book Solr in Action, as well as a published researcher and frequent... Read More →

Friday October 14, 2016 2:00pm - 2:40pm EDT
Independence Sheraton Boston

Data Science

Skill Level Intermediate

2:00pm EDT

How to run Solr on Docker. And why.

Docker is all the rage these days. While one doesn't hear much about Solr on Docker, we're here to tell you not only that it can be done, but also share how it's done.

We'll quickly go over the basic Docker ideas - containers are lighter than VMs, they solve "but it worked on my laptop" issues - so we can dive into the specifics of running Solr on Docker.

We'll do a live demo showing you how to run Solr master-slave as well as SolrCloud using containers, how to manage CPU assignments, constrain memory and use Docker data volumes when running Solr in containers. We will also show you how to create your own containers with custom configurations.

Finally, we'll address one of the core Solr questions - which deployment type should I use? We will demonstrate performance differences between the following deployment types:

- Single Solr instance running on a bare metal machine
- Multiple Solr instances running on a single bare metal machine
- Solr running in containers
- Solr running on a virtual machine
- Solr running on a virtual machine using a unikernel

For each deployment type we'll address how it impacts performance, operational flexibility and all other key pros and cons you ought to keep in mind.

Speakers

Radu Gheorghe

Search Consultant & Software Engineer, Sematext Group, Inc.

Radu Gheorghe is a search consultant, software engineer and trainer at Sematext, working mainly with Solr, Elasticsearch and logging-related projects.

Rafał Kuć

Software Engineer, Sematext Group, Inc.

Friday October 14, 2016 2:00pm - 2:40pm EDT
Commonwealth Sheraton Boston

Ecosystem

Skill Level Intermediate

2:00pm EDT

Coffee, Danish & Search: How to build a Solr-powered news search engine

We'll show how we have worked with Denmark's leading media analysis company on a successful project to migrate their entire search framework from Autonomy IDOL & Verity to one based on SolrCloud and our own Luwak stored search library, itself based on Lucene. We'll describe how we helped the client translate thousands of existing queries to their own query language, enhanced Solr wildcard search performance, built custom highlighting, extended Solr logging, and developed a framework to handle multiple languages (including one spoken by only 66,000 people). We'll show how the migration achieved practically zero negative change in precision/recall and how the continuing partnership with our client enables further feature development as necessary.

Speakers

Charlie Hull

Managing Director, Flax

Charlie is the co-founder of Flax, the UK's leading specialists in open source search. The Flax team have decades of experience in delivering accurate, fast and scalable solutions to a wide range of UK and international clients. Charlie runs the London Lucene/Solr Meetup, is known... Read More →

Alan Woodward

Director, Flax

Alan Woodward worked for many years at Proquest on a large scale multinode installation of the FAST ESP search engine, and gained skills in managing search applications over hundreds of millions of documents. Alan is a Lucene/Solr Committer. At Flax Alan has worked on the development... Read More →

Friday October 14, 2016 2:00pm - 2:40pm EDT
Back Bay A Sheraton Boston

Use Case

Committer Talk
Skill Level Intermediate

2:50pm EDT

Your Big Data Stack is Too Big

While technologies such as Spark, Hadoop, and Solr have come a long way over the past couple of years, companies continue to struggle to convert all this innovation into successful business outcomes. Too often, big data projects run over budget and fail to deliver ROI. Instead, companies are left with a bloated stack of complex technologies that are cumbersome to maintain and are slow to adapt to new business requirements. Once the consultants have left the building, the big data platform fails to keep up with demands for better access to larger and more complex enterprise data sets.

In this talk, Tim presents a better way to go about big data analytics using Lucidworks Fusion. Attendees will come away with actionable insights into solving common big data problems such as scaling data ingest from any source, providing both full-text search and SQL query capabilities for the same data set, and leveraging machine learning. The goal of this talk is to parse through the hype of big data and show how a lean, tightly integrated stack built on Solr and Spark provides all you need to do big data right.

Speakers

Timothy Potter

Senior Software Engineer, Lucidworks

Timothy Potter is a senior member of the engineering team at Lucidworks and PMC member of the Apache Lucene/Solr project. At Lucidworks, Tim leads a team that builds tools to empower business analysts and data scientists to search, analyze, and visualize large-scale enterprise data... Read More →

Friday October 14, 2016 2:50pm - 3:30pm EDT
Back Bay B Sheraton Boston

Data Science

Committer Talk
Skill Level Intermediate

2:50pm EDT

Solr Graph Query

This is an overview of the new Solr Graph Query. We will discuss the semantics of this new query operator in Lucene/Solr and how it can be used to solve real-world knowledge graph problems. We will discuss how to handle data that is graph-like in nature and cover items such as social networking search, recommendation engines, security filtering, and how to use knowledge graphs and ontologies to draw conclusions, all using the Solr Graph Query.

Speakers

Kevin Watters

Founder, KMW Technology

Kevin is a longtime user of and contributor to Solr. He runs a small search engine professional services firm in Boston called KMW Technology.

Friday October 14, 2016 2:50pm - 3:30pm EDT
Back Bay A Sheraton Boston

Exploring Solr

Skill Level Intermediate

2:50pm EDT

Using Apache Solr for Images As Big Data: A Case Study

'Images as big data' is an especially interesting topic in the era of high-performance systems based on Solr, Hadoop, and Apache Spark. Machine learning and image analysis packages are readily available to apply to this problem, and high-quality industrial applications may be built from off-the-shelf third-party components. In this talk, we will discuss a case study based on an 'image as big data' analytical system, the Image as Big Data Toolkit (IABDT). IABDT uses Lucene, distributed Lucene, Solr, and Hadoop as key component technologies. We will present examples of IABDT in action, using Solr as a key search technology in its implementation, show a medical image case study, and discuss future work and extensions of the IABDT system.

Speakers

Kerry Koitzsch

Project Lead / Principal Software Engineer, Kildane Software Technologies Inc.

Kerry Koitzsch has more than twenty years of experience in the computer science, image processing, and software engineering fields, and has worked extensively with Apache Lucene and Solr technologies in particular. Kerry specializes in software consulting involving customized... Read More →

Friday October 14, 2016 2:50pm - 3:30pm EDT
Independence Sheraton Boston

Use Case

Skill Level Intermediate