SimCloud: Similarity-aware Data Analysis in Cloud-based Systems

Similarity-aware operations, e.g., Similarity Join, Similarity Selection, and Similarity Grouping, are among the most useful data processing and analysis operations. Many application scenarios need to perform these operations over large amounts of data. Internet companies, for instance, collect massive amounts of data, such as content produced by web crawlers or service logs, and can use similarity queries to better understand how their services are used, e.g., to identify customers with similar buying patterns, generate recommendations, or perform correlation analysis. Cloud systems and MapReduce, their main framework for distributed processing, address the need to process massive amounts of data in a highly scalable and distributed fashion.

The main goals of the SimCloud project are to study, design, implement and evaluate similarity operators for cloud systems. We are particularly interested in techniques that use the MapReduce framework.
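To make the idea of a MapReduce-style similarity operator concrete, here is a minimal single-machine sketch of a distance-threshold Similarity Join over one-dimensional numeric values. It is an illustration only, not MRSimJoin or any other algorithm from our papers: the function names (map_phase, shuffle, reduce_phase, similarity_join), the grid partitioning with cells of width eps, and the in-memory shuffle are all assumptions chosen for brevity.

```python
from collections import defaultdict
from math import floor


def map_phase(r_values, s_values, eps):
    """Emit (cell, (tag, value)) pairs for a grid with cells of width eps.

    Each R value goes to its own cell only. Each S value is replicated to its
    own cell and the two neighboring cells, so every S value within eps of an
    R value is guaranteed to reach that R value's cell.
    """
    for r in r_values:
        yield floor(r / eps), ("R", r)
    for s in s_values:
        cell = floor(s / eps)
        for c in (cell - 1, cell, cell + 1):
            yield c, ("S", s)


def shuffle(pairs):
    """Group intermediate values by key (stands in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, eps):
    """Within each cell, compare every R value against every S value."""
    for records in groups.values():
        rs = [v for tag, v in records if tag == "R"]
        ss = [v for tag, v in records if tag == "S"]
        for r in rs:
            for s in ss:
                if abs(r - s) <= eps:
                    yield (r, s)


def similarity_join(r_values, s_values, eps):
    """Return all pairs (r, s) with |r - s| <= eps, sorted for readability."""
    return sorted(reduce_phase(shuffle(map_phase(r_values, s_values, eps)), eps))


if __name__ == "__main__":
    print(similarity_join([1, 5, 10], [2, 7, 10], 1))
```

Because each R value is sent to exactly one cell, a matching pair is produced in only one reducer, so no duplicate elimination step is needed in this sketch. A real deployment would have to deal with data skew, multi-dimensional or metric data, and the cost of replicating S, which this example ignores.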

News:

Paper accepted in the International Conference on Similarity Search and Applications (SISAP '23)! (August, 2023)

Our paper "Diversity Similarity Join for Big Data" co-authored by Y.N. Silva, undergraduate students J. Martinez and P. Castro Cea, and Brazilian collaborators H. Razente and M. C. Nardini Barioni, has been accepted as a full paper in the 16th International Conference on Similarity Search and Applications (SISAP 2023). This paper focuses on the design and evaluation of a highly distributed operator to reduce and diversify the output of the similarity join operation using big datasets. We look forward to sharing the results of this work in La Coruña, Spain later this year!

Paper accepted in Information Systems! (January, 2020)

The paper "Pivot-based Approximate k-NN Similarity Joins for Big High-dimensional Data" co-authored by a previous visiting scholar (P. Cech), J. Lokoc and Y. N. Silva was accepted for publication in the Information Systems journal (Elsevier). This paper focuses on the implementation and evaluation of k-NN Similarity Join algorithms for high-dimensional data using the Spark and Hadoop frameworks.

Paper accepted in the International Conference on Similarity Search and Applications (SISAP '19)! (July, 2019)

The paper "Similarity Grouping in Big Data Systems" co-authored by Y.N. Silva, three undergraduate students (M. Sandoval, D. Prado, X. Wallace) and a former visiting scholar (C. Rong) has been accepted in the 12th International Conference on Similarity Search and Applications (SISAP 2019). The paper focuses on the design and implementation of similarity grouping algorithms for Big Data. We look forward to presenting our work in Newark, NJ.

Undergraduate research proposal funded (May, 2018)

Our undergraduate research proposal entitled "Similarity Grouping for Big Data" has been awarded funding from the Western Alliance to Expand Student Opportunities (WAESO). This award will support the work of four students in our team (Diana Prado, Manuel Sandoval Madrigal, Emerson Cristal and Xavier Wallace). Thank you WAESO!

Paper accepted in the International Conference on Data Engineering (ICDE '17)! (Jan., 2017)

Our research paper "Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics" co-authored by C. Rong, C. Lin, Y. N. Silva and J. Wang has been accepted in the 33rd International Conference on Data Engineering (ICDE 2017). The paper proposes an efficient approach to solve the set similarity join problem using a novel partitioning approach that avoids the generation of duplicate records. We look forward to presenting our work in San Diego, CA, USA.

Paper accepted in the International Conference on Similarity Search and Applications (SISAP '16)! (Aug., 2016)

The paper "An Experimental Survey of MapReduce-based Similarity Joins" co-authored by Y.N. Silva, three previous undergraduate students (J.M. Reed, A. Wadsworth, and K. Brown) and a visiting scholar (C. Rong) has been accepted in the 9th International Conference on Similarity Search and Applications (SISAP 2016). The paper focuses on the study, classification and experimental comparison of previously proposed Similarity Join algorithms for Big Data. We look forward to presenting our work in Tokyo, Japan.

Paper accepted in Frontiers of Computer Science (FCS) (Feb., 2016)

The paper "String Similarity Join with Different Similarity Thresholds Based on Novel Indexing Techniques" co-authored by C. Rong, Y. N. Silva and C. Li has been accepted in the Frontiers of Computer Science journal (FCS). In this paper, we focus on the study of similarity operators that support multiple similarity thresholds.

NCUIRE Award Supports the SimCloud Project (May 2015)

Adelbert (A.J.) Wadsworth, one of our SimCloud team members, has received an NCUIRE Scholarship award to work with Prof. Silva during the summer of 2015. As part of this project, A.J. and Prof. Silva will explore Similarity Grouping techniques for Big Data. NCUIRE is a New College (ASU) program that supports meaningful research collaborations between faculty and undergraduates. Congratulations to A.J. and Prof. Silva!

We received Amazon Research and SRCA Grants! (April, 2015)

Our proposal "Similarity Grouping for Big Data" recently received an Amazon Web Services (AWS) Research grant. In addition, our proposal "Study, Classification and Benchmarking of Similarity Join Algorithms for Big Data" was awarded a grant from the Scholarship, Research and Creative Activities Grant Program (SRCA) at ASU - New College. We look forward to continuing our work on new techniques to identify similarities in large datasets.

CRC Book Chapter!

Our team wrote a chapter for the Geographical Information Systems: Trends and Technologies book to be published in March 2014 (CRC). The chapter, "Similarity Join for Big Geographic Data", focuses on the study of highly scalable similarity joins with geographic data and distance functions.

Featured in ASU News!

An ASU News article highlighted the participation of undergraduate student Jason Reed and Dr. Silva in VLDB/Cloud-I 2012, Istanbul, Turkey. Read it here, and be sure to keep posted for more SimCloud News!

 

WAESO Scholarship Awarded to Lisa Tsosie

A big thanks to the Western Alliance to Expand Student Opportunities (WAESO) for approving the proposal submitted for the Faculty-Directed Undergraduate Research Program and awarding Lisa Tsosie a scholarship for the Fall semester! Stay posted for progress in the current project.

VLDB 2012 Cloud-I Workshop Paper Presented

The presentation went great and we really enjoyed meeting everyone there. The conference had a large number of very interesting and exciting presentations, and it is great to see our hard work there next to these other exceptional works. For details of our presentation, please go to our Publications page.

Featured in ASU Newsletter!

An article by Matt Crum covered topics of our research project as well as another project supervised by Dr. Silva in the ASU New College Newsletter. Read the article for more information and for Jason's take on the project.

 

VLDB 2012 Travel Fellowship Awarded to Jason Reed

Congratulations! Our very own Jason Reed has been awarded the VLDB 2012 Travel Fellowship for travel to VLDB 2012 in Istanbul. He will be attending the VLDB 2012 conference and helping to present "MapReduce-based Similarity Join for Metric Spaces" at the Cloud-I workshop. See you all in Istanbul!

VLDB 2012 Cloud-I Workshop Paper Accepted

Our submission to the VLDB Cloud-I workshop, which presents the MRSimJoin algorithm, has been accepted! We look forward to seeing you in Istanbul. Stay posted for pictures from the conference.

SIGMOD 2012 Demo Paper and Poster Presented

The demonstration went well! We had very interesting conversations and are particularly excited about the possibility of collaborating with industrial lab researchers interested in using our MRSimJoin algorithm in real-world applications. See our Posters page for details of the presentation.

SIGMOD 2012 Demo Paper and Poster Accepted

Our submission of the Demonstration of MRSimJoin has been accepted at SIGMOD 2012! See you in Scottsdale. Keep posted for pictures from the conference.

New College Student Expo 2012 Poster Presentation

We're presenting the results of our NCUIRE research at the ASU New College Student Expo! See our Posters page for details of the poster exhibition.

Presenting at Inquire about NCUIRE

We presented results of our NCUIRE project, "Cloud Similarity Join for Multi-Dimensional Data", at the Inquire about NCUIRE event! This was an opportunity for the NCUIRE scholars/fellows to present their research projects and interact with other participating students and students "inquiring" about the program. See our Posters page for more details.

NCUIRE 2011 Awarded to Jason Reed

Congratulations to our undergraduate student, Jason Reed, for applying for and receiving the NCUIRE Scholarship for the academic terms of Spring and Fall of 2011. Competition's tough, but we're tougher! With the help of the scholarship, we'll be pursuing research in exploiting MapReduce-based Similarity Joins using the Hadoop platform. Keep posted for progress.