Grouping websites automatically
Grouping websites based on their homepage, using text and HTML information.
Each day, thousands of websites are created and need to be monitored, both to gain insights and to spot malicious ones as soon as possible. To keep track of what is going on with new websites, we created a solution that clusters websites based on their textual content.
Web pages are a mix of images, text and general layout information. To extract data from those pages, we built a scraper. Even with the data extracted, grouping homepages remains challenging: a homepage can contain a lot of text or very little, can mix several languages, and can use uncommon words. Websites that convey their content mostly through images are an extra challenge and require a different processing step.
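As an illustration of the extraction step, here is a minimal sketch that pulls the readable text out of a homepage's HTML using only the Python standard library. It is not the scraper described above: the class name `VisibleTextExtractor`, the function `extract_homepage_text` and the idea of counting `<img>` tags as a rough signal for image-heavy pages are all assumptions made for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text a visitor would read, skipping script and style content."""

    # Tags whose content is never shown to the reader (hypothetical choice).
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []       # pieces of visible text, in document order
        self.image_count = 0   # number of <img> tags seen

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "img":
            self.image_count += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_homepage_text(html):
    """Return (visible_text, image_count) for one HTML document."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.image_count
```

A page with many images and almost no text could then be routed to the separate image-oriented processing step; where to draw that line would have to be tuned on real data.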
Using machine learning, we were able to group websites into more than 10 categories. Some categories correspond to a business sector, such as hospitality, consultancy or home-related activities. Others are more general, such as reserved domains or informative websites.
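The text does not say which algorithm was used, so the sketch below is only a stand-in that shows the general idea of clustering on textual content: each homepage text becomes a TF-IDF vector, and a simple greedy "leader" clustering assigns it to the most similar existing group or starts a new one. The function names and the `threshold` value are assumptions for this example, and a real system would likely rely on a dedicated machine learning library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF keeps weights positive even for very common terms.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def leader_cluster(vectors, threshold=0.2):
    """Greedy clustering: join the most similar leader or become a new one."""
    leaders = []  # first vector seen in each cluster
    labels = []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, leader in enumerate(leaders):
            sim = cosine(vec, leader)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            leaders.append(vec)
            best = len(leaders) - 1
        labels.append(best)
    return labels
```

Calling `leader_cluster(tfidf_vectors(texts))` returns one integer label per homepage. The number of groups is not fixed in advance here; it depends on the threshold, which is one reason this sketch should not be read as the method behind the categories described above.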
Thanks to this solution, it is possible to monitor trends in websites that are newly created or that change their content, which helps check that each website complies with regulations and is not malicious.