Hi, if we are crawling web pages that contain images and videos, how should we extend this system to store the images and videos of the crawled websites (storage type, etc.)? Can we use AWS S3 (or any other object store)?
Since each element in an HTML page is also identified by a URL that can be fetched over HTTP, we can configure our crawler to read the HTTP headers first, check the Content-Type, and, based on the type of content, store it in a different place or process it differently. Of course, downloading videos and other media content would increase storage requirements and processing time.
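As a rough illustration of the Content-Type check, here is a minimal Python sketch. The function names (`classify`, `fetch_content_type`) and the route labels (`parse_links`, `store_media`, `skip`) are made up for this example and are not part of the system being discussed. It assumes the server answers HEAD requests; some servers don't, in which case you would fall back to a GET and read the headers from that response.

```python
from urllib.request import Request, urlopen


def classify(content_type_header):
    """Map a Content-Type header value to a (hypothetical) crawler route."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = (content_type_header or "").split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return "parse_links"   # HTML: extract more URLs from it
    if media_type.startswith(("image/", "video/", "audio/")):
        return "store_media"   # media: hand over to the storage pipeline
    return "skip"              # anything else is ignored in this sketch


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header (may be empty)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")
```

You would call it as `classify(fetch_content_type(url))` and dispatch on the returned label, so the body of a large video is only downloaded once you have decided you want it.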
To store this kind of data, I don’t see any problem with using AWS S3. Its contents are easy to cache in CDNs, which can help if the crawler feeds some other application that consumes the videos you downloaded. You could also use HDFS, the distributed file system used by Hadoop, which offers high throughput for large files and replicates data for redundancy, and you can run batch jobs on it to extract metadata, thumbnails, etc.
Hope this helps!
To build on what Artur said in response to Jayesh’s question: the difference between handling a regular HTML URL and a media URL (say, a .jpg) is that the HTML URL requires further processing (the crawler needs to identify more URLs contained in the HTML), whereas the media file should be processed in a different way (e.g., calculate a checksum and map it, then send the file to storage, etc.).
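The checksum-and-map step for media files could look roughly like the sketch below. Everything here is an assumption for illustration: SHA-256 as the checksum, a plain dict as the URL-to-checksum map, and a local directory standing in for the object store (with S3 or HDFS you would replace the file write with an upload keyed by the same digest).

```python
import hashlib
from pathlib import Path


def store_media(url, data, root, index):
    """Store media bytes under their checksum and record url -> checksum.

    Returns (digest, is_new); is_new is False when identical bytes were
    already stored, so duplicate media is kept only once.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Shard by the first two hex characters to avoid one huge directory.
    path = Path(root) / digest[:2] / digest
    is_new = not path.exists()
    if is_new:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    index[url] = digest  # the "map" step: which URL points at which content
    return digest, is_new
```

Because the storage key is derived from the content, the same image served from two different URLs is written once and both URLs map to the same digest.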