Nt1330 Unit 1 Assignment

Assignment Goals: For this assignment, I built on a provided base of code to develop a functioning web crawler. The crawler needed to accept a starting URL and build a URL frontier queue of "out links" to be explored further, tracking the number of queued URLs and ceasing to add new ones once the queue reached 500 links. It also needed to extract each page's text, stripping HTML tags and formatting; the assignment instructions suggested using the BeautifulSoup module for this, which I chose to do. Finally, the web crawler program needed to report metrics including the number of documents (web pages) crawled, the number of tokens extracted and processed, and the number of unique terms added to the term dictionary.
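The core logic described above can be sketched in Python. This is a minimal illustration, not the assignment's actual base code: it parses a static HTML string rather than fetching live pages, and it uses Python's built-in `html.parser` as a stand-in for BeautifulSoup so the example is self-contained. The 500-link cap and the link/text extraction mirror the requirements described above; all names here are my own.

```python
from collections import deque
from html.parser import HTMLParser

MAX_FRONTIER = 500  # assignment's cap on queued out-links


class LinkAndTextExtractor(HTMLParser):
    """Collects href out-links and visible text from one page.

    Stdlib stand-in for the BeautifulSoup parsing the assignment used.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip = 0  # depth inside <script>/<style>, whose text we ignore

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def process_page(html, frontier, seen):
    """Add unseen out-links to the frontier (up to the cap) and return page text."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    for url in parser.links:
        if url not in seen and len(frontier) < MAX_FRONTIER:
            seen.add(url)
            frontier.append(url)
    return " ".join(parser.text_parts)


# Demo on a small static page (no network access needed).
html = ('<html><body><p>Local news today.</p>'
        '<a href="http://example.com/a">A</a>'
        '<a href="http://example.com/b">B</a></body></html>')
frontier, seen = deque(), set()
text = process_page(html, frontier, seen)
```

A real run would pop URLs off `frontier`, fetch each page, and call `process_page` again until the frontier is exhausted; the `seen` set keeps duplicate links from being queued twice.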

I found guidance on how to do so in a post on Stack Overflow (Kheir, 2016). Initially, I attempted to install the most current version of the BeautifulSoup module, but it was not the version that our provided base code runs on. After correcting a minor copy-and-paste error on the command line, I installed the module successfully.

[Figure: Completed BeautifulSoup Installation]

After that, I made some minor adjustments to the code to accommodate my system and the location where the final database would be stored. I included the PorterStemmer code we had used previously and verified that there were no issues with the indexer portion of the program, which builds the database. For my run of the web crawler, I used the starting site recommended in the assignment instructions, http://www.hometownlife.com, and allowed it to run its course, which took over 3 hours.

[Figure: Initializing Web Crawler]
[Figure: Web Crawler Mid-run]
[Figure: Web Crawl Complete, Creating]
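The indexing step described above, stemming tokens and tallying the three reported metrics, can be sketched as follows. This is an illustrative sketch, not the assignment's indexer: `simple_stem` is a deliberately crude stand-in for the real Porter stemmer (which follows a much larger rule set), and the document list is made-up sample data.

```python
import re
from collections import Counter


def simple_stem(token):
    """Very rough stand-in for the Porter stemmer used in the assignment:
    lowercases the token and strips a few common suffixes. Illustrative only."""
    token = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_documents(docs):
    """Tally the three metrics the crawler reports: documents crawled,
    tokens processed, and unique terms in the term dictionary."""
    term_counts = Counter()
    total_tokens = 0
    for text in docs:
        tokens = re.findall(r"[a-zA-Z]+", text)  # simple alphabetic tokenizer
        total_tokens += len(tokens)
        term_counts.update(simple_stem(t) for t in tokens)
    return {
        "documents": len(docs),
        "tokens": total_tokens,
        "unique_terms": len(term_counts),
    }


# Sample documents standing in for extracted page text.
docs = ["Crawling pages and crawled links", "Pages link to other pages"]
metrics = index_documents(docs)
```

Because "Crawling" and "crawled" reduce to the same stem, the unique-term count comes out lower than the raw token count, which is exactly why the assignment distinguishes the two metrics.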
