

I added a screenshot of term_data.txt so that you can see its contents and recreate the file on your own PC.

Contents of term_data.txt:
12 745 1459 1000000
Context for remaining questions
Natural Language Processing (NLP) is the field of computer science that is concerned with using computers to make sense of language as it is spoken naturally. One of the most commonly used formulas is "tf-idf". We use tf-idf to quantify how important a particular word -- also called a term in this context -- is to the document it appears in.
"tf" stands for "term frequency". We calculate it by taking the number of times a term appears in the document, and dividing it by the total term count (just like the word count of an essay), like so:
tf = (number of times the term appeared in the document)/(total word count for the document)
"idf" stands for "inverse document frequency". Document frequency measures how many documents in your entire document collection contain the term at least once. A document collection could be a collection of books or articles, or all of the webpages returned by a search, or all the reviews of a single product on Amazon, etc. The idf factor lets us lower the importance of words that are so common in the collection that their presence in a document tells us nothing.
For instance, most documents will contain common words like "the" or "and" many times. That doesn't mean that the document is about those words.
Document frequency is calculated like so:
df = (number of documents the term appears in at least once)/(total number of documents in the collection)
To get inverse document frequency, you just divide one by the document frequency like so:
idf = 1/df = (total number of documents in the collection)/(number of documents the term appears in at least once)
Finally, to calculate tf-idf, you multiply tf by idf. This is mathematically identical to dividing tf by df.
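The claim that tf * idf equals tf / df can be checked numerically. The sketch below is only an illustration (the function name `tfIdfFormsAgree` is made up for this purpose); it compares the two forms with a small relative tolerance, because floating-point rounding can make them differ in the last digits even though they are algebraically identical.

```cpp
#include <cmath>

// Hypothetical check: returns true when tf * idf and tf / df agree
// up to floating-point rounding for the given counts.
// Assumes length, docCount, and totalDocs are all positive.
bool tfIdfFormsAgree(int termCount, int length, long docCount, long totalDocs) {
    const double tf  = static_cast<double>(termCount) / length;
    const double df  = static_cast<double>(docCount) / totalDocs;
    const double idf = static_cast<double>(totalDocs) / docCount;

    const double viaProduct  = tf * idf;  // multiply tf by idf
    const double viaQuotient = tf / df;   // divide tf by df

    // Relative comparison: exact equality is not guaranteed for doubles.
    return std::fabs(viaProduct - viaQuotient) <= 1e-9 * std::fabs(viaProduct);
}
```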
Question 2
a) Write code that opens the file "term_data.txt" and loads data into the following variables, in this order:
termCount = number of times the term appeared in the document
length = total word count for the document
docCount = number of documents the term appears in at least once
totalDocs = total number of documents in the collection
Hint: You will need to include the right header file to complete this question.
b) Continue by adding code that calculates tf, idf, and tf-idf, and prints all three to the console.