Fixed small errors and added info

Angelogeb · Angelogeb · commit ccb168d2da28 · 2018-11-11T11:29:58.000+01:00
diff --git a/hw2/README.md b/hw2/README.md
@@ -1,5 +1,10 @@
 # Data Mining Homework 2
 
+# Report
+
+The report can be found in `report.html`. It loads MathJax from a CDN, so
+without a network connection the formulas will not render.
+
 # Problem 1
 
 The program requires flask as a dependency otherwise just run the
@@ -20,3 +25,8 @@ See usage section and examples in report.html
 $ cd p3
 $ ./spark_lsh.py
 ```
+
+# Notes
+
+Problems 2 and 3 require the mmh3 package, although they also include a plain
+Python implementation.
diff --git a/hw2/report.html b/hw2/report.html
@@ -73,7 +73,7 @@ <h3 id="conclusions-and-instructions-to-run">Conclusions and instructions to run
 <p>The system is written in <code>python3</code> and presented through a web interface served by a Flask server. To launch it, go in the <code>p1/</code> folder and execute <code>server.py</code>. After some seconds a new tab will be opened in the web browser with the interface to the tool.</p>
 <p><img src="images/q1.png" alt="First" /> <img src="images/q2.png" /></p>
 <p>The decisions made work well overall. Some examples of queries and results are shown above.</p>
-<p>In the first query a well known issue of cosine-similarity is shown: shorter documents get higher ranking since normalization penalizes longer documents. Indeed even if the first document contains only one of the query terms, still achieves a higher score than a longer document containing both terms.</p>
+<p>In the first query a well known issue of cosine-similarity is shown: shorter documents get higher ranking since normalization penalizes longer documents. Indeed even if the first document contains fewer occurrences of the query terms, it still achieves a higher score than a longer document containing more occurrences of the terms.</p>
 <p>In the second example the query issued is the first document retrieved. As shown in this case the score achieved by the first document is way higher than the average cases given that the query is the document itself. Moreover also the processing time increases drastically.</p>
 <h2 id="problem-2">Problem 2</h2>
 <p>For this problem the <code>preprocessed_announcements.tsv</code> file produced by the preprocessing phase of the previous problem is used.</p>
diff --git a/hw2/report.txt b/hw2/report.txt
@@ -91,8 +91,8 @@ queries and results are shown above.
 In the first query a well known issue of cosine-similarity is shown:
 shorter documents get higher ranking since normalization penalizes longer
 documents. Indeed even if the first document
-contains only one of the query terms, still achieves a higher score than
-a longer document containing both terms.
+contains fewer occurrences of the query terms, it still achieves a higher
+score than a longer document containing more occurrences of the terms.
 
 In the second example the query issued is the first document retrieved.
 As shown in this case the score achieved by the first document is way