Survey
Any survey has to be evaluated. There are two main points that need to be analyzed. (1) Statistical significance, starting from a representative sample size (either of data for ML labeling or of people for social/opinion surveys). (2) Face validity: does the survey measure what we intend to measure? This needs a discussion, especially of threats to validity. Both points are intertwined.
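To give the first point some substance, the usual normal-approximation formula for estimating a proportion gives a rough lower bound on the sample size. The sketch below is only illustrative; the 95% confidence level and 10% margin of error are assumptions for the example, not the parameters of our survey.

```python
import math

def sample_size_for_proportion(z=1.96, margin_of_error=0.10, p=0.5):
    # Normal-approximation sample size for estimating a proportion;
    # p = 0.5 is the worst case (maximum variance).
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

# Example: 95% confidence (z = 1.96), +/- 10 percentage points -> 97 items.
print(sample_size_for_proportion())
```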
For our survey, we intend to measure, for each article, whether it can be detected as describing a software language. We want to take this measure to evaluate an ML classifier. So the decision cannot be made based on background knowledge alone, but only based on knowledge that is present in the text. If it is not made explicit in the text, no machine can recognize it, and the article should not be added even if the title is 'Java (programming language)'. If an expert still claims that it is a language even though this is not explicitly stated, this corrupts the measurement. The corruption is a threat to the usefulness of the survey. We tried to counter it by:
- Providing an informal definition of what a software language is. (This definition may not be final, because final definitions are complex. A final definition also depends on the task for which it is used. If we want to recognize languages used in projects, this in turn implies a certain useful definition of a software language, which is why we state that there are digital artifacts and not just sketches on paper. This thought goes in the direction of how ontologies are evaluated based on the task they are used for, and of how multiple ontologies exist for the same domain.)
- Mixing in 9 seed articles: 5 at the beginning and 4 randomly mixed in. This allows us to measure whether a participant has the ability to recognize a description of a software language (see the sketch after this list). We only count as seed articles those whose summary explicitly describes a language; from such a description we have to be able to learn further knowledge about the language, otherwise it is just a casual mention and worthless information.
- The mixed-in seed articles reduce fatigue. As stated in the abstract: "Be careful not to miss relevant articles." If fewer software language articles existed in a questionnaire, the reader might become more careless.
- We recommend voting by tendency and commenting in difficult cases. Sometimes even explicit descriptions are not clearly formulated and are a matter of interpretation. Natural language is context-sensitive. Since natural language text provides the decision ground, we have to acknowledge that some things may be a matter of interpretation.
- Embedding the Wikipedia article also reduces fatigue and the risk that someone accidentally closes the tab in which the survey is open.
- The hints help with the decision ground and prime towards a certain way of decision making. The participant is not supposed to become careless and decide by an article's title or only by the first sentence. The hints raise awareness that we are interested in a fine-grained level of detection and that some articles describe multiple major topics instead of just one.
- We tracked the time every participant took per page (a sketch of how this can be analyzed follows below). The time spent on the page that contains the primer might tell us how well someone read what to look out for. The time should also go up for difficult questions.
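The following is a minimal sketch of how the seed articles could be interleaved with the candidate articles and used as a competence check. The identifiers, data shapes, and function names are made up for illustration and do not reflect our actual survey tooling.

```python
import random

def build_questionnaire(seed_articles, candidate_articles, n_upfront=5, rng=None):
    """Place the first n_upfront seed articles at the start and insert the
    remaining seeds at random positions among the candidate articles."""
    rng = rng or random.Random()
    upfront, scattered = seed_articles[:n_upfront], seed_articles[n_upfront:]
    body = list(candidate_articles)
    for seed in scattered:
        body.insert(rng.randrange(len(body) + 1), seed)
    return upfront + body

def seed_recall(answers, seed_articles):
    """Fraction of seed articles a participant marked as describing a
    software language; answers maps article id -> bool."""
    return sum(1 for a in seed_articles if answers.get(a, False)) / len(seed_articles)

# Illustrative usage with made-up identifiers.
seeds = [f"seed_{i}" for i in range(9)]
candidates = [f"candidate_{i}" for i in range(40)]
questionnaire = build_questionnaire(seeds, candidates, rng=random.Random(0))
```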
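Similarly, a minimal sketch of how the recorded time per page could be used to flag participants who skimmed the primer; the page name and the 30-second threshold are arbitrary assumptions.

```python
def flag_rushed_participants(times_per_page, primer_page="primer", min_seconds=30):
    """Return participants whose time on the primer page is below a threshold;
    times_per_page maps participant id -> {page name: seconds}."""
    return [pid for pid, pages in times_per_page.items()
            if pages.get(primer_page, 0) < min_seconds]

# Illustrative data: "p2" spent only 12 seconds on the primer page.
times = {
    "p1": {"primer": 95, "page_1": 40},
    "p2": {"primer": 12, "page_1": 35},
}
print(flag_rushed_participants(times))  # ['p2']
```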
Altogether, subjectivity and personal understanding of when an article describes a software language are reduced, because a decision ground is provided. The only remaining problem is that participants might not follow the hints and still decide by gut feeling (and by the given definition alone).