Does AI speak English? Yes, it does! But it doesn't understand English the way we do. What does that mean? Well, under the hood, AI only understands numbers. It grasps word meanings, and how those meanings shift with context, through embedding representations. To understand this further, read the following set of lines.
“Jimmy was smoking a cigar while he prepared a smoked barbecue steak for himself. Meanwhile, he noticed smoke emanating from a room in the neighbor's house. He waved at her. She had smoky eyes, as she was dressing up for a party.” How do you think AI would understand and interpret the word 'smoke', whose meaning is different in each of these sentences?
Additionally, AI associates words with numbers in order to make sense of English and produce the results we are accustomed to. Text-based input is transformed into a format AI can work with, and the output is then converted back into English. These techniques enable AI to recognize words and associate them with the meanings of different texts (the same applies to images, videos, and other types of data, but let's focus on text for now).
From this point on, those of you who learn by conjuring vibrant imagery in your mind, rub your hands and set your brain cells to full throttle. Each word is represented as a vector in a multidimensional space, and each vector denotes the characteristics and meaning associated with that word. (Think of it as your brain mapping the meanings of different words according to context. To help you visualize, picture a word floating in empty space, connected to other words with similar meanings, and shifting position as it appears in different contexts.)
The core part of helping AI numerically map out these word associations is done through embedding representations. During the embedding process, the provided text is split into tokens, which are words or sub-words (a step called tokenization), as sketched below.
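Here is a minimal sketch of that step, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary are available (both are illustrative choices, not the only way to tokenize):

```python
# A minimal tokenization sketch, assuming the Hugging Face "transformers"
# library and the "bert-base-uncased" checkpoint are installed and available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "Jimmy was smoking a cigar while he prepared a smoked barbecue steak."
tokens = tokenizer.tokenize(sentence)
print(tokens)  # common words stay whole; rarer words are split into "##" sub-words

token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # every token becomes an integer ID, the first step toward numbers
```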
Next, the tokens need to be converted into vectors through a process called vectorization. Vectors are simply numeric representations of the data. Let's look at the techniques used to achieve this.
- The first technique is Word2Vec, developed by Google. It is trained on a large corpus of text and uses neural networks to learn word associations (see the sketch after this list). There are two approaches this technique takes:
- CBOW (Continuous Bag of Words), where a target word is predicted based on context.
- Skip-gram, where the context is predicted based on the word provided. What’s the outcome of either of these approaches?
Words with similar meanings are given vectors that sit close to each other in the vector space.
- The next technique is GloVe (Global Vectors for Word Representation), developed at Stanford. It mostly places vectors with similar characteristics close together, for instance, cigars and cigarettes, or king and queen.
- Another technique you can look up is BERT (Bidirectional Encoder Representations from Transformers), again developed by Google. The core outcome you can expect from this model is that it gives the same word different vectors in different contexts (like the word “smoke” we previously spoke of); a small sketch of this follows below.
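To make Word2Vec a bit more concrete, here is a rough sketch using the open-source gensim library (an illustrative assumption on my part, not the only implementation). The toy corpus is far too small to learn anything meaningful, but it shows where the CBOW versus skip-gram choice lives:

```python
# A rough Word2Vec sketch with gensim (assumed installed).
# sg=0 selects CBOW (predict the target word from its context);
# sg=1 selects skip-gram (predict the context from the target word).
from gensim.models import Word2Vec

toy_corpus = [
    ["jimmy", "was", "smoking", "a", "cigar"],
    ["he", "prepared", "a", "smoked", "barbecue", "steak"],
    ["smoke", "was", "emanating", "from", "the", "room"],
    ["she", "had", "smoky", "eyes", "for", "the", "party"],
]

cbow = Word2Vec(sentences=toy_corpus, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences=toy_corpus, vector_size=50, window=3, min_count=1, sg=1)

print(cbow.wv["cigar"][:5])                       # first few numbers of the "cigar" vector
print(skipgram.wv.most_similar("smoke", topn=3))  # nearest neighbours in the vector space
```

On a real corpus, those nearest neighbours would be genuinely related words. Pretrained GloVe vectors can be pulled into the same kind of vector space too, for example through gensim's downloader module, if you would rather not train anything yourself.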
And these are the techniques used for embedding representations. It is a crucial step, as it is the cornerstone upon which AI identifies, associates, and matches up the meanings of different words.
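To see the contextual behaviour in action, here is a hedged sketch using the transformers library and PyTorch (both assumed to be installed) that pulls out BERT's vector for the word “smoke” in two different sentences:

```python
# A hedged sketch of contextual embeddings, assuming the "transformers"
# library, PyTorch, and the "bert-base-uncased" checkpoint are available.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def smoke_vector(sentence):
    """Return BERT's contextual vector for the token 'smoke' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("smoke")]

v1 = smoke_vector("Smoke was emanating from the neighbor's room.")
v2 = smoke_vector("He sat down to smoke a cigar after the barbecue.")

# Word2Vec or GloVe would give "smoke" one fixed vector; BERT gives it two
# different vectors here because the surrounding context differs.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```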
From here, AI can also capture statistical patterns, such as the frequency of the words used, the length of sentences, and other metrics, and compare them against existing data on AI-generated and human-written content to help flag suspect passages in the provided text. Although you can't exactly see all of this happening, you can experiment with HireQuotient's AI detector and see whether it fits into your existing workflow.
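Exactly which signals a commercial detector like HireQuotient's uses isn't spelled out here, so treat the following as an illustration of the kind of surface statistics mentioned above, not as their method:

```python
# An illustrative sketch (not any particular detector's actual method) of
# simple surface statistics: word frequencies and average sentence length.
import re
from collections import Counter

def text_stats(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return {
        "top_words": Counter(words).most_common(5),
        "avg_sentence_length": len(words) / len(sentences),
    }

sample = ("Jimmy was smoking a cigar while he prepared a smoked barbecue steak. "
          "Meanwhile, he noticed smoke emanating from the neighbor's house.")
print(text_stats(sample))
```

A detector would compare features like these, and much richer ones, against what it has learned from large samples of human-written and AI-generated text.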
Keep an eye out for more news & updates on Tech Sky!