This looks to be an interesting piece on a very interesting paper! Somewhat tangentially (I'm afraid) I just wanted to comment on this para from the article's intro:
> Language is made of discrete structures, yet neural networks operate on continuous data: vectors in high-dimensional space. A successful language-processing network must translate this symbolic information into some kind of geometric representation
I was a bit surprised by another article linked here recently[1] that discusses "direct speech-to-speech translation without relying on intermediate text representation", which (if I read it correctly) works by taking frequency-domain representations of speech as input and producing frequency-domain representations of translated speech as output. That is about as close as you can get to "continuous" input and output data in the digital domain, and it calls into question (in my mind, anyhow) the assumption that discrete structures are fundamental to language processing (in humans too, for that matter).
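Just to make concrete what "continuous" means here, a rough sketch (not Translatotron's actual front end, which I believe uses mel spectrograms, but the same idea): a short-time Fourier transform turns a waveform into a dense grid of real numbers, with no discrete symbols anywhere in the pipeline. The sine wave below is just a stand-in for a speech signal.

```python
# Rough sketch of the kind of frequency-domain representation in question.
import numpy as np
from scipy.signal import stft

fs = 16_000                                # sample rate in Hz
t = np.arange(0, 1.0, 1 / fs)              # one second of "audio"
wave = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for speech

# Zxx has shape (n_freq_bins, n_frames); a spectrogram-to-spectrogram model
# consumes and produces grids like this directly.
freqs, frames, Zxx = stft(wave, fs=fs, nperseg=512)
spectrogram = np.abs(Zxx)
print(spectrogram.shape)   # e.g. (257, 64): frequency bins x time frames
```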
I don't mean to detract from the paper, which looks highly interesting; it's just that this business of taking discrete structures in language as given has been a bugbear of mine for some time now :)
1: https://ai.googleblog.com/2019/05/introducing-translatotron-...
I’m impressed by the method of mapping higher-dimensional vectors to a consistent tree representation, but I’m not sure what the take-home point is after that. The BERT embeddings are (possibly randomly) branching structures? I’m only eyeballing figure 5 here, but the BERT embeddings only approximate the dependency parse tree to about the same extent that the random trees do.
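For what it's worth, here is one rough way to turn that eyeballing into a number (my own sketch, not the paper's procedure): build a minimum spanning tree over the pairwise squared distances between word vectors and count how many of its edges coincide with the gold dependency edges; random vectors give a chance baseline. The toy parse below is made up purely for illustration.

```python
# Sketch: how well does the MST over squared embedding distances recover a
# (made-up) gold dependency tree, compared with random vectors?
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def recovered_edges(vectors):
    """Undirected edges of the MST over pairwise squared Euclidean distances."""
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dist2 = (diffs ** 2).sum(-1)
    mst = minimum_spanning_tree(dist2).toarray()
    return {tuple(sorted(map(int, e))) for e in zip(*np.nonzero(mst))}

def edge_overlap(vectors, gold_edges):
    """Fraction of gold dependency edges recovered by the MST."""
    return len(recovered_edges(vectors) & gold_edges) / len(gold_edges)

# Toy 6-word "sentence" with a made-up dependency tree (edges between word indices).
gold_edges = {(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)}

rng = np.random.default_rng(0)
random_vectors = rng.normal(size=(6, 768))   # baseline: no structure at all
print("random baseline overlap:", edge_overlap(random_vectors, gold_edges))
# Substituting real BERT vectors for the same sentence in place of
# `random_vectors` shows how much better (or not) than chance the geometry
# matches the parse.
```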
Figure 5(c) illustrates the shape of the projections for a random-branching embedding of the correct tree structure. This roughly matches the ideal Pythagorean embedding, and also the BERT embedding. Keep in mind that BERT only sees word sequences, with no explicit notion of a tree structure. In theory there are exponentially many possible parse trees for a sentence, but they are not completely arbitrary graphs; they have a context-free structure (which is why chart parsers can search them in O(N^3) time). Thus figure 5(d) is too weak a baseline, since its embeddings are picked completely at random, with no tree-based constructive process. I wish there were a figure 5(e) showing a random-branching embedding of a random parse tree, to give a sense of how much randomly embedding the right parse tree vs. randomly embedding some random parse tree influences the final result. The hard problem in parsing is finding the right tree...
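To make the 5(e) suggestion concrete, here is a toy sketch (my own construction, not the paper's code; the tree, the dimensionality, and the distortion score are all made up for illustration). It assumes the paper's definition of a Pythagorean embedding (squared Euclidean distance between nodes equals tree distance) and compares: an exact Pythagorean embedding of a "correct" tree, a random-branching embedding of that same tree, completely random vectors as in 5(d), and a random-branching embedding of a random tree as in the proposed 5(e).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # made-up dimensionality, chosen to match BERT's hidden size

def tree_distances(parent):
    """All-pairs path lengths in a tree given as a parent array (root has parent -1)."""
    n = len(parent)
    ancestors, depth = [], [0] * n
    for v in range(n):
        chain, u = [], v
        while u != -1:
            chain.append(u)
            u = parent[u]
        ancestors.append(chain)
        depth[v] = len(chain) - 1
    d = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            lca = max(set(ancestors[a]) & set(ancestors[b]), key=lambda x: depth[x])
            d[a, b] = depth[a] + depth[b] - 2 * depth[lca]
    return d

def embed(parent, edge_vector):
    """Node vector = sum of the vectors of the edges on its path from the root."""
    vecs = np.zeros((len(parent), DIM))
    for v in range(len(parent)):
        u = v
        while parent[u] != -1:
            vecs[v] += edge_vector(u)   # vector of the edge (u, parent[u])
            u = parent[u]
    return vecs

def distortion(vecs, gold):
    """Mean |squared embedding distance - tree distance| over all node pairs."""
    diffs = vecs[:, None, :] - vecs[None, :, :]
    return np.abs((diffs ** 2).sum(-1) - gold).mean()

# "Correct" toy tree over 7 words: parent[i] is the head of word i, word 0 is the root.
correct = [-1, 0, 1, 1, 0, 4, 4]
gold = tree_distances(correct)
n = len(correct)

# (a) Exact Pythagorean embedding: one orthonormal axis per edge.
basis = np.eye(DIM)
exact = embed(correct, lambda child: basis[child])

# (b) Random-branching embedding of the correct tree: each edge gets a random
#     direction of roughly unit length, so squared distances only *approximate*
#     tree distance (random high-dimensional vectors are nearly orthogonal).
edge_a = rng.normal(size=(n, DIM)) / np.sqrt(DIM)
branch_correct = embed(correct, lambda child: edge_a[child])

# (c) Figure 5(d)-style baseline: completely random vectors, no tree at all,
#     scaled to roughly unit norm so the comparison isn't just about scale.
fully_random = rng.normal(size=(n, DIM)) / np.sqrt(DIM)

# (d) The suggested figure 5(e): random-branching embedding of a *random* tree.
random_tree = [-1] + [int(rng.integers(0, v)) for v in range(1, n)]
edge_b = rng.normal(size=(n, DIM)) / np.sqrt(DIM)
branch_random = embed(random_tree, lambda child: edge_b[child])

for name, vecs in [("exact Pythagorean, correct tree", exact),
                   ("random branching, correct tree ", branch_correct),
                   ("fully random vectors (5d)      ", fully_random),
                   ("random branching, random tree  ", branch_random)]:
    print(name, "->", round(distortion(vecs, gold), 2))
```

The gap between the last two lines of output is exactly the question: how much of the geometry comes from embedding *the right* tree, as opposed to embedding *some* tree.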
A huge question in NLP is how the discrete symbolic structures that we think characterize natural language can be embedded in high-dimensional continuous space. This paper proposes a solution that justifies some previously ad hoc results and could form the basis for better embedding methods in the future.